This repository has been archived by the owner on Mar 20, 2023. It is now read-only.

Pool resize with NC24rs_v3 fails to find PKEYS during nodeprep #360

Open
themorey opened this issue Nov 4, 2020 · 1 comment

themorey commented Nov 4, 2020

Problem Description

Creating a multi-instance pool with NC24rs_v3 fails during node prep because shipyard_nodeprep.sh reads the PKEYS from a hardcoded mlx5_0 device (lines 1609-1612):

export_ib_pkey()
{
    key0=$(cat /sys/class/infiniband/mlx5_0/ports/1/pkeys/0)
    key1=$(cat /sys/class/infiniband/mlx5_0/ports/1/pkeys/1)

The NC24rs_v3 has a ConnectX-3 card, which is identified as mlx4_0, not mlx5_0. Manually modifying shipyard_nodeprep.sh each time a pool is created works around the issue.
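
One possible way to avoid the hardcoded device name is to look up whichever device is present under /sys/class/infiniband and read the PKEYS from it. The following is only a sketch of that idea, not the actual Batch Shipyard code: `find_ib_device` and `read_ib_pkeys` are hypothetical names, and the sketch assumes the first device that exposes pkeys on port 1 is the one to use.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; these are NOT part of shipyard_nodeprep.sh.

# Print the name of the first InfiniBand device under the given sysfs
# directory (default: /sys/class/infiniband), e.g. mlx4_0 or mlx5_0.
# Returns non-zero if no device exposes pkeys on port 1.
find_ib_device() {
    local ib_root="${1:-/sys/class/infiniband}"
    local dev
    for dev in "$ib_root"/*; do
        # Accept only entries that have a readable pkey on port 1.
        if [ -r "$dev/ports/1/pkeys/0" ]; then
            basename "$dev"
            return 0
        fi
    done
    return 1
}

# Read the first two pkeys from the detected device into key0 and key1,
# mirroring the two reads quoted above but without a fixed device name.
read_ib_pkeys() {
    local ib_root="${1:-/sys/class/infiniband}"
    local dev
    dev=$(find_ib_device "$ib_root") || return 1
    key0=$(cat "$ib_root/$dev/ports/1/pkeys/0")
    key1=$(cat "$ib_root/$dev/ports/1/pkeys/1")
}
```

With something like this, the same script would pick up mlx4_0 on NC24rs_v3 and mlx5_0 on newer sizes. A node with more than one device would need a more specific selection rule.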

Batch Shipyard Version

3.9.1 (Mac)

Steps to Reproduce

Resize a multi-instance pool containing NC24rs_v3 and wait for it to fail.

Expected Results

Node finds the PKEYS and boots normally without intervention.

Actual Results

Manual intervention is required each time a pool is created or modified.

Redacted Configuration

pool_specification:
  id: arvinas-relion-pool-NCv3
  vm_configuration:
    platform_image:
      offer: CentOS-HPC
      publisher: OpenLogic
      sku: '7.7'
      version: '7.7.2020062600'
  vm_count:
    dedicated: 0
    low_priority: 0
  vm_size: STANDARD_NC24rs_v3
  autoscale:
    evaluation_interval: 00:05:00
    scenario:
      name: active_tasks
      maximum_vm_count:
        dedicated: 4
        low_priority: 4
      maximum_vm_increment_per_evaluation:
        dedicated: -1
        low_priority: -1
      bias_node_type: low_priority
  inter_node_communication_enabled: true
  virtual_network:
    arm_subnet_id: /subscriptions/{sub}/resourceGroups/{RG}/providers/Microsoft.Network/virtualNetworks/{Vnet}/subnets/{sn}
  ssh:
    username: shipyard

themorey commented Nov 5, 2020

It looks like the environment variable SHIPYARD_USER_CMD in the file .shipyard.envlist also hardcodes UCX_NET_DEVICES=mlx5_0:1. This causes multi-node MPI jobs to fail on Gen1 VMs that have mlx4 devices.
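
A similar approach might work for the UCX setting: build the `device:port` value from whatever device the node actually has instead of fixing it to mlx5_0:1. This is a rough sketch only. `ucx_net_device` is a hypothetical helper, it assumes the first device with the requested port is the right one, and how the value would be fed into SHIPYARD_USER_CMD is not shown because that part of Batch Shipyard is not quoted here.

```shell
#!/usr/bin/env bash
# Hypothetical helper; NOT part of Batch Shipyard.

# Print a UCX_NET_DEVICES-style value ("<device>:<port>") for the first
# InfiniBand device under the given sysfs directory that has the given port.
# Arguments: sysfs root (default /sys/class/infiniband), port (default 1).
ucx_net_device() {
    local ib_root="${1:-/sys/class/infiniband}"
    local port="${2:-1}"
    local dev
    for dev in "$ib_root"/*; do
        if [ -d "$dev/ports/$port" ]; then
            printf '%s:%s\n' "$(basename "$dev")" "$port"
            return 0
        fi
    done
    return 1
}
```

Usage on a node would be along the lines of `export UCX_NET_DEVICES="$(ucx_net_device)"`, which would be expected to yield mlx4_0:1 on VMs with mlx4 devices.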
