To offer solution-level interoperability support for HPC and AI configurations based on the Lenovo ThinkSystem portfolio and OEM components,
Lenovo_EveryScale extensively tests the components and their combinations.
The extensive testing results in an EveryScale_Best_Recipes release of software and firmware levels.
Lenovo warrants EveryScale_Best_Recipes components to work seamlessly together as a fully integrated data center solution instead of a collection of individual components at the time of implementation.
The SD665_V3 has a water-cooled NVIDIA 2-Port PCIe Gen5 x16 InfiniBand Adapter (SharedIO),
see ThinkSystem NVIDIA ConnectX-7 NDR200 InfiniBand QSFP112 Adapters.
The adapter is located in the right-hand SD665_V3 node and connects both servers in the tray.
https://datacentersupport.lenovo.com/de/en/solutions/ht510888
There is important information regarding SharedIO for older SD650 servers in the article
Considerations when using ThinkSystem SD650, SD650 V2, SD650 V3 and ConnectX-6 HDR, ConnectX-7 NDR SharedIO.
The issues have apparently been resolved in the SD665_V3 system.
Please note that several InfiniBand tools such as ibnetdiscover
fail with an error message when executed on the SD665_V3 “auxiliary” (left-hand) node.
Such tools must be executed on the “primary” (right-hand) node (private communication with a Lenovo support person).
For debugging purposes, Mellanox provides a linux-sysinfo-snapshot tool which
is designed to take a snapshot of all the configuration and relevant information on the server and Mellanox’s adapters.
Mellanox InfiniBand and Ethernet software and firmware (MLNX_OFED / MLNX_EN) for Lenovo must be downloaded from the special NVIDIA_Lenovo_EveryScale site.
Click on the Firmware tab to download the latest firmware.
Older firmware can be downloaded from the Lenovo_Archive.
The Lenovo Mellanox adapters’ firmware must be updated with the special Lenovo firmware executable, for example:
mlxfwmanager_LES_24B_OFED-24.10-1_build5
Adding the --query flag will display firmware versions.
WARNING:
There seems to be an undocumented restriction that a node Virtual_Reseat (performed virtually using the SMM2 module)
is required whenever the SharedIO adapter firmware is updated!
Both the left and right nodes of a tray have to be reseated simultaneously!
The node Virtual_Reseat may be performed in several alternative ways:
This command displays the NVIDIA/Mellanox firmware version:
ibv_devinfo | grep fw_ver
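When the command above is run on many nodes at once, the output can be condensed into a count per firmware version, which makes a node that is still running old firmware easy to spot. The following is a minimal sketch, not a Lenovo or NVIDIA tool: the function name fw_summary is hypothetical, and it assumes only that ibv_devinfo prints a line containing fw_ver: followed by the version, as the grep above implies.

```shell
#!/bin/sh
# fw_summary: read ibv_devinfo output on stdin and print each distinct
# firmware version preceded by the number of adapters reporting it.
# Hypothetical helper. Lines may carry a "node: " prefix, as added by clush.
fw_summary() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "fw_ver:") print $(i + 1) }' |
        sort | uniq -c
}

# Local use, only if ibv_devinfo is installed on this host:
if command -v ibv_devinfo >/dev/null 2>&1; then
    ibv_devinfo | fw_summary
fi
```

On a cluster the same filter could be fed from ClusterShell, for example clush -w <nodelist> ibv_devinfo | fw_summary. More than one output line would indicate that the adapters are not all at the same firmware level.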
This standard Mellanox drivers tool also reports firmware versions:
Updating networking firmware from a repository folder XXX/ can be done from the XCC GUI, or by using a OneCLI command like this example:
onecli update flash --nocompare --includeid mlnx-lnvgy_fw_nic_cx-j9m3u-0302_anyos_comp --dir XXX/ --log=5 -N --output /tmp/logs
This will loop over all firmware packages in the repository and try to apply them one by one.
To select only a specific firmware family: TBD
It is really cumbersome to update the SharedIO Mellanox InfiniBand firmware!
If you reboot the right-hand node or update the Mellanox adapters’ firmware,
the InfiniBand network interface on the left-hand node will disappear :-(
Therefore we have developed the following procedure:
Firstly, the right-hand (SharedIO Primary) nodes which hold the physical adapter are updated.
Secondly, the left-hand (SharedIO Auxiliary) nodes which only have a cable connection to the physical adapter are updated.
Finally, Virtual Reseat operations must be performed to remove power completely from both nodes and thus reinitialize the adapters.
Virtual Reseat is a feature of the System Management Module (SMM2) which simulates physically removing the node from AC power and reconnecting the node to AC power.
In the SMM2 Enclosure Rear Overview page the entire enclosure can be reseated at once.
The Lenovo EveryScale_Best_Recipes lists the latest available firmware and software versions.
When managing the left-hand and right-hand nodes in a Lenovo compute tray separately,
the ClusterShell_tool comes in handy for selecting subsets of nodes.
Let us assume that nodes are named numerically so that left-hand nodes have odd numbers,
whereas right-hand nodes have even numbers, for example:
e001,e002,...,e023,e024 # left,right,...,left,right
The clush command can now perform commands separately:
clush -bw e[001-023/2] echo I am a left-hand node
clush -bw e[002-024/2] echo I am a right-hand node
Unfortunately, Slurm doesn’t recognize this syntax of node number increments.
Here you can use the ClusterShell_tool’s command nodeset to print Slurm compatible nodelists to be used as Slurm command arguments:
$ nodeset -f e[001-024/2]
e[001,003,005,007,009,011,013,015,017,019,021,023]
$ nodeset -f e[002-024/2]
e[002,004,006,008,010,012,014,016,018,020,022,024]
An example where we assign nodelists to variables:
$ export nodelist='e[001-024]'
$ export left=$(nodeset -f 'e[001-024/2]')
$ export right=$(nodeset -f 'e[002-024/2]')
$ sinfo -n $left
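Where ClusterShell is not installed, the same odd/even split can be approximated in plain shell. This is a minimal sketch, assuming the naming scheme of the example above (prefix e and three-digit, zero-padded node numbers). The function name node_range is hypothetical, and the numeric arguments must be given without leading zeros, since the shell would otherwise read them as octal.

```shell
#!/bin/sh
# node_range PREFIX FIRST LAST STEP
# Print a comma-separated list of node names with three-digit numbers,
# e.g. "node_range e 1 5 2" prints e001,e003,e005.
# Hypothetical helper; FIRST, LAST and STEP are plain decimal integers.
node_range() {
    prefix=$1
    i=$2
    last=$3
    step=$4
    out=""
    while [ "$i" -le "$last" ]; do
        # Append the next name, inserting a comma only after the first entry
        out="${out:+$out,}$(printf '%s%03d' "$prefix" "$i")"
        i=$((i + step))
    done
    printf '%s\n' "$out"
}

left=$(node_range e 1 24 2)    # odd numbers: left-hand (SharedIO Auxiliary) nodes
right=$(node_range e 2 24 2)   # even numbers: right-hand (SharedIO Primary) nodes
echo "left:  $left"
echo "right: $right"
```

The resulting comma-separated lists are accepted both by clush -w and by Slurm commands such as sinfo -n.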
All trays/pairs of SD665_V3 nodes must be upgraded together because of the SharedIO adapter.
Make a Slurm system reservation of the nodes or drain the nodes in Slurm,
so they don’t run any jobs before you proceed to the next step.
It is a good idea to update Linux OS software (including kernel), UEFI and XCC/BMC firmware when the nodes are down anyway.
You may find the update.sh script useful for automating this process.
First update the right-hand (SharedIO Primary) nodes fully, possibly using the update.sh script.
Notes:
Do not update or shut down the left-hand nodes!
You must wait until the physical adapters in the right-hand (SharedIO Primary) nodes have been updated.
Remember that you can select the left-hand and right-hand nodenames (<nodelist>) as shown in the above section using the nodeset command.
Update all OS software and firmware, including the Mellanox firmware update mlxfwmanager_LES_24B_OFED-24.10-1_build5 (or newer).
Reboot the right-hand nodes, and then check that OS kernel, UEFI, and XCC/BMC have the correct versions, for example:
clush -bw <nodelist> 'uname -r; dmidecode -s bios-version; ipmitool bmc info|grep Firmware'
Check the Mellanox firmware version using the tool discussed above:
clush -bw <nodelist> <some-path>/mlxfwmanager_LES_24B_OFED-24.10-1_build5 --query
Check that you have Status: Up to date.
The Mellanox FW (Running) firmware is probably still outdated at this stage, and will remain so until you have performed the Virtual Reseat operations!
Then update the left-hand (SharedIO Auxiliary) nodes fully, following the same steps as for the right-hand nodes above.
After both right-hand and left-hand nodes have been successfully updated, except for the Mellanox FW (Running) firmware,
shut down all the nodes:
clush -bw <nodelist> shutdown -h now
Now perform a Virtual Reseat of all the nodes using the Lenovo System Management Module 2 (SMM2) web GUI.
This will activate the new Mellanox firmware when nodes are powered up again.
Note: If any nodes have errors (PCIe adapter, BMC, etc.),
it is recommended to shut down the nodes and physically reseat the tray.
Our experience is that physically disconnecting the tray from the DW612S chassis is more thorough than Lenovo’s recommended Virtual Reseat.
Power up all the right-hand (SharedIO Primary) nodes.
If using IPMI this may be performed using the power_ipmi script, for example:
power_ipmi -r e002,e004,e006,e008
When the right-hand (SharedIO Primary) nodes are up again,
check the Mellanox firmware version:
mlxfwmanager_LES_24B_OFED-24.10-1_build5 --query
If the Current (Running) firmware is the same as the installed Available firmware, the upgrade was successful :-)
Power up all the left-hand (SharedIO Auxiliary) nodes in the same way as the right-hand nodes.
Check the Current (Running) firmware on the left-hand nodes with the same --query command as for the right-hand nodes.
If all firmware is now up to date, you may return the nodes to Slurm production.
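The draining step at the start of the procedure and the return to production at the end might be scripted as follows. This is a minimal sketch under stated assumptions: the run wrapper, the DRY_RUN variable and the function names are hypothetical, and the node list is the example one used earlier. Only the scontrol update command with its NodeName, State and Reason parameters is standard Slurm. With DRY_RUN left at its default the commands are printed, not executed.

```shell
#!/bin/sh
# Hypothetical wrapper: print the command when DRY_RUN=1 (the default here),
# otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Drain nodes so that no new jobs start on them; running jobs are left to finish.
# $1 = Slurm node list, $2 = reason shown by sinfo
drain_nodes() {
    run scontrol update NodeName="$1" State=DRAIN Reason="$2"
}

# Return nodes to production after all firmware has been verified.
resume_nodes() {
    run scontrol update NodeName="$1" State=RESUME
}

nodelist='e[001-024]'          # example node list, both halves of every tray
drain_nodes "$nodelist" 'SharedIO firmware update'
resume_nodes "$nodelist"
```

Setting DRY_RUN=0 executes the commands, which requires Slurm administrator privileges. Both nodes of every tray are drained together because they share one adapter.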
See the Lenovo BIOS settings common to servers page.
See the Lenovo XClarity (XCC) BMC page.
There is a document
Lenovo ThinkSystem SR645 Recommended UEFI and OS settings for Lenovo Scalable Infrastructure (LeSI)
which recommends:
The OpenFabrics Enterprise Distribution (OFED) is open-source software for RDMA and kernel bypass applications,
as provided by the OpenFabrics Alliance.
Mellanox provides some information about Inbox_drivers from various OS vendors,
but it is not stated whether they can be used in place of the drivers from NVIDIA/Mellanox described below.
IMPORTANT: The NVIDIA Red Hat Enterprise Linux (RHEL) 8.10 Driver Documentation
has the statement:
Warning
ConnectX-7 is only supported as technical preview (i.e., the feature is not fully supported for production).
Since the SD665_V3 nodes have ConnectX-7 adapters,
these are NOT SUPPORTED by the RHEL Inbox_drivers at present!
NVIDIA offers a Linux MLNX OFED repository which is enabled by:
Install key:
rpm --import https://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox
Add the desired repo, for example:
cd /etc/yum.repos.d/
wget https://linux.mellanox.com/public/repo/mlnx_ofed/latest/rhel8.10/mellanox_mlnx_ofed.repo
dnf clean all
Install driver packages: TBD?
Install these prerequisite packages:
dnf -y install libibverbs rdma libmlx4 libibverbs-utils infiniband-diags librdmacm librdmacm-utils ibacm
dnf -y install tk gcc-gfortran kernel-modules-extra
For the Mellanox InfiniBand adapters it is recommended to download the .tgz file from
Mellanox OpenFabrics Enterprise Distribution for Linux (MLNX_OFED).
Unpack the tarball and run the installer, for example:
tar xzf MLNX_OFED_LINUX-24.01-0.3.3.1-rhel8.9-x86_64.tgz
cd MLNX_OFED_LINUX-24.01-0.3.3.1-rhel8.9-x86_64
./mlnxofedinstall
The installer script has some options:
./mlnxofedinstall --help
./mlnxofedinstall -q # Set quiet - no messages will be printed
yes | ./mlnxofedinstall # Answer yes to all questions
The installer attempts to make firmware updates, but we may experience this warning:
Attempting to perform Firmware update...
The firmware for this device is not distributed inside Mellanox driver: 42:00.0 (PSID: LNV0000000049)
To obtain firmware for this device, please contact your HW vendor.
Failed to update Firmware.
so it may be a good idea to add this flag and omit firmware updates:
./mlnxofedinstall --without-fw-update
Installation instructions are in the User Manual from the Mellanox documentation.
Verify that the Mellanox driver RPMs have been installed and the openibd service started:
rpm -qa | grep mlnx
systemctl status openibd
Verify the installed OFED package name and version with the ofed_info -s command.
If your kernel version does not match any of the offered pre-built RPMs,
you can add your kernel version by using the mlnx_add_kernel_support.sh script located inside the MLNX_OFED package.
Notices:
On Red Hat and SLES distributions with an errata kernel installed there is no need to use the mlnx_add_kernel_support.sh script.
The regular installation can be performed, and the weak-updates mechanism will create symbolic links to the MLNX_OFED kernel modules.
OFED software includes kernel modules for the running kernel, and these must be rebuilt if the kernel is upgraded!