Lenovo SD665_V3 server
This page contains information about Lenovo SD665_V3 servers deployed in our cluster. The Lenovo ThinkSystem SD665_V3 is a 2-socket ½U server that features the AMD EPYC 9004 “Genoa” family of processors.
The nodes are housed in the upgraded ThinkSystem DW612S enclosure with the SMM2 management module. See the SMM2 page for SMM2 functions and the IPMItool commands used to manage it.
To offer solution-level interoperability support for HPC and AI configurations based on the Lenovo ThinkSystem portfolio and OEM components, Lenovo EveryScale extensively tests the components and their combinations. The extensive testing results in a Best Recipe release of software and firmware levels. Lenovo warrants Best Recipe components to work seamlessly together as a fully integrated data center solution instead of a collection of individual components at the time of implementation.
Documentation and software
Lenovo provides SD665_V3 information and downloads:
There is a Product Home page for downloads.
The EasyBuild software module OpenMPI seems to have issues with the Mellanox libraries. Setting these variables may be a workaround:
export OMPI_MCA_btl='^openib,ofi'
export OMPI_MCA_mtl='^ofi'
Booting and BIOS configuration
See the Lenovo BIOS settings common to servers page.
See the Lenovo XClarity (XCC) BMC page.
There is a document Lenovo ThinkSystem SR645 Recommended UEFI and OS settings for Lenovo Scalable Infrastructure (LeSI) which recommends:
For best performance, set the operating mode to Maximum Performance first, then switch it to Custom Mode.
OFED software and drivers
The OpenFabrics Enterprise Distribution (OFED) is open-source software for RDMA and kernel bypass applications, as provided by the OpenFabrics Alliance. Mellanox provides some information about inbox drivers from various OS vendors, but it is not stated whether they can be used in place of the drivers from Mellanox described below.
NVIDIA offers a Linux MLNX OFED repository which is enabled by:
Install key:
rpm --import https://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox
Add the desired repo, for example:
cd /etc/yum.repos.d/
wget https://linux.mellanox.com/public/repo/mlnx_ofed/latest/rhel8.10/mellanox_mlnx_ofed.repo
dnf clean all
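The example URL hard-codes rhel8.10. The following is a minimal sketch for deriving the URL from the node's own release. It assumes that the repository directories follow the rhel<VERSION_ID> pattern seen in the example, which has not been verified for every release. The helper name mlnx_repo_url is hypothetical.

```shell
# Build the MLNX_OFED repo file URL for the OS release described by an
# os-release style file (default: /etc/os-release).
# Assumption: the repo directory is named "rhel<VERSION_ID>", as in the
# rhel8.10 example; check that the URL exists before using it.
mlnx_repo_url() {
    os_release="${1:-/etc/os-release}"
    # Source the file in a subshell so its variables do not leak into
    # the calling shell.
    version_id=$( . "$os_release" && printf '%s' "$VERSION_ID" )
    printf 'https://linux.mellanox.com/public/repo/mlnx_ofed/latest/rhel%s/mellanox_mlnx_ofed.repo\n' "$version_id"
}
```

For example, wget "$(mlnx_repo_url)" would fetch the repo file matching the running OS, provided such a directory is actually published.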
Install driver packages: TBD?
NVIDIA’s Red Hat Enterprise Linux (RHEL) Inbox Driver documentation has the statement:
Warning
ConnectX-7 is only supported as technical preview (i.e., the feature is not fully supported for production).
Since the SD665_V3 nodes have ConnectX-7 adapters, these are NOT SUPPORTED for production use by the RHEL inbox driver at present!
Install these prerequisite packages:
dnf -y install libibverbs rdma libmlx4 libibverbs-utils infiniband-diags librdmacm librdmacm-utils ibacm
dnf -y install tk gcc-gfortran kernel-modules-extra
For the Mellanox InfiniBand adapters it is recommended to download the .tgz file from Mellanox OpenFabrics Enterprise Distribution for Linux (MLNX_OFED). Unpack the tarball and run the installer, for example:
tar xzf MLNX_OFED_LINUX-24.01-0.3.3.1-rhel8.9-x86_64.tgz
cd MLNX_OFED_LINUX-24.01-0.3.3.1-rhel8.9-x86_64
./mlnxofedinstall
The installer script has some options:
./mlnxofedinstall --help
./mlnxofedinstall -q # Set quiet - no messages will be printed
yes | ./mlnxofedinstall # Answer yes to all questions
The installer attempts to make firmware updates, but we may experience this warning:
Attempting to perform Firmware update...
The firmware for this device is not distributed inside Mellanox driver: 42:00.0 (PSID: LNV0000000049)
To obtain firmware for this device, please contact your HW vendor.
Failed to update Firmware.
so it may be a good idea to add this flag and omit firmware updates:
./mlnxofedinstall --without-fw-update
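The unpack and install steps can be combined into an unattended run. The sketch below only prints the commands, so they can be reviewed before being executed. It assumes that the -q and --without-fw-update options may be used together, and that the tarball unpacks into a directory named after the file, as in the example. The helper name mlnx_install_plan is hypothetical.

```shell
# Print (do not run) the commands for an unattended MLNX_OFED install
# from the given .tgz file.
# Assumption: the tarball unpacks into a directory with the same name
# minus the .tgz suffix.
mlnx_install_plan() {
    tarball="$1"
    dir=$(basename "$tarball" .tgz)
    printf 'tar xzf %s\n' "$tarball"
    printf 'cd %s\n' "$dir"
    # -q: quiet, --without-fw-update: skip the firmware update attempt
    printf './mlnxofedinstall -q --without-fw-update\n'
}
```

After reviewing the output, it can be piped to a shell, for example: mlnx_install_plan MLNX_OFED_LINUX-24.01-0.3.3.1-rhel8.9-x86_64.tgz | sh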
Installation instructions are in the User Manual from the Mellanox documentation.
Verify that the Mellanox driver RPMs have been installed and the openibd service started:
rpm -qa | grep mlnx
systemctl status openibd
Verify the installed OFED package name and version:
ofed_info -s
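The verification steps can be wrapped in a small function for use on many nodes. This is a minimal sketch: the function name check_ofed is hypothetical, and it assumes that systemctl is-active reports the openibd service in the same way as systemctl status does.

```shell
# Report the state of an MLNX_OFED installation.
# Returns non-zero if ofed_info is missing or openibd is not active.
check_ofed() {
    if command -v ofed_info >/dev/null 2>&1; then
        echo "OFED version: $(ofed_info -s)"
    else
        echo "ofed_info not found: MLNX_OFED is probably not installed"
        return 1
    fi
    if systemctl is-active --quiet openibd; then
        echo "openibd: active"
    else
        echo "openibd: NOT active"
        return 1
    fi
}
```

The return code makes it usable in scripts, for example: check_ofed || echo "node needs attention"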
If your kernel version does not match any of the offered pre-built RPMs, you can add your kernel version by using the mlnx_add_kernel_support.sh script located inside the MLNX_OFED package.
Notices:
On Red Hat and SLES distributions with an errata kernel installed, there is no need to use the mlnx_add_kernel_support.sh script. The regular installation can be performed, and the weak-updates mechanism will create symbolic links to the MLNX_OFED kernel modules.
OFED software includes kernel modules for the running kernel, and these must be rebuilt if the kernel is upgraded!
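After a kernel upgrade it can be useful to see which kernel a driver module was built for. The sketch below compares the vermagic string of a module with the running kernel. The function names are hypothetical, and mlx5_core is used as the example because it is the driver for ConnectX-7 adapters. A mismatch is only a hint: where the weak-updates mechanism is in use, a module built for an older kernel may still load correctly.

```shell
# Extract the kernel release from a vermagic string such as
# "4.18.0-553.el8_10.x86_64 SMP mod_unload modversions".
vermagic_kernel() {
    # Strip everything from the first space onwards.
    printf '%s\n' "${1%% *}"
}

# Succeed if the module named in $1 was built for the running kernel.
module_matches_kernel() {
    built_for=$(vermagic_kernel "$(modinfo -F vermagic "$1")")
    [ "$built_for" = "$(uname -r)" ]
}
```

Usage: module_matches_kernel mlx5_core || echo "mlx5_core was built for another kernel"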