
GPU problems

Use this information to resolve problems that are related to GPUs in the compute tray.

Use the following commands to check the GPU health status. Before you begin, update the GPU driver, which includes the required utilities. The latest driver is available on the Drivers and Software download website for Lenovo NVIDIA GB300 NVL72.

For more information about System Management Interface (SMI), see NVIDIA System Management Interface.

  • nvidia-smi

    Run the nvidia-smi command to confirm that all four GPUs are listed as online.

    Figure 1. nvidia-smi
  • nvidia-smi topo -p2p n

    Run the nvidia-smi topo -p2p n command to display the internal connection status between GPUs within a single compute tray.
    Note
    An Unknown status for any GPU link indicates a potential hardware issue with the GPU, NVLink switch tray, or cable cartridge.
    Figure 2. nvidia-smi topo -p2p n
  • nvidia-smi -q --id=1 -f <output file name>

    Run the nvidia-smi -q --id=1 -f <output file name> command to export the inventory information for the GPU selected by --id (GPU 1 in this example).

    Replace <output file name> with the path of the file in which to store the output. For example: nvidia-smi -q --id=1 -f /tmp/queryoam1.txt.

    Figure 3. nvidia-smi -q --id=1 -f <output file name>
    ==============NVSMI LOG==============

    Timestamp : Mon Mar 30 02:14:58 2026
    Driver Version : 580.105.08
    CUDA Version : 13.0

    Attached GPUs : 4
    GPU 00000009:06:00.0
        Product Name : NVIDIA GB300
        Product Brand : NVIDIA
        Product Architecture : Blackwell
        Display Mode : Requested functionality has been deprecated
        Display Attached : No
        Display Active : Disabled
        Persistence Mode : Enabled
        Addressing Mode : ATS
        MIG Mode
            Current : Disabled
            Pending : Disabled
        Accounting Mode : Disabled
        Accounting Mode Buffer Size : 4000
        Driver Model
            Current : N/A
            Pending : N/A
        Serial Number : 1652725032738
        GPU UUID : GPU-29255b40-4ad2-6e15-a7e2-634503314135
        GPU PDI : 0xca89506c512681b3
        Minor Number : 1
        VBIOS Version : 97.10.4A.00.1F
        MultiGPU Board : No
        Board ID : 0x90600
        Board Part Number : 900-2G548-0081-000
        GPU Part Number : 31C2-893-A1
        FRU Part Number : N/A
        Platform Info
            Chassis Serial Number : 1822725187334
            Slot Number : 26
            Tray Index : 16
            Host ID : 1
            Peer Type : Switch Connected
            Module Id : 1
            GPU Fabric GUID : 0xca89506c512681b3
        Inforom Version
            Image Version : G548.0301.00.03
            OEM Object : 2.1
            ECC Object : 7.16
            Power Management Object : N/A
        Inforom BBX Object Flush
            Latest Timestamp : 2026/03/29 08:57:08.426
            Latest Duration : 56215 us
        GPU Operation Mode
            Current : N/A
            Pending : N/A
        GPU C2C Mode : Enabled
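
The three checks above can also be scripted. The following Python sketch is illustrative only and is not part of the Lenovo or NVIDIA tooling. All function names (run_nvidia_smi, list_gpus, non_ok_links, parse_query_log) are hypothetical helpers introduced here. It makes several assumptions: it uses nvidia-smi -L (the list-GPUs option) in place of the plain nvidia-smi table because its one-line-per-GPU output is easier to parse; it assumes the topo matrix marks healthy links as OK and the diagonal as X, and treats every other cell as needing attention; and it assumes the query file keeps the indentation that nvidia-smi writes, which is used to rebuild the section nesting.

```python
import re
import subprocess

EXPECTED_GPUS = 4  # from the text above: four GPUs are expected online

_GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: (\S+)\)\s*$")
_GPU_LABEL = re.compile(r"^GPU\d+$")
_KEY_VALUE = re.compile(r"\s+:\s+")


def run_nvidia_smi(*args):
    """Run nvidia-smi with the given arguments and return its stdout."""
    result = subprocess.run(
        ["nvidia-smi", *args], capture_output=True, text=True, check=True
    )
    return result.stdout


def list_gpus(listing):
    """Parse `nvidia-smi -L` output into (index, name, uuid) tuples."""
    gpus = []
    for line in listing.splitlines():
        match = _GPU_LINE.match(line)
        if match:
            gpus.append((int(match.group(1)), match.group(2), match.group(3)))
    return gpus


def non_ok_links(topo_output):
    """Return (row, column, status) for every matrix cell that is not X or OK.

    Assumption: the matrix has a header row of GPU labels followed by one
    row per GPU; legend lines do not start with a GPU label and are skipped.
    """
    header = None
    problems = []
    for line in topo_output.splitlines():
        tokens = line.split()
        if not tokens or not _GPU_LABEL.match(tokens[0]):
            continue
        if all(_GPU_LABEL.match(token) for token in tokens):
            header = tokens  # column labels
            continue
        if header is None:
            continue
        for column, status in zip(header, tokens[1:]):
            if status not in ("X", "OK"):
                problems.append((tokens[0], column, status))
    return problems


def parse_query_log(text):
    """Flatten an `nvidia-smi -q` log into {"Section/Sub/Key": value}."""
    fields = {}
    sections = []  # stack of (indent, section name)
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("="):
            continue  # skip blank lines and the NVSMI LOG banner
        indent = len(line) - len(line.lstrip())
        while sections and sections[-1][0] >= indent:
            sections.pop()  # leave sections at the same or deeper level
        parts = _KEY_VALUE.split(stripped, maxsplit=1)
        if len(parts) == 2:
            path = [name for _, name in sections] + [parts[0]]
            fields["/".join(path)] = parts[1]
        else:
            sections.append((indent, stripped))  # a heading such as "MIG Mode"
    return fields
```

On a compute tray with the driver installed, the helpers could be fed with list_gpus(run_nvidia_smi("-L")), non_ok_links(run_nvidia_smi("topo", "-p2p", "n")), and parse_query_log applied to the contents of the exported file, for example /tmp/queryoam1.txt. A GPU count other than EXPECTED_GPUS, or a non-empty result from non_ok_links, would then be a prompt to inspect the hardware as described in the note above.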