GPU problems
Use this information to resolve problems that are related to GPUs.
Health check for GPUs
Use one of the following utilities to check the GPU health status. Before you begin, update the GPU driver, which includes these utilities. The latest driver is available on the Drivers and Software download website for ThinkSystem SR680a V4.
For more information about the System Management Interface (SMI), see NVIDIA System Management Interface.
nvidia-smi
Run the nvidia-smi utility to display the eight GPUs online.
Figure 1. nvidia-smi
nvidia-smi -L
Run the nvidia-smi -L utility to list the eight online GPUs with their UUIDs.
Figure 2. nvidia-smi -L
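The listing can also be checked by a script. The following is a minimal sketch, not part of the product documentation: it assumes each GPU line has the form "GPU <index>: <name> (UUID: <uuid>)", and the function names, the sample text, and its placeholder UUIDs are hypothetical.

```python
import re

# Assumed line format of `nvidia-smi -L`: "GPU <index>: <name> (UUID: <uuid>)"
GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: ([^)]+)\)$")

EXPECTED_GPUS = 8  # this section describes a system with eight GPUs


def parse_gpu_list(text):
    """Return (index, name, uuid) tuples for every GPU line in the listing."""
    gpus = []
    for line in text.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            gpus.append((int(match.group(1)), match.group(2), match.group(3)))
    return gpus


def missing_gpu_indexes(gpus, expected=EXPECTED_GPUS):
    """Return the indexes in range(expected) that do not appear in the listing."""
    present = {index for index, _name, _uuid in gpus}
    return [i for i in range(expected) if i not in present]


# Hypothetical listing with placeholder names and UUIDs, for illustration only.
SAMPLE = """\
GPU 0: Example GPU (UUID: GPU-00000000-0000-0000-0000-000000000000)
GPU 1: Example GPU (UUID: GPU-00000000-0000-0000-0000-000000000001)
"""

print("Missing GPU indexes:", missing_gpu_indexes(parse_gpu_list(SAMPLE)))
```

On a live system, the text could instead come from subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout. An empty result from missing_gpu_indexes means that all eight GPUs were listed.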
nvidia-smi -q --id=1 -f <output file name>
Run the nvidia-smi -q --id=1 -f <output file name> utility to export GPU inventory information.
Replace <output file name> with the path of the file in which to store the output. For example: nvidia-smi -q --id=1 -f /tmp/queryoam1.txt.
Figure 3. nvidia-smi -q --id=1 -f <output file name>
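To export the inventory of every GPU rather than only one, the same command can be generated once per GPU ID. The sketch below only builds and prints the command lines. It assumes GPU IDs 0 through 7 and reuses the /tmp/queryoam<id>.txt naming from the example above; the function names are hypothetical.

```python
import shlex


def inventory_command(gpu_id, directory="/tmp"):
    """Build the argument list that exports the inventory of one GPU to a file."""
    output_file = f"{directory}/queryoam{gpu_id}.txt"
    return ["nvidia-smi", "-q", f"--id={gpu_id}", "-f", output_file]


def inventory_commands(gpu_count=8, directory="/tmp"):
    """Build one export command per GPU ID, assuming IDs 0 to gpu_count - 1."""
    return [inventory_command(gpu_id, directory) for gpu_id in range(gpu_count)]


# Print the commands so that they can be reviewed before they are run.
for command in inventory_commands():
    print(shlex.join(command))
```

Each argument list can be passed unchanged to subprocess.run on a system where the driver is installed.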
nvidia-smi --id=0 -q -d ECC,PAGE_RETIREMENT
Run the nvidia-smi --id=0 -q -d ECC,PAGE_RETIREMENT utility to display the ECC (Error Checking and Correction) error counts and the status of retired pages for GPU 0.
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable Parity : 0
SRAM Uncorrectable SEC-DED : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable Parity : 0
SRAM Uncorrectable SEC-DED : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
SRAM Threshold Exceeded : No
Aggregate Uncorrectable SRAM Sources
SRAM L2 : 0
SRAM SM : 0
SRAM Microcontroller : 0
SRAM PCIE : 0
SRAM Other : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
nvidia-smi pci --getErrorCounters
Run the nvidia-smi pci --getErrorCounters utility to display error counters of the eight GPUs.
Figure 4. nvidia-smi pci --getErrorCounters
nvidia-smi pci --getErrorCounters --id=<id number>
Run the nvidia-smi pci --getErrorCounters --id=<id number> utility to display error counters of a specific GPU.
Replace <id number> with the ID number of the GPU to query. For example: nvidia-smi pci --getErrorCounters --id=2.
Figure 5. nvidia-smi pci --getErrorCounters --id=<id number>
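The ECC report produced by nvidia-smi --id=0 -q -d ECC,PAGE_RETIREMENT lists many counters, and a healthy GPU is expected to show 0 for each of them. As a rough sketch, the report text can be scanned for SRAM and DRAM counters above zero. The function name is hypothetical, the sample is a shortened excerpt with one value changed for illustration, and the parser assumes "name : value" lines with section headings on lines of their own.

```python
def nonzero_ecc_counters(report):
    """Return (section, counter, count) for every SRAM/DRAM counter above zero."""
    findings = []
    section = ""
    for raw_line in report.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if ":" not in line:
            # A line without a colon is treated as a heading, such as "Volatile".
            section = line
            continue
        key, _, value = line.rpartition(":")
        key = key.strip().rstrip(":").strip()
        value = value.strip()
        # Only numeric SRAM/DRAM counters are checked; other fields are ignored.
        if key.startswith(("SRAM", "DRAM")) and value.isdigit() and int(value) > 0:
            findings.append((section, key, int(value)))
    return findings


# Shortened excerpt; "DRAM Correctable" is set to 2 here for illustration only.
SAMPLE = """\
ECC Errors
Volatile
SRAM Correctable : 0
DRAM Correctable : 2
Aggregate
DRAM Uncorrectable : 0
"""

for section, counter, count in nonzero_ecc_counters(SAMPLE):
    print(f"{section}: {counter} = {count}")
# prints: Volatile: DRAM Correctable = 2
```

Fields that are not numeric, such as SRAM Threshold Exceeded or the Retired Pages entries, are skipped by this sketch and still need to be reviewed by eye.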