GPU problems

Use this information to resolve problems that are related to GPUs.

Health check for GPUs

  • nvidia-smi

    Run nvidia-smi to confirm that all eight GPUs are online.

    Figure 1. nvidia-smi
  • nvidia-smi -L

    Run nvidia-smi -L to list the eight online GPUs together with their UUIDs.

    Figure 2. nvidia-smi -L
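
The check above can be automated. The following is a minimal sketch, assuming that each GPU appears in the nvidia-smi -L output as a line of the form "GPU <index>: <name> (UUID: <uuid>)". The function names and the EXPECTED_GPUS constant are hypothetical and are not part of nvidia-smi.

```python
import re
import subprocess

# Assumed shape of one `nvidia-smi -L` line:
#   GPU 0: <product name> (UUID: GPU-xxxxxxxx-...)
GPU_LINE = re.compile(r"^GPU (\d+): (.+) \(UUID: ([^)]+)\)$")

EXPECTED_GPUS = 8  # hypothetical constant: this section describes an eight-GPU system


def parse_gpu_list(text):
    """Return (index, name, uuid) tuples parsed from `nvidia-smi -L` output."""
    gpus = []
    for line in text.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:  # lines that do not describe a GPU are skipped
            gpus.append((int(match.group(1)), match.group(2), match.group(3)))
    return gpus


def gpu_count_ok(gpus, expected=EXPECTED_GPUS):
    """True when the number of listed GPUs equals the expected number."""
    return len(gpus) == expected


def list_gpus():
    """Run `nvidia-smi -L` (needs the NVIDIA driver) and parse its output."""
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
    )
    return parse_gpu_list(result.stdout)
```

On a server with the driver loaded, gpu_count_ok(list_gpus()) is expected to return True when all eight GPUs are online. A lower count suggests that at least one GPU is not visible to the driver.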
  • nvidia-smi -q --id=1 -f <output file name>

    Run nvidia-smi -q --id=1 -f <output file name> to export the inventory information of the GPU with ID 1 to a file.

    Replace <output file name> with the path of the file in which to store the output. For example: nvidia-smi -q --id=1 -f /tmp/queryoam1.txt.

    Figure 3. nvidia-smi -q --id=1 -f <output file name>
  • nvidia-smi --id=0 -q -d ECC,PAGE_RETIREMENT

    Run nvidia-smi --id=0 -q -d ECC,PAGE_RETIREMENT to display the ECC (Error Checking and Correction) error counts and the status of retired pages for the GPU with ID 0. The output resembles the following:

    ECC Mode
        Current : Enabled
        Pending : Enabled
    ECC Errors
        Volatile
            SRAM Correctable : 0
            SRAM Uncorrectable Parity : 0
            SRAM Uncorrectable SEC-DED : 0
            DRAM Correctable : 0
            DRAM Uncorrectable : 0
        Aggregate
            SRAM Correctable : 0
            SRAM Uncorrectable Parity : 0
            SRAM Uncorrectable SEC-DED : 0
            DRAM Correctable : 0
            DRAM Uncorrectable : 0
            SRAM Threshold Exceeded : No
        Aggregate Uncorrectable SRAM Sources
            SRAM L2 : 0
            SRAM SM : 0
            SRAM Microcontroller : 0
            SRAM PCIE : 0
            SRAM Other : 0
    Retired Pages
        Single Bit ECC : N/A
        Double Bit ECC : N/A
        Pending Page Blacklist : N/A
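
To scan this output for problems without reading every counter, a small parser can help. The following is a minimal sketch, assuming the output of a single GPU in the "name : value" layout shown above, with "Volatile" and "Aggregate" appearing as heading lines without a colon. The function names are hypothetical and are not part of nvidia-smi.

```python
def parse_ecc_counters(text):
    """Return {"Volatile": {...}, "Aggregate": {...}} with integer counters.

    Assumes the text covers one GPU; with several GPUs, later values
    would overwrite earlier ones.
    """
    counters = {}
    section = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if ":" not in line:
            # A line without a colon is treated as a heading, e.g. "Volatile".
            section = line
            continue
        name, _, value = line.partition(":")
        name, value = name.strip(), value.strip()
        # Keep only numeric counters under the two ECC counter headings.
        if section in ("Volatile", "Aggregate") and value.isdigit():
            counters.setdefault(section, {})[name] = int(value)
    return counters


def nonzero_uncorrectable(counters):
    """List (section, counter name, count) for uncorrectable counters above zero."""
    return [
        (section, name, count)
        for section, names in counters.items()
        for name, count in names.items()
        if "Uncorrectable" in name and count > 0
    ]
```

An empty list from nonzero_uncorrectable() means that no uncorrectable ECC errors were reported in the parsed text. Non-numeric values such as "No" or "N/A" are ignored by this sketch and still need to be reviewed manually.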
  • nvidia-smi pci --getErrorCounters

    Run nvidia-smi pci --getErrorCounters to display the PCI error counters of all eight GPUs.

    Figure 4. nvidia-smi pci --getErrorCounters
  • nvidia-smi pci --getErrorCounters --id=<id number>

    Run nvidia-smi pci --getErrorCounters --id=<id number> to display the PCI error counters of a specific GPU.

    Replace <id number> with the ID of the GPU to check. For example: nvidia-smi pci --getErrorCounters --id=2.

    Figure 5. nvidia-smi pci --getErrorCounters --id=<id number>
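
The per-GPU commands in this topic can be repeated for every GPU in a loop. The following is a minimal sketch, assuming that the eight GPUs use the IDs 0 through 7 and that the inventory files are written to /tmp. The function names, the GPU_COUNT constant, and the query_gpu<id>.txt file names are hypothetical; the commands themselves are the ones listed above.

```python
import subprocess
from pathlib import Path

GPU_COUNT = 8  # hypothetical constant: assumes GPU IDs 0 through 7


def health_check_commands(gpu_id, output_dir):
    """Build the per-GPU commands from this topic as argument lists."""
    # Hypothetical file name for the exported inventory of this GPU.
    inventory_file = Path(output_dir) / f"query_gpu{gpu_id}.txt"
    return [
        # Export the GPU inventory information to a file.
        ["nvidia-smi", "-q", f"--id={gpu_id}", "-f", str(inventory_file)],
        # Display ECC errors and the status of retired pages.
        ["nvidia-smi", f"--id={gpu_id}", "-q", "-d", "ECC,PAGE_RETIREMENT"],
        # Display the PCI error counters.
        ["nvidia-smi", "pci", "--getErrorCounters", f"--id={gpu_id}"],
    ]


def run_health_check(output_dir="/tmp", gpu_count=GPU_COUNT):
    """Run every command for every GPU; return (command, return code, stdout)."""
    results = []
    for gpu_id in range(gpu_count):
        for command in health_check_commands(gpu_id, output_dir):
            completed = subprocess.run(command, capture_output=True, text=True)
            results.append((command, completed.returncode, completed.stdout))
    return results
```

Calling run_health_check() requires the NVIDIA driver and nvidia-smi on the server. A non-zero return code in the results indicates that the corresponding command failed for that GPU and is worth examining first.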