Collecting service data

To identify the root cause of a rack solution issue, or at the request of Lenovo Support, you might need to collect service data for further analysis. Service data includes information such as event logs and hardware inventory.

Service data can be collected through the following tools:

Compute tray BMC FFDC logs

  1. Navigate to the Maintenance > Save Server Data page within the compute tray BMC.
  2. Click Download Server Data to download FFDC log information for problem escalation. These logs contain the following data for a single compute tray:
    • System inventory information
    • System event logs (SEL)
    • Sensor status
Figure 1. Compute tray BMC FFDC logs

NVDebug logs

Use the NVDebug tool to gather out-of-band (OOB) diagnostic logs from one or more compute trays, NVLink switch trays, or power shelves. The tool interfaces with the device BMC to capture data for escalation. Before running the tool, update the configuration file (included in the tool package) with the target device connection information.

  1. Connect a client device via a hub to the RJ-45 OS management port (1) and RJ-45 BMC management port (2) on a compute tray using two cables to establish the tray as the NVDebug host.
    RJ-45 BMC and OS management ports
  2. Download the tool from NVOnline and copy it to the client device.

  3. Unzip the package with the following two commands:
    sudo unzip <NV Debug tool file name>.zip
    sudo tar zxvf nvdebug-linux-arm64-<version name>.tar.gz
    Note
    For laptop environments, use the AMD64 package instead of the ARM64 package.
  4. Open the tool_config.yaml file in a text editor and set the SKIP_BMC_SSH_LOGS parameter to false:
    vim tool_config.yaml
    SKIP_BMC_SSH_LOGS: false
  5. Choose to run the debug tool on either a single compute tray or multiple trays simultaneously:
    • Single compute tray: Execute the following command:
      sudo ./nvdebug collect -i <target node BMC IP address> -u <BMC Username> -p <BMC Password> -b "GB300 NVL" -I <target node OS IP address> -U <OS Username> -H <OS Password> -r <BMC Username> -w <BMC Password>
      Note
      See the NVDebug User Guide (located in the tool’s ZIP file) for complete platform parameter definitions.
    • Multiple compute trays simultaneously:
      1. Prepare the configuration files first. These files are located in the unzipped tool package folder. Use a text editor to update the following:
        • config.yaml
        • dut_config.yaml
      2. In the config.yaml file, set PLATFORM to arm64 and TargetBaseboard to the corresponding device type. Use the following values:
        • GB300 NVL for compute trays
        • GB300 NVSwitchTray for NVLink switch trays
        • PowerShelfController for power shelves
        Then, set SKIP_BMC_SSH_LOGS to false.

      3. In the config.yaml file, update the BMC IP address and credentials for the target compute tray.

      4. Run the following five commands on the host device to set up the environment:
        sudo apt install ipmitool nvme-cli pciutils dmidecode lshw opensm
        sudo apt install nvidia-fabricmanager-580
        ssh-keygen -t rsa -b 4096 -f ~/.ssh/nvdebug_key
        ssh-copy-id -i ~/.ssh/nvdebug_key.pub nvidia@<host IP address>
        ssh-copy-id -i ~/.ssh/nvdebug_key.pub sysadmin@<target node BMC IP address>
      5. Run the following two commands on the client device to set up the host environment:
        sudo apt install ipmitool nvme-cli pciutils dmidecode lshw opensm nvlsm
        sudo apt install nvidia-fabricmanager-580
      6. Run the following command to edit the OpenSSH server daemon configuration file:
        sudo vim /etc/ssh/sshd_config
        Then, add the following parameters to the file:
        • PubkeyAuthentication yes
        • AuthorizedKeysFile .ssh/authorized_keys


      7. On the host, execute the following command to edit the sudo configuration file:
        sudo vim /etc/sudoers
        Add the following entry to the file:
        nvidia ALL=(ALL) NOPASSWD:ALL


      8. Execute the NVDebug tool with the following command:
        sudo ./nvdebug -i <BMC IP address> -u <username> -p <password> -t arm64 -b "<hardware platform>" -I <host IP address> -U nvidia -H nvidia -o .


        Note
        Log collection for a single compute tray takes approximately ten minutes to complete. Upon completion, the tool generates a ZIP file within its directory; use this file for problem escalation.
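
The procedure above includes two manual file edits: setting SKIP_BMC_SSH_LOGS to false in the tool configuration file, and adding the PubkeyAuthentication and AuthorizedKeysFile parameters to the sshd configuration. The following is a minimal sketch of how those edits could be scripted so they can be repeated safely. The helper names set_yaml_key and ensure_sshd_option are hypothetical and are not part of the NVDebug tool. The sketch assumes that SKIP_BMC_SSH_LOGS is a top-level key in the YAML file and that values do not contain the characters |, &, or \.

```shell
#!/usr/bin/env bash
# Sketch only: these helpers are illustrative and are not shipped with NVDebug.

# append_line FILE LINE
# Append LINE to FILE, first adding a newline if FILE does not end with one.
append_line() {
    local file="$1" line="$2"
    if [ -s "$file" ] && [ -n "$(tail -c1 "$file")" ]; then
        printf '\n' >> "$file"
    fi
    printf '%s\n' "$line" >> "$file"
}

# replace_matching FILE REGEX LINE
# Replace every line in FILE that matches REGEX with LINE.
replace_matching() {
    local file="$1" regex="$2" line="$3" tmp
    tmp=$(mktemp)
    sed -E "s|${regex}|${line}|" "$file" > "$tmp"
    cat "$tmp" > "$file"   # keep the original file's owner and permissions
    rm -f "$tmp"
}

# set_yaml_key FILE KEY VALUE
# Set a top-level "KEY: VALUE" entry (assumption: the key is not nested).
set_yaml_key() {
    local file="$1" key="$2" value="$3"
    if grep -qE "^${key}:" "$file"; then
        replace_matching "$file" "^${key}:.*" "${key}: ${value}"
    else
        append_line "$file" "${key}: ${value}"
    fi
}

# ensure_sshd_option FILE OPTION VALUE
# Ensure an active (uncommented) "OPTION VALUE" line exists in FILE.
# Commented-out lines are left untouched and a new line is appended instead.
ensure_sshd_option() {
    local file="$1" option="$2" value="$3"
    if grep -qE "^[[:space:]]*${option}[[:space:]]" "$file"; then
        replace_matching "$file" "^[[:space:]]*${option}[[:space:]].*" "${option} ${value}"
    else
        append_line "$file" "${option} ${value}"
    fi
}

# Example usage (commented out; try it on copies of the files first):
#   set_yaml_key tool_config.yaml SKIP_BMC_SSH_LOGS false
#   ensure_sshd_option sshd_config.copy PubkeyAuthentication yes
#   ensure_sshd_option sshd_config.copy AuthorizedKeysFile .ssh/authorized_keys
```

Running a helper twice leaves the file unchanged the second time. Note that ensure_sshd_option appends new options at the end of the file; if the sshd configuration ends with a Match block, the appended line would apply only within that block, so review the result before restarting the SSH service.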