Using the nvidia-bug-report.log file to troubleshoot your system
NVIDIA provides a script that generates a log file you can use to troubleshoot issues with NVIDIA GPUs. The log file contains comprehensive information about your system, including details about individual devices, the configuration of the NVIDIA drivers, system journals, and more.
Generate the log file
To generate the log file, log in as the root user or use sudo, then run the following command:
sudo nvidia-bug-report.sh
This script generates a gzip-compressed file called nvidia-bug-report.log.gz in the current directory. To verify that the script ran successfully, run ls -la nvidia* and look for a row similar to the following:
After you generate the log archive file, you can expand it and open the log in a text editor.
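For example, you can decompress the archive with gunzip and then page through the log with less (both standard utilities on Linux systems):
gunzip nvidia-bug-report.log.gz
less nvidia-bug-report.log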
Troubleshoot with the log file
The log file is comprehensive because it collects information from many sources. The following are suggestions on where to start looking, depending on the issue you are seeing. For many of the checks below, the log file contains the output of the same Linux command (or a related one) that you can also run directly on your system.
Use the check-nvidia-bug-report shell script
To make the NVIDIA log report easier to use, Lambda provides a shell script, check-nvidia-bug-report.sh, that parses and summarizes the report. This script scans the report for:
After you generate the NVIDIA log file, run the Lambda script. The following example assumes the Lambda script and the NVIDIA log file are in the same directory:
./check-nvidia-bug-report.sh
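If running the script returns a permission error, mark it executable first (a standard step for downloaded shell scripts):
chmod +x check-nvidia-bug-report.sh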
Verify hardware with dmidecode
If you prefer to investigate the log file on your own, a good place to start is to check that all the hardware reported by the BIOS is installed, available, and seen by the system. Use dmidecode, or search for dmidecode in the log file.
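For example, the following commands (using dmidecode's standard type filters) list the processors and memory modules reported by the BIOS:
sudo dmidecode -t processor
sudo dmidecode -t memory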
Check for Xid and SXid errors
An Xid message is an NVIDIA error report that prints to the kernel log or event log. Xid messages indicate that a general GPU error occurred, typically because the driver programmed the GPU incorrectly or because the commands sent to the GPU were corrupted. The messages may indicate a hardware problem, an NVIDIA software problem, or an application problem. To understand an Xid message, read the NVIDIA documentation and review these common Xid errors.
NVIDIA drivers for NVSwitch report error conditions relating to NVSwitch hardware in kernel logs through a similar mechanism: SXid (switch Xid) messages. For more information about SXids, read appendixes D.4 through D.7 in the NVIDIA documentation.
Search the log file for Xid or SXid to see which errors are associated with them, or run dmesg as root and filter for those strings.
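For example, the following commands use a case-insensitive grep, which matches both Xid and SXid entries:
sudo dmesg | grep -i xid
grep -i xid nvidia-bug-report.log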
Check the systemd journal
To check the systemd journal, either search for journalctl entries in the log file or run the journalctl command directly (see the examples after the sample output below). You should see output similar to the following:
Aug 28 19:18:40 lambda-node jupyter[897]: [W 2024-08-28 19:18:40.342 ServerApp] ServerApp.token config is deprecated in 2.0. Use IdentityProvider.token.
Aug 28 19:18:40 lambda-node jupyter[897]: [W 2024-08-28 19:18:40.342 ServerApp] ServerApp.allow_password_change config is deprecated in 2.0. Use PasswordIde>
Aug 28 19:18:40 lambda-node jupyter[897]: [I 2024-08-28 19:18:40.349 ServerApp] Package jupyterlab took 0.0000s to import
Aug 28 19:18:40 lambda-node jupyter[897]: [I 2024-08-28 19:18:40.371 ServerApp] Package jupyter_collaboration took 0.0215s to import
Aug 28 19:18:40 lambda-node jupyter[897]: [I 2024-08-28 19:18:40.382 ServerApp] Package jupyter_lsp took 0.0103s to import
Aug 28 19:18:40 lambda-node jupyter[897]: [W 2024-08-28 19:18:40.382 ServerApp] A `_jupyter_server_extension_points` function was not found in jupyter_lsp. >
Aug 28 19:18:40 lambda-node jupyter[897]: [I 2024-08-28 19:18:40.382 ServerApp] Package jupyter_server_fileid took 0.0000s to import
Aug 28 19:18:40 lambda-node jupyter[897]: [I 2024-08-28 19:18:40.387 ServerApp] Package jupyter_server_terminals took 0.0047s to import
…
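If you run journalctl yourself, the following commands are typical starting points (journalctl supports many other filtering options): the first shows all messages from the current boot, and the second narrows the output to kernel messages that mention NVIDIA:
sudo journalctl -b
sudo journalctl -k -b | grep -i nvidia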
Check your NVLink topology
To confirm that your NVLink topology is correct, search the log file for nvidia-smi nvlink --status output or run the following command:
nvidia-smi nvlink --status
This command checks the status of each NVLink connection for each GPU. The output shows information about each NVLink, including the utilization and active or inactive status. It’s similar to the following truncated output for an eight-GPU system:
Fine-tune the nvidia-smi output
The nvidia-smi command can return a wealth of content. You can fine-tune your results to isolate a specific issue by using the -d (display) option along with the -q (query) option. For example, if you suspect a memory issue, you can choose to display only memory-related output.
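A command like the following does this (MEMORY is one of the values accepted by the -d option):
nvidia-smi -q -d MEMORY
In addition to MEMORY, the -d option accepts the following values: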
UTILIZATION: displays GPU, memory, and encode/decode utilization rates, including sampling data with maximum, minimum, and average.
ECC: displays error correction code mode and errors.
TEMPERATURE: displays temperature data for the GPU.
POWER: displays power readings, including sampling data with maximum, minimum, and average.
CLOCK: displays data for all the clocks in the GPU, including sampling data with maximum, minimum, and average.
COMPUTE: displays the compute mode for the GPU.
PIDS: displays running processes.
PERFORMANCE: displays performance information for the GPU.
SUPPORTED_CLOCKS: displays the supported frequencies for the GPU clocks.
PAGE_RETIREMENT: when ECC is enabled, this option displays any framebuffer pages that have been dynamically retired.
ACCOUNTING: displays which processes are subject to accounting, how many processes are subject to accounting, and whether accounting mode is enabled.
You can specify multiple options by separating them with commas. For example, you can display information about both memory and power usage at the same time.
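One way to do this, combining the MEMORY and POWER values:
nvidia-smi -q -d MEMORY,POWER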
If you can’t discover the cause of the issue you are experiencing, contact Lambda Support, then generate and upload the Lambda bug report, which includes data from the NVIDIA bug report. For example: