Nvidia GPU monitoring with Netdata
Monitors performance metrics (memory usage, fan speed, PCIe bandwidth utilization, temperature, etc.) using the nvidia-smi CLI tool.
Requirements and Notes
- You must have the nvidia-smi tool installed, and your NVIDIA GPU(s) must support the tool. Support is mostly limited to the newer high-end models used for AI/ML and crypto, or the Pro range; read more about nvidia_smi.
- You must enable this plugin, as it's disabled by default due to minor performance issues:
  cd /etc/netdata   # Replace this path with your Netdata config directory, if different
  sudo ./edit-config python.d.conf
  Remove the '#' before nvidia_smi so it reads: nvidia_smi: yes.
- On some systems, when the GPU is idle the nvidia-smi tool unloads, and there is added latency again when it is next queried. If you are running GPUs under constant workload this isn't likely to be an issue.
- Currently the nvidia-smi tool is being queried via the CLI. Updating the plugin to use the NVIDIA C/C++ API directly should resolve this issue. See discussion here: https://github.com/netdata/netdata/pull/4357
- Contributions are welcome.
- Make sure the netdata user can execute /usr/bin/nvidia-smi or wherever your binary is.
- If the nvidia-smi process is not killed after a Netdata restart, you need to turn off loop_mode.
- poll_seconds is how often the tool is polled, in seconds, as an integer.
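The requirement that the netdata user can execute the binary can be checked with a short script. The following is a minimal sketch, not part of the plugin: the function name `find_executable` and its fallback argument are hypothetical, and it only tests whether the current user can find and execute the file.

```python
import os
import shutil


def find_executable(name="nvidia-smi", fallback="/usr/bin/nvidia-smi"):
    """Return a path the current user can execute, or None.

    Hypothetical helper for illustration; the fallback path is the
    location mentioned in the notes above and may differ on your system.
    """
    # shutil.which searches PATH and only returns executable files.
    path = shutil.which(name)
    if path is None and os.path.isfile(fallback) and os.access(fallback, os.X_OK):
        path = fallback
    return path


if __name__ == "__main__":
    found = find_executable()
    if found:
        print("nvidia-smi is executable at", found)
    else:
        print("nvidia-smi not found or not executable for this user")
```

Saved under any file name and run as the netdata user (for example via `sudo -u netdata python3` followed by the script path), it reports what that user can actually execute rather than what root can.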
Charts
It produces the following charts:
- PCI Express Bandwidth Utilization in KiB/s
- Fan Speed in percentage
- GPU Utilization in percentage
- Memory Bandwidth Utilization in percentage
- Encoder/Decoder Utilization in percentage
- Memory Usage in MiB
- Temperature in celsius
- Clock Frequencies in MHz
- Power Utilization in Watts
- Memory Used by Each Process in MiB
- Memory Used by Each User in MiB
- Number of Users on GPU in num
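To give a feel for how CLI output could become chart values, here is a rough sketch that parses XML of the kind produced by `nvidia-smi -q -x`. It rests on assumptions: the sample document is hand-written and heavily abridged, the element names may vary between driver versions, and the helper names `number` and `read_gpus` are hypothetical rather than taken from the plugin.

```python
import xml.etree.ElementTree as ET

# Hand-written, abridged sample for illustration only. Real output from
# `nvidia-smi -q -x` has many more fields and may differ by driver version.
SAMPLE_XML = """
<nvidia_smi_log>
  <gpu id="00000000:01:00.0">
    <product_name>Example GPU</product_name>
    <fan_speed>35 %</fan_speed>
    <fb_memory_usage>
      <total>8192 MiB</total>
      <used>1024 MiB</used>
    </fb_memory_usage>
    <utilization>
      <gpu_util>12 %</gpu_util>
    </utilization>
    <temperature>
      <gpu_temp>48 C</gpu_temp>
    </temperature>
  </gpu>
</nvidia_smi_log>
"""


def number(text):
    """Convert a value such as '35 %' to a float; None for 'N/A' or missing."""
    parts = text.split() if text else []
    if not parts:
        return None
    try:
        return float(parts[0])
    except ValueError:
        return None


def read_gpus(xml_text):
    """Return one dict of chart-style values per <gpu> element."""
    root = ET.fromstring(xml_text)
    gpus = []
    for gpu in root.findall("gpu"):
        gpus.append({
            "id": gpu.get("id"),
            "fan_speed_percent": number(gpu.findtext("fan_speed")),
            "gpu_util_percent": number(gpu.findtext("utilization/gpu_util")),
            "memory_used_mib": number(gpu.findtext("fb_memory_usage/used")),
            "temperature_celsius": number(gpu.findtext("temperature/gpu_temp")),
        })
    return gpus


if __name__ == "__main__":
    for gpu in read_gpus(SAMPLE_XML):
        print(gpu)
```

Values that a GPU does not report (often shown as "N/A") come back as None, which mirrors the note that your GPU must support the tool for every chart to be populated.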
Configuration
Edit the python.d/nvidia_smi.conf configuration file using edit-config from the Netdata config
directory, which is typically at /etc/netdata.
cd /etc/netdata   # Replace this path with your Netdata config directory, if different
sudo ./edit-config python.d/nvidia_smi.conf
Sample:
loop_mode    : yes
poll_seconds : 1
exclude_zero_memory_users : yes
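The relationship between loop_mode and poll_seconds can be pictured as a choice between two nvidia-smi invocations. The sketch below is an assumption about how such a command might be assembled, not the plugin's actual code: `build_command` is a hypothetical helper, and it relies only on the nvidia-smi flags `-q` (query), `-x` (XML output) and `-l` (repeat every N seconds).

```python
def build_command(loop_mode=True, poll_seconds=1, binary="nvidia-smi"):
    """Build an nvidia-smi argument list for one-shot or looping queries.

    Hypothetical helper for illustration; defaults mirror the sample config.
    """
    # poll_seconds must be an integer, as the notes above state.
    if isinstance(poll_seconds, bool) or not isinstance(poll_seconds, int):
        raise ValueError("poll_seconds must be an integer")
    if poll_seconds < 1:
        raise ValueError("poll_seconds must be at least 1")

    command = [binary, "-q", "-x"]
    if loop_mode:
        # -l keeps a single nvidia-smi process alive and repeats the query.
        command += ["-l", str(poll_seconds)]
    return command


if __name__ == "__main__":
    print(build_command(loop_mode=True, poll_seconds=1))
    print(build_command(loop_mode=False))
```

A long-lived looping process avoids the start-up latency described in the notes, but it is also the kind of process that can be left behind after a restart, which is why turning loop_mode off is the suggested remedy.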