This snap includes NVIDIA DCGM and DCGM-Exporter to manage and monitor NVIDIA GPUs via the CLI or via Prometheus metrics.
Grafana dashboards can then be used to visualize the exported metrics. See, for example:
https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard/
The snap includes the following components:
Please see the links at the bottom of the page for more details about the included components and their purpose.
How-To
How to install the snap:
sudo snap install dcgm
How to enable metrics collection:
# Start the DCGM-Exporter service (disabled by default)
sudo snap start dcgm.dcgm-exporter
# Get the metrics
curl -s localhost:9400/metrics
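The endpoint returns every exported metric in the Prometheus text format, including comment lines. As a minimal sketch, the helper below keeps only the sample lines for one metric. The function name metric_samples is hypothetical, and DCGM_FI_DEV_GPU_UTIL is assumed to be among the counters enabled on your system.

```shell
# Print only the sample lines for one metric, skipping "# HELP" / "# TYPE" comments.
# $1: metric name, e.g. DCGM_FI_DEV_GPU_UTIL (assumed to be exported).
metric_samples() {
  # A sample line starts with the metric name followed by "{" (labels) or a space.
  grep -E "^$1(\{| )"
}

# Usage on a host where the exporter is running (not executed here):
#   curl -s localhost:9400/metrics | metric_samples DCGM_FI_DEV_GPU_UTIL
```

Anchoring the pattern at the start of the line avoids matching the metric name inside the HELP and TYPE comments.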
How to configure the snap services:
The NV-Hostengine and DCGM-Exporter services can be configured via the snap CLI.
For example:
# Get all the configuration options
sudo snap get dcgm
# Set the NV-Hostengine port
sudo snap set dcgm nv-hostengine-port=5577
# Restart the NV-Hostengine service to apply the changes
sudo snap restart dcgm.nv-hostengine
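The same pattern should apply to the DCGM-Exporter service. The sketch below assumes that the exporter also needs a restart to pick up a new dcgm-exporter-address, by analogy with the NV-Hostengine example above. The helper metrics_url and the port 9500 are illustrative choices, not part of the snap.

```shell
# Derive the metrics URL for a given dcgm-exporter-address value.
# An address with an empty host part (e.g. ":9400") is reached via localhost.
metrics_url() {
  case "$1" in
    :*) printf 'localhost%s/metrics\n' "$1" ;;
    *)  printf '%s/metrics\n' "$1" ;;
  esac
}

# On a real system (needs root; not executed here):
#   sudo snap set dcgm dcgm-exporter-address=:9500
#   sudo snap restart dcgm.dcgm-exporter   # assumed necessary to apply the change
#   curl -s "$(metrics_url :9500)"
```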
Reference
Available configuration options:
nv-hostengine-port: the port on which the NV-Hostengine listens. The default is 5555.
dcgm-exporter-address: the address DCGM-Exporter binds to. The default is :9400.
dcgm-exporter-metrics-file: the name of a custom CSV metrics file to be loaded by the exporter. The path is assumed to be /var/snap/dcgm/common/. The default metrics are located in /snap/dcgm/current/etc/dcgm-exporter/default-counters.csv.
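As a hedged illustration of the dcgm-exporter-metrics-file option, the sketch below writes a small custom counters file. The file name custom-counters.csv and the function write_custom_counters are hypothetical. The two fields are assumed to be supported by your GPUs, and each line follows the "DCGM field, Prometheus metric type, help message" layout used by the default counters file.

```shell
# Write a small custom counters file into the directory given as $1.
# The directory is a parameter so the function can be tried in a scratch directory first.
write_custom_counters() {
  dir="$1"
  cat > "$dir/custom-counters.csv" <<'EOF'
# Format: DCGM field, Prometheus metric type, help message
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
EOF
}

# On a real system (needs root; not executed here):
#   write_custom_counters /var/snap/dcgm/common
#   sudo snap set dcgm dcgm-exporter-metrics-file=custom-counters.csv
#   sudo snap restart dcgm.dcgm-exporter   # assumed necessary to apply the change
```

Only the file name is passed to the snap option, since the path is assumed to be /var/snap/dcgm/common/.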
Please refer to the DCGM-Exporter repository link at the bottom of the page for more information on the CSV file format.
Cryptography
During the snap build process, snapcraft downloads the CUDA keyring deb package using curl over HTTPS and verifies its integrity using SHA256 checksums.
The CUDA keyring deb package is then used to set up the appropriate source for the DCGM deb package, whose signature is verified using the keyring.
For more information, see the CUDA keyring repository link and curl documentation at the bottom of the page.
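The download-then-verify step can be sketched roughly as follows. This is not the snap's actual build script: the function verify_sha256 is hypothetical, and the URL and digest in the usage comment are placeholders. It assumes the sha256sum tool from GNU coreutils is available.

```shell
# Succeed only if the SHA256 digest of file $1 equals the expected digest $2.
verify_sha256() {
  file="$1"
  expected="$2"
  # sha256sum -c reads "<digest>  <path>" lines; --status suppresses output
  # and reports the result through the exit code only.
  printf '%s  %s\n' "$expected" "$file" | sha256sum -c --status -
}

# Hypothetical usage (placeholder URL and digest; not executed here):
#   curl -fsSLO https://example.com/cuda-keyring.deb
#   verify_sha256 cuda-keyring.deb "<expected sha256>" || exit 1
```

Checking the digest before the package is used means a corrupted or tampered download fails the build instead of being installed.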
Links
Upstream DCGM-Exporter repository
https://github.com/NVIDIA/dcgm-exporter
Upstream DCGM repository
https://github.com/NVIDIA/DCGM
DCGM Documentation
https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/index.html
Available NVIDIA GPU metrics
https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-api/dcgm-api-field-ids.html
Repository for the CUDA keyring and DCGM deb package
https://developer.download.nvidia.com/compute/cuda/repos/
curl Documentation
https://curl.se/docs/manpage.html