DCGM | NVIDIA Data Center GPU Manager is a project | Monitoring library

by NVIDIA | Language: C++ | Version: v3.1.8 | License: Apache-2.0

kandi X-RAY | DCGM Summary

DCGM is a C++ library typically used in performance management and monitoring applications. It has no reported bugs or vulnerabilities, a permissive license, and low support activity. You can download it from GitHub.

NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring NVIDIA datacenter GPUs in cluster environments. It includes active health monitoring, comprehensive diagnostics, system alerts and governance policies including power and clock management. It can be used standalone by infrastructure teams and easily integrates into cluster management tools, resource scheduling and monitoring products from NVIDIA partners. DCGM simplifies GPU administration in the data center, improves resource reliability and uptime, automates administrative tasks, and helps drive overall infrastructure efficiency. DCGM supports Linux operating systems on x86_64, Arm and POWER (ppc64le) platforms. The installer packages include libraries, binaries, NVIDIA Validation Suite (NVVS) and source examples for using the API (C, Python and Go). DCGM integrates into the Kubernetes ecosystem by allowing users to gather GPU telemetry using dcgm-exporter. More information is available on DCGM's official page.
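
As a rough illustration of the telemetry path, dcgm-exporter serves its metrics in the Prometheus text exposition format. The Python sketch below parses a few lines of that format. The sample text, its label values, and the helper name parse_gauge_samples are invented for illustration; DCGM_FI_DEV_GPU_UTIL is assumed to be among the counters your exporter is configured to publish. The parser is deliberately simplified and does not handle escaped quotes inside label values.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical sample of exporter output; the values and labels are invented.
SAMPLE = '''\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 37
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 92
'''

# metric_name{label="value",...} value [timestamp]
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)(?:\s+\d+)?$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_gauge_samples(text: str) -> List[Tuple[str, Dict[str, str], float]]:
    """Return (metric name, labels, value) for each sample line in `text`."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified parser does not understand
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


for name, labels, value in parse_gauge_samples(SAMPLE):
    print(name, labels.get("gpu"), value)
```

In a real deployment the text would come from the exporter's metrics endpoint rather than from an inline string.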

Support

DCGM has a low active ecosystem.

It has 191 stars, 37 forks, and 9 watchers.

It had no major release in the last 6 months.

There are 30 open issues and 33 closed issues. On average, issues are closed in 51 days. There are 4 open pull requests and 0 closed pull requests.

It has a neutral sentiment in the developer community.

The latest version of DCGM is v3.1.8

Quality

DCGM has no bugs reported.

Security

DCGM has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

License

DCGM is licensed under the Apache-2.0 License. This license is Permissive.

Permissive licenses have the fewest restrictions, and you can use them in most projects.

Reuse

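kandi found no packaged releases for DCGM in the GitHub repository, so installing from there means building from source. Installer packages are distributed separately through the CUDA network repository (see Install DCGM).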

Installation instructions, examples and code snippets are available.

Community Discussions

Trending Discussions on DCGM

kube-prometheus-stack issue scraping metrics

On GKE, dcgm-exporter pod fails to run if the nvidia.com/gpu resource is not allocated

ValueError: The number of observations cannot be determined on an empty distance matrix

Kubernetes: Run Pods only on EC2 Nodes that have GPUs

QUESTION

kube-prometheus-stack issue scraping metrics

Asked 2021-Feb-19 at 10:50

General Cluster Information:

Kubernetes version: 1.19.13
Cloud being used: private
Installation method: kubeadm init
Host OS: Ubuntu 20.04.1 LTS
CNI and version: Weave Net: 2.7.0
CRI and version: Docker: 19.3.13

I am trying to get the kube-prometheus-stack helm chart to work. It seems to work for most targets; however, some targets stay down.

Are there any suggestions for how I can get kube-etcd, kube-controller-manager and kube-scheduler monitored by Prometheus?

I deployed the helm chart as mentioned here and applied the suggestion here to get the kube-proxy monitored by Prometheus.

Thanks in advance for any help!

EDIT 1:

...

ANSWER

Answered 2021-Feb-19 at 10:50

This is because Prometheus is monitoring the wrong endpoints of those targets and/or the targets don't expose a metrics endpoint.

Take controller-manager for example:

Change bind-address (default: 127.0.0.1):
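
A minimal sketch of that change, assuming the controller manager's container command is available as a list of arguments (as in a kubeadm static pod manifest). The helper name set_bind_address, the sample command, and the choice of 0.0.0.0 as the new address are assumptions for illustration, not code from the original answer.

```python
from typing import List

FLAG = "--bind-address="


def set_bind_address(command: List[str], address: str = "0.0.0.0") -> List[str]:
    """Return a copy of `command` with --bind-address set to `address`."""
    updated = [FLAG + address if arg.startswith(FLAG) else arg for arg in command]
    if not any(arg.startswith(FLAG) for arg in updated):
        updated.append(FLAG + address)  # flag was absent, so add it
    return updated


# Hypothetical excerpt of a controller-manager container command.
command = [
    "kube-controller-manager",
    "--bind-address=127.0.0.1",
    "--leader-elect=true",
]
print(set_bind_address(command))
```

Binding to 0.0.0.0 makes the endpoint reachable on every interface of the node, so access to it should be restricted accordingly.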

Source https://stackoverflow.com/questions/65901186

QUESTION

On GKE, dcgm-exporter pod fails to run if the nvidia.com/gpu resource is not allocated

Asked 2020-Nov-23 at 03:40

I am trying to query GPU usage metrics of GKE pods.

Here is what I did to test:

Created a GKE cluster with two node pools: one has two CPU-only nodes and the other has one node with an NVIDIA Tesla T4 GPU. All nodes are running Container-Optimized OS.
As written in https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers, I ran kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml.
Ran kubectl create -f dcgm-exporter.yaml.

...

ANSWER

Answered 2020-Nov-23 at 03:40

It worked with these changes:

Set privileged: true in the securityContext.
Add a volume mount for "nvidia-install-dir-host".
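
The two changes can be sketched in Python as a patch applied to a pod spec held as a dict. This is an illustration under assumptions: the volume name comes from the answer, but the host path, the mount path, the placeholder image, and the helper name patch_pod_spec are not given in the answer and should be checked against your own cluster.

```python
import copy
import json
from typing import Any, Dict

VOLUME_NAME = "nvidia-install-dir-host"    # name taken from the answer
HOST_PATH = "/home/kubernetes/bin/nvidia"  # assumed driver location on the node
MOUNT_PATH = "/usr/local/nvidia"           # assumed mount point in the container


def patch_pod_spec(pod_spec: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `pod_spec` with privileged mode and the driver volume."""
    patched = copy.deepcopy(pod_spec)
    for container in patched.get("containers", []):
        container.setdefault("securityContext", {})["privileged"] = True
        mounts = container.setdefault("volumeMounts", [])
        if not any(m.get("name") == VOLUME_NAME for m in mounts):
            mounts.append({"name": VOLUME_NAME, "mountPath": MOUNT_PATH})
    volumes = patched.setdefault("volumes", [])
    if not any(v.get("name") == VOLUME_NAME for v in volumes):
        volumes.append({"name": VOLUME_NAME, "hostPath": {"path": HOST_PATH}})
    return patched


# Hypothetical minimal pod spec; the image reference is a placeholder.
pod_spec = {"containers": [{"name": "dcgm-exporter", "image": "example/dcgm-exporter:tag"}]}
print(json.dumps(patch_pod_spec(pod_spec), indent=2))
```

The input dict is left untouched, and applying the patch twice gives the same result as applying it once.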

Source https://stackoverflow.com/questions/64940013

QUESTION

ValueError: The number of observations cannot be determined on an empty distance matrix

Asked 2019-Nov-21 at 14:15

I have this code which clusters detected faces, and it fails with the error in the title at:

File "..//src/clusterFaces.py", line 143, in main
    Z = linkage(distances, method='complete')

...

ANSWER

Answered 2019-Nov-21 at 14:15

Passing an empty distance matrix raises this exception. I resolved the error by passing a matrix that contains values.
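
A minimal sketch of a guard for this case, assuming distances is the condensed distance vector handed to SciPy's linkage. The helper name safe_linkage is hypothetical; it returns None instead of calling linkage when there is nothing to cluster.

```python
from typing import Sequence


def safe_linkage(distances: Sequence[float], method: str = "complete"):
    """Run SciPy's linkage only when the distance matrix is non-empty."""
    if len(distances) == 0:
        # An empty input is what triggers the ValueError, so skip clustering.
        return None
    # Imported here so the guard itself works without SciPy installed.
    from scipy.cluster.hierarchy import linkage
    return linkage(distances, method=method)


print(safe_linkage([]))
```

A condensed distance vector for n observations has n(n-1)/2 entries, so it is empty whenever fewer than two observations reach this step. That makes the upstream code that collects the observations the place to look for the root cause.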

Source https://stackoverflow.com/questions/58855138

QUESTION

Kubernetes: Run Pods only on EC2 Nodes that have GPUs

Asked 2018-Jul-16 at 09:46

I am setting up GPU monitoring on a cluster using a DaemonSet and NVIDIA DCGM. Obviously it only makes sense to monitor nodes that have a GPU.

I'm trying to use nodeSelector for this purpose, but the documentation states that:

For the pod to be eligible to run on a node, the node must have each of the indicated key-value pairs as labels (it can have additional labels as well). The most common usage is one key-value pair.

I intended to check if the label beta.kubernetes.io/instance-type was any of those:

...

ANSWER

Answered 2018-Jul-16 at 09:46

Node Affinity was the solution:
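
A hedged sketch of what such a rule can look like, built as a Python dict and printed as JSON. The label key is the one named in the question; the instance types are placeholders, and the helper name gpu_node_affinity is hypothetical.

```python
import json
from typing import Any, Dict, List

LABEL_KEY = "beta.kubernetes.io/instance-type"  # label named in the question


def gpu_node_affinity(instance_types: List[str]) -> Dict[str, Any]:
    """Build an affinity block that requires one of the given instance types."""
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": LABEL_KEY,
                                "operator": "In",
                                "values": list(instance_types),
                            }
                        ]
                    }
                ]
            }
        }
    }


# Placeholder instance types; substitute the GPU types used in your cluster.
affinity = gpu_node_affinity(["p2.xlarge", "p3.2xlarge"])
print(json.dumps(affinity, indent=2))
```

The resulting block belongs under spec.template.spec.affinity of the DaemonSet. The In operator matches a node whose label value is any entry in the list, which is the "any of those" check that a plain nodeSelector cannot express.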

Source https://stackoverflow.com/questions/51337051

Community Discussions, Code Snippets contain sources that include Stack Exchange Network

Vulnerabilities

No vulnerabilities reported

Install DCGM

DCGM installer packages are available on the CUDA network repository, and DCGM can be installed using Linux package managers.
To build from source instead, use the build image stored in ./dcgmbuild. The image can be built by:
ensuring Docker is installed and running
navigating to ./dcgmbuild
running ./build.sh