DCGM | NVIDIA Data Center GPU Manager is a project | Monitoring library

 by NVIDIA | C++ | Version: v3.1.8 | License: Apache-2.0

kandi X-RAY | DCGM Summary

DCGM is a C++ library typically used in performance management and monitoring applications. It has no reported bugs or vulnerabilities, a permissive license, and low community support. The source code is available on GitHub.

NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring NVIDIA datacenter GPUs in cluster environments. It includes active health monitoring, comprehensive diagnostics, system alerts and governance policies including power and clock management. It can be used standalone by infrastructure teams and easily integrates into cluster management tools, resource scheduling and monitoring products from NVIDIA partners. DCGM simplifies GPU administration in the data center, improves resource reliability and uptime, automates administrative tasks, and helps drive overall infrastructure efficiency. DCGM supports Linux operating systems on x86_64, Arm and POWER (ppc64le) platforms. The installer packages include libraries, binaries, NVIDIA Validation Suite (NVVS) and source examples for using the API (C, Python and Go). DCGM integrates into the Kubernetes ecosystem by allowing users to gather GPU telemetry using dcgm-exporter. More information is available on DCGM's official page.

            Support

              DCGM has a low-activity ecosystem.
              It has 191 stars, 37 forks, and 9 watchers.
              It had no major release in the last 6 months.
              There are 30 open issues and 33 closed issues; on average, issues are closed in 51 days. There are 4 open pull requests and 0 closed pull requests.
              It has a neutral sentiment in the developer community.
              The latest version of DCGM is v3.1.8.

            Quality

              DCGM has no bugs reported.

            Security

              DCGM has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

            License

              DCGM is licensed under the Apache-2.0 License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

            Reuse

              kandi reports no release packages for DCGM, so building from source and installing is one option; the Install DCGM section covers packaged installers.
              Installation instructions, examples and code snippets are available.

            Community Discussions

            QUESTION

            kube-prometheus-stack issue scraping metrics
            Asked 2021-Feb-19 at 10:50

            General Cluster Information:

            • Kubernetes version: 1.19.13
            • Cloud being used: private
            • Installation method: kubeadm init
            • Host OS: Ubuntu 20.04.1 LTS
            • CNI and version: Weave Net: 2.7.0
            • CRI and version: Docker: 19.3.13

            I am trying to get the kube-prometheus-stack Helm chart to work. It works for most targets; however, some targets stay down.

            Are there any suggestions for how I can get kube-etcd, kube-controller-manager and kube-scheduler monitored by Prometheus?

            I deployed the Helm chart as described in its instructions and applied a suggested fix to get kube-proxy monitored by Prometheus.

            Thanks in advance for any help!

            EDIT 1:

            ...

            ANSWER

            Answered 2021-Feb-19 at 10:50

            This happens because Prometheus is scraping the wrong endpoints for those targets, and/or the targets do not expose a metrics endpoint.

            Take controller-manager for example:

            1. Change bind-address (default: 127.0.0.1):

            Source https://stackoverflow.com/questions/65901186

            QUESTION

            On GKE, dcgm-exporter pod fails to run if the nvidia.com/gpu resource is not allocated
            Asked 2020-Nov-23 at 03:40

            I am trying to query the GPU usage metrics of GKE pods.

            Here is what I did to test:

            1. Created a GKE cluster with two node pools: one has two CPU-only nodes and the other has one node with an NVIDIA Tesla T4 GPU. All nodes are running Container-Optimized OS.
            2. As written in https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers, I ran kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml.
            3. kubectl create -f dcgm-exporter.yaml
            ...

            ANSWER

            Answered 2020-Nov-23 at 03:40

            It worked after these two changes:

            1. Set privileged: true in securityContext.
            2. Add a volume mount for "nvidia-install-dir-host".
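
The two changes in this answer can be sketched as a small patch applied to a DaemonSet manifest that has already been loaded into a Python dict. This is only an illustration under stated assumptions: the function name `patch_dcgm_exporter`, the `container_index` parameter, and the example mount path are hypothetical and do not come from the answer, which names only the `privileged` flag and the volume name. The answer also does not show the matching `volumes` entry in the pod spec, so the sketch leaves that part untouched.

```python
import copy

# Volume name taken from the answer; everything else here is illustrative.
VOLUME_NAME = "nvidia-install-dir-host"


def patch_dcgm_exporter(daemonset, mount_path, container_index=0):
    """Return a patched copy of a DaemonSet manifest given as a nested dict."""
    patched = copy.deepcopy(daemonset)
    pod_spec = patched["spec"]["template"]["spec"]
    container = pod_spec["containers"][container_index]

    # Change 1: set privileged: true in the container's securityContext.
    container.setdefault("securityContext", {})["privileged"] = True

    # Change 2: add a volume mount for the named volume, unless one exists.
    mounts = container.setdefault("volumeMounts", [])
    if not any(m.get("name") == VOLUME_NAME for m in mounts):
        mounts.append({"name": VOLUME_NAME, "mountPath": mount_path})

    return patched


if __name__ == "__main__":
    # Minimal stand-in manifest; a real dcgm-exporter.yaml has more fields.
    manifest = {
        "spec": {"template": {"spec": {"containers": [{"name": "dcgm-exporter"}]}}}
    }
    # The mount path below is a placeholder, not a value from the answer.
    result = patch_dcgm_exporter(manifest, mount_path="/usr/local/nvidia")
    print(result["spec"]["template"]["spec"]["containers"][0])
```

Working on a deep copy keeps the original manifest unchanged, which makes it easy to compare the manifest before and after the patch.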

            Source https://stackoverflow.com/questions/64940013

            QUESTION

            ValueError: The number of observations cannot be determined on an empty distance matrix
            Asked 2019-Nov-21 at 14:15

            ANSWER

            Answered 2019-Nov-21 at 14:15

            Passing an empty matrix raises this exception. I resolved the error by passing a matrix that contains values.

            Source https://stackoverflow.com/questions/58855138

            QUESTION

            Kubernetes: Run Pods only on EC2 Nodes that have GPUs
            Asked 2018-Jul-16 at 09:46

            I am setting up GPU monitoring on a cluster using a DaemonSet and NVIDIA DCGM. Obviously it only makes sense to monitor nodes that have a GPU.

            I'm trying to use nodeSelector for this purpose, but the documentation states that:

            For the pod to be eligible to run on a node, the node must have each of the indicated key-value pairs as labels (it can have additional labels as well). The most common usage is one key-value pair.

            I intended to check if the label beta.kubernetes.io/instance-type was any of those:

            ...
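
The nodeSelector rule quoted in this question can be modelled in a few lines, which shows why a single selector cannot express "label is any of several values". This is a sketch of the rule as the quoted documentation states it, not Kubernetes source code. The function name `node_matches_selector` and the instance-type values are illustrative assumptions; the list of instance types in the question is elided and is not reproduced here.

```python
INSTANCE_TYPE_LABEL = "beta.kubernetes.io/instance-type"


def node_matches_selector(node_labels, node_selector):
    """True if the node carries every key-value pair listed in the selector.

    Extra labels on the node are allowed; a missing or different value is not.
    """
    return all(
        key in node_labels and node_labels[key] == value
        for key, value in node_selector.items()
    )


if __name__ == "__main__":
    # Hypothetical nodes; the instance-type values are examples only.
    node_a = {INSTANCE_TYPE_LABEL: "p2.xlarge", "example/extra-label": "yes"}
    node_b = {INSTANCE_TYPE_LABEL: "p3.2xlarge"}

    selector = {INSTANCE_TYPE_LABEL: "p2.xlarge"}
    print(node_matches_selector(node_a, selector))
    print(node_matches_selector(node_b, selector))
```

Because the selector is a plain mapping, each key can hold only one value, so one selector can target one instance type but not a set of them.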

            ANSWER

            Answered 2018-Jul-16 at 09:46

            Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install DCGM

            DCGM installer packages are available on the CUDA network repository, and DCGM can be installed using Linux package managers.
            To build from source, a build image is stored in ./dcgmbuild. The image can be built by:
            • ensuring Docker is installed and running
            • navigating to ./dcgmbuild
            • running ./build.sh

            Support

            For information on platform support, getting started and using DCGM APIs, visit the official documentation repository.

            CLONE
          • HTTPS

            https://github.com/NVIDIA/DCGM.git

          • CLI

            gh repo clone NVIDIA/DCGM

          • sshUrl

            git@github.com:NVIDIA/DCGM.git

            Explore Related Topics

            Consider Popular Monitoring Libraries

            netdata

            by netdata

            sentry

            by getsentry

            skywalking

            by apache

            osquery

            by osquery

            cat

            by dianping

            Try Top Libraries by NVIDIA

            DeepLearningExamples

            by NVIDIA (Jupyter Notebook)

            FastPhotoStyle

            by NVIDIA (Python)

            vid2vid

            by NVIDIA (Python)

            TensorRT

            by NVIDIA (C++)