5 Replies Latest reply on Oct 7, 2013 11:25 AM by VincentSC

    Monitoring GPU utilization

    vbrik

      Hello,

       

      I am an administrator of a Linux cluster with about 100 GPUs, and I need to find a way to monitor their utilization. For example, the fraction of time a GPU is actually busy, how much memory is being used, etc.

       

      Is there a utility similar to nvidia-smi for Radeon graphics cards that would show GPU usage statistics?

       

      Does OpenCL provide an interface that could be used to monitor GPU utilization? For example, can I poll CL_DEVICE_AVAILABLE provided by clGetDeviceInfo to estimate how frequently a GPU is busy?

       

       

      Thanks,

       

      Vlad

        • Re: Monitoring GPU utilization
          nou

          Try running "amdconfig --odgc" (clocks and GPU load) and "amdconfig --odgt" (temperature).

            • Re: Monitoring GPU utilization
              VincentSC

              Thanks, that's useful!

               

              Under Linux, to get only the numbers (temperature first, then load; $1 is the adapter index):

              amdconfig --adapter=$1 --odgt | grep 'Temperature' | cut -d'-' -f2 | cut -d'.' -f1 | tr -d ' '

              amdconfig --adapter=$1 --odgc | grep 'GPU load' | cut -f1 -d'%' | cut -f2 -d':'| tr -d ' '

               

              Perfect for use in any monitoring script. The MRTG config for load is as follows:

              --------------------

              WorkDir: /var/www/mrtg/
              Refresh: 300
              Interval: 5

              RunAsDaemon: Yes

              NoDetach: Yes

               

              Target[gpu0.load]: `/home/vincent/bin/gpuload 0`
              MaxBytes[gpu0.load]: 100
              Title[gpu0.load]: gpu0 Load
              PageTop[gpu0.load]: <H1> gpu0 load</H1>
              ShortLegend[gpu0.load]: %
              YLegend[gpu0.load]: Load
              Options[gpu0.load]: growright, nopercent, nobanner, noinfo, gauge
              Unscaled[gpu0.load]: ymd

              -----------------

              Run with: env LANG=C sudo /usr/bin/mrtg /etc/mrtg.conf
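The contents of /home/vincent/bin/gpuload are not shown in the post. A minimal sketch of what such a wrapper might look like follows, assuming that `amdconfig --odgc` prints a line of the form "GPU load :    42%" (inferred from the pipeline above) and relying on MRTG's convention that an external command prints four lines: two values, an uptime string, and a target name. The function names `parse_gpu_load` and `mrtg_report` are hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch of a gpuload wrapper for MRTG; not the original script.

# Extract the integer load percentage from `amdconfig --odgc` output on stdin.
# Assumes a line of the form "GPU load :    42%".
parse_gpu_load() {
    grep 'GPU load' | cut -f1 -d'%' | cut -f2 -d':' | tr -d ' '
}

# Print the four lines MRTG reads from an external command:
# value 1, value 2, uptime string, target name.
# $1 = load value, $2 = adapter index
mrtg_report() {
    echo "$1"        # first variable
    echo "$1"        # second variable (same value; only one is of interest)
    echo "unknown"   # uptime string (not tracked in this sketch)
    echo "gpu$2"     # target name
}

# Real invocation (needs the Catalyst driver), left commented out so the
# functions above can be tried on a machine without a GPU:
# mrtg_report "$(amdconfig --adapter="$1" --odgc | parse_gpu_load)" "$1"
```

With the last line uncommented, the script would be called as in the Target line above, for example `gpuload 0`.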

               

              The big problem with MRTG is its polling interval, which is too coarse for what I want.
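One possible workaround for a coarse polling interval is to sample the load more often and hand the monitoring tool an average over the period. The sketch below is only an illustration of that idea; the helper names `sample` and `average` are hypothetical, and the command being sampled is assumed to print a single integer per call.

```shell
#!/bin/sh
# Hypothetical helpers for averaging several load readings.

# Average integer samples read one per line from stdin (result truncated).
average() {
    awk '{ sum += $1; n++ } END { if (n > 0) printf "%d\n", sum / n }'
}

# Run a command repeatedly and print each of its outputs.
# $1 = number of samples, $2 = delay in seconds, remaining args = command
sample() {
    count="$1"
    delay="$2"
    shift 2
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@"
        sleep "$delay"
        i=$((i + 1))
    done
}
```

For example, `sample 60 1 some_load_command 0 | average` would take one reading per second for a minute and print the mean, where `some_load_command` stands in for any script that prints the current load as a bare number.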

               

              [ based on: hashcat Forum - mrtg script for monitoring temperature ]

            • Re: Monitoring GPU utilization
              himanshu.gautam

              Thank you, nou, for pitching in.

               

              One crude way to look at it is the current clock speed. I believe GPUs clock down and cool off when there is no work, so a reduced clock may be an indication that the GPU is idle.

              Now, how do you find it? Maybe "clinfo" can help. I think it has a clock frequency field, though that may be only the maximum clock (CL_DEVICE_MAX_CLOCK_FREQUENCY) rather than the current one.

               

              There is also a tool called "gpu-z". Not sure if that can help.

               

              By the way, CL_DEVICE_AVAILABLE says whether the device is available or not. It is not a temporal value, I guess; it is a permanent state of the device and does not tell you whether the device is currently busy. There is a reason GPUs have command queues: work submitted to a busy device simply waits in the queue.

                • Re: Re: Monitoring GPU utilization
                  vbrik

                  There is also a tool called "gpu-z". Not sure if that can help.

                  Not on Linux. amdconfig gets me what I need, though.

                   

                  btw... CL_DEVICE_AVAILABLE says whether the device is available or not... It is not a temporal value, I guess. It is a permanent state of whether a device is available or not...

                  Thanks for clarifying this! Can you recommend some documentation that explains things like this? The OpenCL specification PDF I found did not make it clear.

                • Re: Monitoring GPU utilization
                  kd2

                  There is also github.com/clbr/radeontop (see "Announcing radeontop, a tool for viewing the GPU usage"). The tool is simple: it does little more than mmap /dev/mem, poll the bits in the GPU's GRBM_STATUS register (0x8010), and report a periodic average of the busy signals. The code is simple enough to modify for whatever you need to monitor. Likewise, the temperature, memory clock speed, compute engine clock speed, and power setting can be extracted from other registers.
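The averaging step described above can be sketched in a few lines. This is an illustration only, not radeontop's code: it assumes that bit 31 of GRBM_STATUS is the overall "GUI active" flag, and it works on register values supplied as text, one hex value per line. Actually reading the register requires root access to the GPU's register space, as radeontop does, and is not shown. The function name `busy_percent` is hypothetical.

```shell
#!/bin/sh
# Read one register sample per line (hex, e.g. 0x80000000) from stdin and
# print the percentage of samples in which the busy bit was set.
# Assumption: bit 31 of GRBM_STATUS is the overall "GUI active" flag.
busy_percent() {
    total=0
    busy=0
    while read -r value; do
        [ -n "$value" ] || continue          # skip empty lines
        total=$((total + 1))
        if [ $(( ($value >> 31) & 1 )) -eq 1 ]; then
            busy=$((busy + 1))
        fi
    done
    if [ "$total" -gt 0 ]; then
        echo $((100 * busy / total))         # integer percentage
    fi
}
```

Feeding it four made-up samples, two of which have the top bit set, for example `printf '0x80000000\n0x00000000\n0xA0000000\n0x00000010\n' | busy_percent`, reports a 50 percent busy average for that window.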


                  As for memory usage, I don't believe there is a way to extract that information from the Catalyst driver. But it is simple with the Linux open-source driver (drivers/gpu/drm/radeon):


                  $ sudo cat /sys/kernel/debug/dri/0/radeon_vram_mm
                  0x00000000-0x00000040: 0x00000040: used
                  0x00000040-0x00000140: 0x00000100: used
                  0x00000140-0x00000141: 0x00000001: used
                  0x00000141-0x00000142: 0x00000001: used
                  0x00000142-0x00000143: 0x00000001: used
                  0x00000143-0x00000148: 0x00000005: free
                  0x00000148-0x00001150: 0x00001008: used
                  0x00001150-0x000c0000: 0x000beeb0: free
                  total: 786432, used 4427 free 782005

                   

                  $ sudo cat /sys/kernel/debug/dri/0/radeon_gtt_mm
                  0x00000000-0x00000001: 0x00000001: used
                  0x00000001-0x00000011: 0x00000010: used
                  0x00000011-0x00000111: 0x00000100: used
                  0x00000111-0x00000211: 0x00000100: used
                  0x00000211-0x00000311: 0x00000100: used
                  0x00000311-0x00000321: 0x00000010: used
                  0x00000321-0x00000331: 0x00000010: used
                  0x00000331-0x00000431: 0x00000100: used
                  0x00000431-0x00020000: 0x0001fbcf: free
                  total: 131072, used 1073 free 129999
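For a monitoring script, only the final "total" line of each file is usually needed. The sketch below pulls the totals out of that line and converts them to MiB. It assumes the counts are in 4 KiB pages, which fits the listings above (786432 pages would be 3 GiB of VRAM and 131072 pages would be 512 MiB of GTT) but is not stated in the post. The function name `mm_summary_mib` is hypothetical.

```shell
#!/bin/sh
# Parse the "total: N, used U free F" summary line from stdin and print
# used and total memory in MiB (truncated to whole MiB).
# Assumption: the counts are in 4 KiB pages.
mm_summary_mib() {
    awk '/^ *total:/ {
        gsub(",", "")                 # drop the comma after the total count
        total = $2; used = $4
        printf "used %d MiB of %d MiB\n", used * 4 / 1024, total * 4 / 1024
    }'
}
```

It would be used as `sudo cat /sys/kernel/debug/dri/0/radeon_vram_mm | mm_summary_mib`. For the VRAM listing above, that works out to about 17 MiB used of 3072 MiB.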
