
vbrik
Journeyman III

Monitoring GPU utilization

Hello,

I am an administrator of a Linux cluster with about 100 GPUs, and I need to find a way to monitor their utilization. For example, the fraction of time a GPU is actually busy, how much memory is being used, etc.

Is there a utility similar to nvidia-smi for Radeon graphics cards that would show GPU usage statistics?

Does OpenCL provide an interface that could be used to monitor GPU utilization? For example, can I poll CL_DEVICE_AVAILABLE provided by clGetDeviceInfo to estimate how frequently a GPU is busy?

Thanks,

Vlad

nou
Exemplar

Try running "amdconfig --odgc" (clocks and GPU load) and "amdconfig --odgt" (temperature).


Thanks, that's useful!

Under Linux, to get only the numbers (here $1 is the adapter index passed to the script):

amdconfig --adapter=$1 --odgt | grep 'Temperature' | cut -d'-' -f2 | cut -d'.' -f1 | tr -d ' '

amdconfig --adapter=$1 --odgc | grep 'GPU load' | cut -f1 -d'%' | cut -f2 -d':'| tr -d ' '
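
For anyone who would rather do the parsing in a script than in a pipeline, here is a rough Python sketch of the same two extractions. The function names are made up for illustration, and the line layouts shown in the comments are assumptions inferred from the grep/cut patterns above; the real amdconfig output may differ between driver versions.

```python
import re
import subprocess


def parse_gpu_load(text):
    """Return the integer percentage from the 'GPU load' line, or None."""
    for line in text.splitlines():
        if "GPU load" in line:
            # Assumed layout: "GPU load :    42%"
            match = re.search(r":\s*(\d+)\s*%", line)
            if match:
                return int(match.group(1))
    return None


def parse_temperature(text):
    """Return the whole-degree temperature from the 'Temperature' line, or None."""
    for line in text.splitlines():
        if "Temperature" in line:
            # Assumed layout: "Sensor 0: Temperature - 62.00 C"
            match = re.search(r"-\s*(\d+)", line)
            if match:
                return int(match.group(1))
    return None


def query(adapter, flag):
    """Run amdconfig for one adapter; flag is '--odgc' or '--odgt'."""
    result = subprocess.run(
        ["amdconfig", f"--adapter={adapter}", flag],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling parse_gpu_load(query(0, "--odgc")) would then give the load for adapter 0, and parse_temperature(query(0, "--odgt")) the temperature, both truncated to whole numbers as in the pipelines.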

Perfect for use in any monitoring script. The MRTG config for load is as follows:

--------------------

WorkDir: /var/www/mrtg/
Refresh: 300
Interval: 5

RunAsDaemon: Yes

NoDetach: Yes

Target[gpu0.load]: `/home/vincent/bin/gpuload 0`
MaxBytes[gpu0.load]: 99
Title[gpu0.load]: gpu0 Load
PageTop[gpu0.load]: <H1> gpu0 load</H1>
ShortLegend[gpu0.load]: %
YLegend[gpu0.load]: Load
Options[gpu0.load]: growright,nopercent, nobanner, noinfo, gauge
Unscaled[gpu0.load]: ymd

-----------------

run with: env LANG=C sudo /usr/bin/mrtg /etc/mrtg.conf
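
A note on the Target line: MRTG expects an external command to print four lines (first value, second value, an uptime string, and a target name). The gpuload script from the config is not shown here, so the following is only a sketch of what such a wrapper might emit; mrtg_lines and its default name are hypothetical.

```python
def mrtg_lines(load, name="gpu0"):
    """Format one reading as the four lines MRTG reads from an external target."""
    value = max(0, min(int(load), 100))  # clamp to a sane percentage
    return "\n".join([
        str(value),  # line 1: first ("in") variable
        str(value),  # line 2: second ("out") variable; same value repeated
        "",          # line 3: uptime string, left blank in this sketch
        name,        # line 4: target name
    ])
```

For example, print(mrtg_lines(42)) writes 42 twice, then an empty line, then gpu0.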

The big problem with MRTG is its polling interval, which is not ideal for what I want.

[ based on: hashcat Forum - mrtg script for monitoring temperature ]

himanshu_gautam
Grandmaster

Thank you, nou, for pitching in.

One crude approach is to look at the current clock speed. I believe GPUs clock down and cool off when there is no work, so a low clock may be an indication that the GPU is idle.

Now, how do you find it? "clinfo" might help; I think it has a "GPU Clock" rate field.

There is also a tool called "gpu-z". Not sure if that can help.

By the way, CL_DEVICE_AVAILABLE says whether the device is available or not. It is not a temporal value, I guess; it is a permanent state of the device. There is a reason why GPUs have command queues.


There is also a tool called "gpu-z". Not sure if that can help.



Not on Linux. amdconfig gets me what I need, though.


By the way, CL_DEVICE_AVAILABLE says whether the device is available or not. It is not a temporal value, I guess; it is a permanent state of the device.



Thanks for clarifying this! Can you recommend some documentation that explains things like this? The OpenCL specification PDF I found did not make it clear.

kd2
Adept II

There is also radeontop, at github.com/clbr/radeontop (see "Announcing radeontop, a tool for viewing the GPU usage"). The tool is simple: it does little more than mmap /dev/mem, poll the bits of the GPU's GRBM_STATUS register (0x8010), and report a periodic average of the busy signals. The code is simple enough to modify for whatever monitoring you need. Likewise, the temperature, memory clock speed, compute engine clock speed, and power setting can be extracted from other registers.
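
To make the sampling idea concrete, here is a minimal Python sketch of "poll a status word and average the busy bit". It is not radeontop's code: the register read is abstracted behind a caller-supplied read_status function (hypothetical), and treating bit 31 as the overall busy flag is an assumption that should be checked against the register documentation for your chip.

```python
import time

GUI_ACTIVE = 1 << 31  # assumption: top bit of GRBM_STATUS means "GPU busy"


def busy_percent(read_status, samples=100, interval=0.0, mask=GUI_ACTIVE):
    """Poll read_status() `samples` times; return the % of polls with mask set."""
    if samples <= 0:
        raise ValueError("samples must be positive")
    busy = 0
    for _ in range(samples):
        if read_status() & mask:
            busy += 1
        if interval:
            time.sleep(interval)  # spacing between polls, in seconds
    return 100.0 * busy / samples
```

The result is only a statistical estimate: more samples per reporting period give a smoother number at the cost of more polling overhead.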


As for memory usage, I don't believe there is a way to extract that information from the Catalyst driver. It is simple with the Linux open-source driver (drivers/gpu/drm/radeon), though:


$ sudo cat /sys/kernel/debug/dri/0/radeon_vram_mm
0x00000000-0x00000040: 0x00000040: used
0x00000040-0x00000140: 0x00000100: used
0x00000140-0x00000141: 0x00000001: used
0x00000141-0x00000142: 0x00000001: used
0x00000142-0x00000143: 0x00000001: used
0x00000143-0x00000148: 0x00000005: free
0x00000148-0x00001150: 0x00001008: used
0x00001150-0x000c0000: 0x000beeb0: free
total: 786432, used 4427 free 782005

$ sudo cat /sys/kernel/debug/dri/0/radeon_gtt_mm
0x00000000-0x00000001: 0x00000001: used
0x00000001-0x00000011: 0x00000010: used
0x00000011-0x00000111: 0x00000100: used
0x00000111-0x00000211: 0x00000100: used
0x00000211-0x00000311: 0x00000100: used
0x00000311-0x00000321: 0x00000010: used
0x00000321-0x00000331: 0x00000010: used
0x00000331-0x00000431: 0x00000100: used
0x00000431-0x00020000: 0x0001fbcf: free
total: 131072, used 1073 free 129999
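
For monitoring, only the final "total:" line of each file is usually needed. Below is a small Python sketch that pulls the three counts out of that line. The function names are made up, and the conversion assumes the counts are 4 KiB pages, which is an assumption on my part (it is consistent with the VRAM total above working out to 3 GiB).

```python
import re

PAGE_SIZE = 4096  # assumption: counts are in 4 KiB pages


def parse_mm_totals(text):
    """Return (total, used, free) page counts from the trailing 'total:' line."""
    match = re.search(r"total:\s*(\d+),\s*used\s+(\d+)\s+free\s+(\d+)", text)
    if match is None:
        raise ValueError("no 'total:' line found")
    return tuple(int(group) for group in match.groups())


def pages_to_mib(pages, page_size=PAGE_SIZE):
    """Convert a page count to MiB."""
    return pages * page_size / (1024 * 1024)
```

Reading the debugfs files needs root, so the text would have to be collected by a privileged helper and passed to parse_mm_totals.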