Hello,
I am an administrator of a Linux cluster with about 100 GPUs, and I need to find a way to monitor their utilization: for example, the fraction of time a GPU is actually busy, how much memory is being used, etc.
Is there a utility similar to nvidia-smi for Radeon graphics cards that would show GPU usage statistics?
Does OpenCL provide an interface that could be used to monitor GPU utilization? For example, can I poll CL_DEVICE_AVAILABLE provided by clGetDeviceInfo to estimate how frequently a GPU is busy?
Thanks,
Vlad
Try running "amdconfig --odgc" (clocks and GPU load) and "amdconfig --odgt" (temperature).
Thanks, that's useful!
Under Linux, to only get the numbers:
amdconfig --adapter=$1 --odgt | grep 'Temperature' | cut -d'-' -f2 | cut -d'.' -f1 | tr -d ' '
amdconfig --adapter=$1 --odgc | grep 'GPU load' | cut -f1 -d'%' | cut -f2 -d':'| tr -d ' '
Perfect to use in any monitor script. MRTG-config for Load is as follows:
--------------------
WorkDir: /var/www/mrtg/
Refresh: 300
Interval: 5
RunAsDaemon: Yes
NoDetach: Yes
Target[gpu0.load]: `/home/vincent/bin/gpuload 0`
MaxBytes[gpu0.load]: 99
Title[gpu0.load]: gpu0 Load
PageTop[gpu0.load]: <H1> gpu0 load</H1>
ShortLegend[gpu0.load]: %
YLegend[gpu0.load]: Load
Options[gpu0.load]: growright,nopercent, nobanner, noinfo, gauge
Unscaled[gpu0.load]: ymd
-----------------
run with: env LANG=C sudo /usr/bin/mrtg /etc/mrtg.conf
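The gpuload helper named in the Target line is not shown in this thread. A minimal sketch of what it could look like, assuming the "GPU load" line format that the one-liner above parses; the function names parse_gpu_load and mrtg_report are made up for illustration. MRTG expects an external command to print four lines: two values, an uptime string, and a target name.

```shell
#!/bin/sh
# Hypothetical gpuload helper for the MRTG Target line: gpuload <adapter>

# Read `amdconfig --odgc` output on stdin and print the integer load.
# Assumes a line of the form "GPU load :    42%", as the one-liner above does.
parse_gpu_load() {
    grep 'GPU load' | cut -f1 -d'%' | cut -f2 -d':' | tr -d ' \t'
}

# Print the four lines MRTG expects from an external command:
# value 1, value 2, uptime string, target name.
# usage: mrtg_report <load> <uptime-string> <name>
mrtg_report() {
    printf '%s\n%s\n%s\n%s\n' "$1" "$1" "$2" "$3"
}

# Only query the driver when an adapter index is given on the command line.
if [ $# -ge 1 ]; then
    load=$(amdconfig --adapter="$1" --odgc | parse_gpu_load)
    mrtg_report "${load:-0}" "$(uptime)" "gpu$1"
fi
```

The load is printed twice because this target has only one quantity of interest; the noinfo option in the config above suppresses the uptime and name lines on the generated page anyway.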
The big problem with MRTG is its coarse polling interval (5 minutes in this config), which is not ideal for what I want.
[ based on: hashcat Forum - mrtg script for monitoring temperature ]
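If the MRTG interval is too coarse, one alternative is a plain polling loop that logs timestamped samples. This is only a sketch and not part of the linked script; poll_csv is a hypothetical name, and it assumes the polled command prints a single number, as the amdconfig one-liners above do.

```shell
#!/bin/sh
# Run a command every <interval> seconds and print "epoch_seconds,value" lines.
# usage: poll_csv <interval_s> <count> <command> [args...]
# A count of 0 means "poll until interrupted".
poll_csv() {
    interval=$1
    count=$2
    shift 2
    i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        printf '%s,%s\n' "$(date +%s)" "$("$@")"
        i=$((i + 1))
        # Sleep between samples, but not after the last one.
        if [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, `poll_csv 2 0 sh -c "amdconfig --adapter=0 --odgc | grep 'GPU load' | cut -f1 -d'%' | cut -f2 -d':' | tr -d ' '" >> gpu0_load.csv` would sample adapter 0 every two seconds. The resulting file can be averaged or plotted with any tool.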
Thank you, Nou, for pitching in.
One crude way to look at it is the current clock speed. I believe GPUs clock down when there is no work and run at a lower rate, so maybe that's an indication.
Now, how do you find it? Maybe "clinfo" can help; I think it has a "GPU Clock" rate field. (If that field comes from CL_DEVICE_MAX_CLOCK_FREQUENCY, it is the maximum configured clock rather than the current one.)
There is also a tool called "gpu-z". Not sure if that can help.
By the way, CL_DEVICE_AVAILABLE says whether the device is available or not. It is not a temporal value, I guess; it is a permanent state of the device, not a measure of whether it is busy right now. There is a reason why GPUs have command queues: work submitted to a busy device simply waits in the queue.
> There is also a tool called "gpu-z". Not sure if that can help.
Not on Linux. amdconfig gets me what I need, though.
> btw... CL_DEVICE_AVAILABLE says whether the device is available (or) not... It is not a temporal value, I guess. It is a permanent state of whether a device is available (or) not...
Thanks for clarifying this! Can you recommend some documentation that explains things like this? The OpenCL specification PDF I found did not make it clear.
There is also radeontop, github.com/clbr/radeontop (see "Announcing radeontop, a tool for viewing the GPU usage"). The tool is simple: not much more than mmapping /dev/mem, polling all the bits in the GPU's GRBM_STATUS register (0x8010), and reporting a periodic average of the busy signals. The code is simple enough to modify easily for however you need to monitor. Likewise, the temperature, memory clock speed, compute engine clock speed and/or power setting can be extracted from other registers.
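For scripted monitoring without modifying the source, radeontop's dump mode may be enough. The sketch below assumes, from memory of the radeontop man page, that `-d -` dumps to stdout, that `-l N` stops after N lines, and that each dump line contains a field like "gpu 4.17%". Check this against your installed version. The function name radeontop_gpu_pct is made up for illustration.

```shell
#!/bin/sh
# Extract the overall "gpu NN.NN%" figure from radeontop dump lines on stdin.
# ASSUMPTION: lines look like
#   1395339292.0: bus 01, gpu 4.17%, ee 0.00%, vgt 0.00%, ...
radeontop_gpu_pct() {
    sed -n 's/.*gpu \([0-9.]*\)%.*/\1/p'
}
```

Under those assumptions, `sudo radeontop -d - -l 1 | radeontop_gpu_pct` would print one busy percentage per sample.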
As for memory usage, I don't believe there's a way to extract that info from the Catalyst driver. But it is simple with the Linux open-source driver (drivers/gpu/drm/radeon):
$ sudo cat /sys/kernel/debug/dri/0/radeon_vram_mm
0x00000000-0x00000040: 0x00000040: used
0x00000040-0x00000140: 0x00000100: used
0x00000140-0x00000141: 0x00000001: used
0x00000141-0x00000142: 0x00000001: used
0x00000142-0x00000143: 0x00000001: used
0x00000143-0x00000148: 0x00000005: free
0x00000148-0x00001150: 0x00001008: used
0x00001150-0x000c0000: 0x000beeb0: free
total: 786432, used 4427 free 782005
$ sudo cat /sys/kernel/debug/dri/0/radeon_gtt_mm
0x00000000-0x00000001: 0x00000001: used
0x00000001-0x00000011: 0x00000010: used
0x00000011-0x00000111: 0x00000100: used
0x00000111-0x00000211: 0x00000100: used
0x00000211-0x00000311: 0x00000100: used
0x00000311-0x00000321: 0x00000010: used
0x00000321-0x00000331: 0x00000010: used
0x00000331-0x00000431: 0x00000100: used
0x00000431-0x00020000: 0x0001fbcf: free
total: 131072, used 1073 free 129999
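To turn the summary line of either dump into a single usage figure, something like the following could work. It is a sketch that assumes the counts are 4 KiB pages, which would make the totals above 3072 MiB of VRAM and 512 MiB of GTT. The function name mm_summary is made up for illustration.

```shell
#!/bin/sh
# Summarize the "total: N, used N free N" line of radeon_vram_mm or
# radeon_gtt_mm read from stdin.
# ASSUMPTION: the counts are 4 KiB pages.
mm_summary() {
    awk '/^total:/ {
        gsub(",", "")    # drop the comma after the total so fields split cleanly
        total = $2
        used = $4
        printf "used %.1f MiB of %.1f MiB (%.2f%%)\n",
               used * 4 / 1024, total * 4 / 1024, 100 * used / total
    }'
}
```

Usage: `sudo cat /sys/kernel/debug/dri/0/radeon_vram_mm | mm_summary`. For the VRAM dump above this prints "used 17.3 MiB of 3072.0 MiB (0.56%)".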