Strategies for reducing VGPR usage, and where do the VGPRs come from?

Question asked by FangQ on Mar 28, 2019
Latest reply on Apr 1, 2019 by dipak

Aside from the detrimental memory latency issue I reported in this thread, I also noticed that my OpenCL code suffers from high VGPR usage on AMD GPUs.

For the voxel-based Monte Carlo simulator, MCXCL (https://github.com/fangq/mcxcl), rcprof reported the following metrics:

Metric                     Value
Number of VGPR used        101
Number of SGPR used        103
Amount of LDS used         260
Nbr VGPR-limited waves     8
Nbr SGPR-limited waves     28
Nbr LDS-limited waves      40
Nbr of WG-limited waves    40
Kernel occupancy           20

For my mesh-based simulator, the occupancy is even worse:

Metric                     Value
Number of VGPR used        185
Number of SGPR used        103
Amount of LDS used         4
Nbr VGPR-limited waves     4
Nbr SGPR-limited waves     28
Nbr LDS-limited waves      40
Nbr of WG-limited waves    40
Kernel occupancy           10

In both cases, the maximum number of waves is limited by the high VGPR count.
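The VGPR-limited wave counts reported by rcprof can be reproduced with simple arithmetic. The following is a minimal sketch in C, assuming GCN-style limits that the report itself does not state: 256 VGPRs per lane on each SIMD, allocation in multiples of 4 registers, at most 10 waves per SIMD, and 4 SIMDs per compute unit. The function names are hypothetical and are not part of rcprof or MCXCL.

```c
/* Assumed GCN-style hardware limits (not taken from the rcprof output). */
enum {
    VGPRS_PER_SIMD_LANE = 256, /* VGPR budget per lane on one SIMD   */
    VGPR_ALLOC_GRANULE  = 4,   /* VGPRs are allocated in groups of 4 */
    MAX_WAVES_PER_SIMD  = 10,  /* hardware cap on resident waves     */
    SIMDS_PER_CU        = 4    /* SIMD units per compute unit        */
};

/* Number of waves per compute unit that fit given the VGPR count per work-item. */
static int vgpr_limited_waves_per_cu(int vgprs_per_workitem)
{
    if (vgprs_per_workitem <= 0)
        return MAX_WAVES_PER_SIMD * SIMDS_PER_CU;

    /* Round the request up to the allocation granule. */
    int allocated = (vgprs_per_workitem + VGPR_ALLOC_GRANULE - 1)
                    / VGPR_ALLOC_GRANULE * VGPR_ALLOC_GRANULE;

    /* How many waves fit in the register file of one SIMD. */
    int waves_per_simd = VGPRS_PER_SIMD_LANE / allocated;
    if (waves_per_simd > MAX_WAVES_PER_SIMD)
        waves_per_simd = MAX_WAVES_PER_SIMD;

    return waves_per_simd * SIMDS_PER_CU;
}

/* Occupancy as a percentage of the maximum waves per compute unit. */
static int occupancy_percent(int waves_per_cu)
{
    return 100 * waves_per_cu / (MAX_WAVES_PER_SIMD * SIMDS_PER_CU);
}
```

Under these assumptions, vgpr_limited_waves_per_cu(101) gives 8 waves (20% occupancy) and vgpr_limited_waves_per_cu(185) gives 4 waves (10% occupancy), which matches the two reports. Halving the occupancy therefore only takes crossing a register-count threshold, here 128 VGPRs.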

 

However, what I don't understand is why I am using such a large number of registers. If I compile my voxel code on NVIDIA devices with the -cl-nv-verbose flag, it reports only 64 registers. I am surprised that the AMD compiler requires so many registers (VGPR=101, SGPR=103) for the same kernel.

Also, from visually inspecting my kernel, I cannot count anywhere near that many registers:

https://github.com/fangq/mcxcl/blob/master/src/mcx_core.cl#L878-L897

Where do all these registers come from? Does the AMD compiler count shared memory buffers and constant memory buffers as registers?

If anyone wants to run a test, here are the commands to produce the above reports:

git clone https://github.com/fangq/mcxcl.git
cd mcxcl/src
make clean
make
cd ../example/qtest
rcprof -o 'benchmarkmcxcl' -w `pwd` -O -p -t -T ../../bin/mcxcl -A -n 1e7 -f qtest.inp
cat benchmarkmcxcl.occupancy

Can anyone take a quick look at my kernel file and tell me where these "extra" VGPRs and SGPRs come from? With that information, I could at least attempt some manual optimization.

Thanks a lot.
