OpenCL

FangQ
Adept I

Strategies on reducing VGPR usage - and, where do they come from?

Aside from the detrimental memory latency issue I reported in this thread, I also noticed that my OpenCL code suffers from high VGPR usage on AMD GPUs.

For the voxel-based Monte Carlo simulator, MCXCL (https://github.com/fangq/mcxcl), rcprof reported the following metrics:

Number of VGPR used | Number of SGPR used | Amount of LDS used
101 | 103 | 260

Nbr VGPR-limited waves | Nbr SGPR-limited waves | Nbr LDS-limited waves | Nbr of WG-limited waves | Kernel occupancy (%)
8 | 28 | 40 | 40 | 20

For my mesh-based simulator, the occupancy is even worse:

Number of VGPR used | Number of SGPR used | Amount of LDS used
185 | 103 | 4

Nbr VGPR-limited waves | Nbr SGPR-limited waves | Nbr LDS-limited waves | Nbr of WG-limited waves | Kernel occupancy (%)
4 | 28 | 40 | 40 | 10

In both cases, the maximum number of waves is limited by the high VGPR count.
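The VGPR-limited wave counts in both reports can be reproduced with simple arithmetic. The sketch below assumes the commonly documented GCN/Vega limits (a budget of 256 VGPRs per SIMD lane, at most 10 resident waves per SIMD, 4 SIMDs per compute unit); these constants and the function names are not from the reports above, and the model ignores register allocation granularity.

```c
#include <assert.h>

/* Assumed GCN/Vega limits per compute unit (CU); not taken from rcprof. */
enum {
    VGPRS_PER_LANE     = 256, /* VGPR budget available to each SIMD lane */
    MAX_WAVES_PER_SIMD = 10,  /* hardware cap on resident waves per SIMD */
    SIMDS_PER_CU       = 4    /* SIMD units in one compute unit */
};

/* Waves per CU that fit when each work-item needs `vgprs` registers.
 * Simplification: register allocation granularity is ignored. */
int vgpr_limited_waves_per_cu(int vgprs)
{
    assert(vgprs > 0 && vgprs <= VGPRS_PER_LANE);
    int per_simd = VGPRS_PER_LANE / vgprs;   /* integer division rounds down */
    if (per_simd > MAX_WAVES_PER_SIMD)
        per_simd = MAX_WAVES_PER_SIMD;
    return per_simd * SIMDS_PER_CU;
}

/* Occupancy in percent, relative to the 40-wave maximum per CU. */
int occupancy_percent(int waves_per_cu)
{
    return 100 * waves_per_cu / (MAX_WAVES_PER_SIMD * SIMDS_PER_CU);
}
```

Under these assumptions, 101 VGPRs gives 8 waves per CU (20% occupancy) and 185 VGPRs gives 4 waves (10%), which matches the rcprof figures. By the same model, a kernel would need to stay at or below 128 VGPRs for 2 waves per SIMD and at or below 85 for 3.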

However, what I don't understand is why I am using such a large number of registers. If I compile my voxel code on NVIDIA devices with the -cl-nv-verbose flag, it reports only 64 registers. I am surprised that the AMD compiler requires so many registers (VGPR=101, SGPR=103) for the same kernel.

Also, just by visually inspecting my kernel, I cannot count anywhere near that many registers:

https://github.com/fangq/mcxcl/blob/master/src/mcx_core.cl#L878-L897

Where do all these registers come from? Does the AMD compiler count shared memory buffers and constant memory buffers as registers?

If anyone wants to run a test, here are the commands to reproduce the above reports:

git clone https://github.com/fangq/mcxcl.git  

cd mcxcl/src

make clean

make

cd ../example/qtest

rcprof -o 'benchmarkmcxcl' -w `pwd` -O -p -t -T ../../bin/mcxcl -A -n 1e7 -f qtest.inp

cat benchmarkmcxcl.occupancy

Can anyone take a quick look at my kernel file and tell me where these "extra" VGPRs and SGPRs come from? Then I could at least try some manual optimization.

thanks a lot

2 Replies
FangQ
Adept I

Forgot to mention: the above tests were performed on Vega64 and VegaII using the latest amdgpu-pro driver (rcprof came from CodeXL-2.6-302).


I have a couple of suggestions here:

  • Currently, the AMD OpenCL compiler does not provide a direct way to control register usage. However, in some cases kernel attributes can be used to improve kernel performance. In general, the compiler takes a conservative approach to VGPR allocation: it assumes that the work-group size is 256, i.e. the largest possible work-group size, and therefore limits the maximum number of VGPRs per work-item. To let it allocate more VGPRs, the kernel should use the reqd_work_group_size attribute, which tells the compiler that the kernel is launched with a work-group size smaller than the maximum. This may improve overall performance.
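In OpenCL C the attribute is spelled reqd_work_group_size and is attached to the kernel definition. The following is a minimal sketch, not MCXCL code: the kernel name `scale`, its body, and the work-group size of 64 (one GCN wavefront) are illustrative assumptions, and whether the attribute changes register allocation for a given kernel has to be checked with the profiler. The kernel source is kept as a host-side C string, the form in which it would be passed to clCreateProgramWithSource.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative work-group size: 64 work-items, i.e. one GCN wavefront. */
#define WG_SIZE_X 64
#define STR_(x) #x
#define STR(x) STR_(x)

/* OpenCL C source with the attribute on the kernel definition.
 * The kernel name `scale` and its body are hypothetical. */
static const char kernel_src[] =
    "__kernel __attribute__((reqd_work_group_size(" STR(WG_SIZE_X) ", 1, 1)))\n"
    "void scale(__global float *data, const float factor)\n"
    "{\n"
    "    size_t i = get_global_id(0);\n"
    "    data[i] *= factor;\n"
    "}\n";

/* Accessor for the kernel source string. */
const char *kernel_source(void)
{
    return kernel_src;
}

/* With the attribute in place, the local size passed to
 * clEnqueueNDRangeKernel must match it exactly; a mismatch makes the
 * enqueue fail with CL_INVALID_WORK_GROUP_SIZE. Returns 1 on a match. */
int local_size_matches_reqd(const size_t local[3])
{
    assert(local != NULL);
    return local[0] == WG_SIZE_X && local[1] == 1 && local[2] == 1;
}
```

Checking the local size on the host before enqueueing is a design choice in this sketch; it catches a mismatch early instead of relying on the runtime error code.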

Thanks.
