Aside from the detrimental memory latency issue I reported in this thread, I also noticed that my OpenCL code on AMD GPUs suffers from high VGPR usage.
For the voxel-based Monte Carlo simulator, MCXCL (https://github.com/fangq/mcxcl), rcprof reports the metrics below:
Number of VGPR used | Number of SGPR used | Amount of LDS used (bytes)
101                 | 103                 | 260

Nbr VGPR-limited waves | Nbr SGPR-limited waves | Nbr LDS-limited waves | Nbr of WG-limited waves | Kernel occupancy (%)
8                      | 28                     | 40                    | 40                      | 20
For my mesh-based simulator, the occupancy is even worse:
Number of VGPR used | Number of SGPR used | Amount of LDS used (bytes)
185                 | 103                 | 4

Nbr VGPR-limited waves | Nbr SGPR-limited waves | Nbr LDS-limited waves | Nbr of WG-limited waves | Kernel occupancy (%)
4                      | 28                     | 40                    | 40                      | 10
In both cases, the maximum number of active waves is limited by the high VGPR count.
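If my understanding of the GCN occupancy math is correct (I am assuming 256 VGPRs per SIMD lane, VGPR allocation in multiples of 4, 4 SIMDs per CU, and a 10-wave-per-SIMD / 40-wave-per-CU maximum), the reported numbers are self-consistent:

voxel kernel: 101 VGPRs -> allocated 104 -> floor(256/104) = 2 waves/SIMD -> 2 x 4 = 8 waves/CU -> 8/40 = 20% occupancy
mesh kernel:  185 VGPRs -> allocated 188 -> floor(256/188) = 1 wave/SIMD  -> 1 x 4 = 4 waves/CU -> 4/40 = 10% occupancy

so, under those assumptions, getting VGPR usage down to 84 (3 waves/SIMD) or 64 (4 waves/SIMD) would be the next occupancy steps.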
However, the part I don't understand is why the kernel needs such a large number of registers in the first place. If I compile the same voxel kernel on NVIDIA devices with the
-cl-nv-verbose
flag, the build log reports only 64 registers. I am surprised that the AMD compiler requires so many (VGPR=101, SGPR=103) for the same kernel.
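For reference, this is roughly how I pull that number out on the NVIDIA side (a minimal sketch rather than the actual MCXCL host code; the program/device handles are assumed to already exist and error checking is omitted):

#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

/* Build with -cl-nv-verbose and print the build log; on NVIDIA's OpenCL the log
   contains the ptxas resource-usage lines (register count, lmem/cmem usage). */
void print_nv_build_log(cl_program program, cl_device_id device)
{
    size_t logsize = 0;
    char *log;

    clBuildProgram(program, 1, &device, "-cl-nv-verbose", NULL, NULL);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &logsize);
    log = (char *)malloc(logsize + 1);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, logsize, log, NULL);
    log[logsize] = '\0';
    printf("%s\n", log);
    free(log);
}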
Also, by simply inspecting the kernel visually, I cannot account for that many registers -
https://github.com/fangq/mcxcl/blob/master/src/mcx_core.cl#L878-L897
Where do all these registers come from? Does the AMD compiler count the shared (local) memory buffer and the constant memory buffer toward the register total?
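In case it helps narrow this down, one thing I can query through the standard OpenCL API is the kernel's reported private (scratch) and local memory usage, to see whether the compiler is spilling registers. This is only a sketch with placeholder kernel/device handles, not code from the MCXCL repo:

#include <CL/cl.h>
#include <stdio.h>

/* Query per-kernel resource usage as reported by the runtime. */
void print_kernel_resources(cl_kernel kernel, cl_device_id device)
{
    cl_ulong privmem = 0, localmem = 0;
    size_t wgsize = 0;

    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                             sizeof(privmem), &privmem, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof(localmem), &localmem, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(wgsize), &wgsize, NULL);

    printf("private mem per work-item : %llu bytes\n", (unsigned long long)privmem);
    printf("local mem per work-group  : %llu bytes\n", (unsigned long long)localmem);
    printf("max work-group size       : %zu\n", wgsize);
}

A nonzero CL_KERNEL_PRIVATE_MEM_SIZE would at least tell me whether register pressure is high enough to cause spilling, as opposed to the LDS/constant buffers somehow being counted against registers.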
If anyone wants to do a test, here are the commands to reproduce the above report:
git clone https://github.com/fangq/mcxcl.git
cd mcxcl/src
make clean
make
cd ../example/qtest
rcprof -o 'benchmarkmcxcl' -w `pwd` -O -p -t -T ../../bin/mcxcl -A -n 1e7 -f qtest.inp
cat benchmarkmcxcl.occupancy
Can anyone take a quick look at my kernel file and tell me where these "extra" VGPRs and SGPRs come from? With that I could probably at least do some manual optimization.
thanks a lot