Aside from the detrimental memory latency issue I reported in this thread, I also noticed that my OpenCL code on AMD GPUs suffers from high VGPR usage.
For the voxel-based Monte Carlo simulator, MCXCL (https://github.com/fangq/mcxcl), rcprof reports the metrics below:
Number of VGPR used | Number of SGPR used | Amount of LDS used (bytes)
101                 | 103                 | 260

Nbr VGPR-limited waves | Nbr SGPR-limited waves | Nbr LDS-limited waves | Nbr of WG-limited waves | Kernel occupancy (%)
8                      | 28                     | 40                    | 40                      | 20
For my mesh-based simulator, the occupancy is even worse:
Number of VGPR used | Number of SGPR used | Amount of LDS used (bytes)
185                 | 103                 | 4

Nbr VGPR-limited waves | Nbr SGPR-limited waves | Nbr LDS-limited waves | Nbr of WG-limited waves | Kernel occupancy (%)
4                      | 28                     | 40                    | 40                      | 10
In both cases, the maximum number of active waves is limited by the high VGPR count.
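If my understanding of the GCN occupancy math is correct (I am assuming 256 VGPRs per SIMD lane, VGPR allocation in multiples of 4, 4 SIMDs per CU, and a 10-wave-per-SIMD / 40-wave-per-CU maximum), the reported numbers are self-consistent:

voxel kernel: 101 VGPRs -> allocated 104 -> floor(256/104) = 2 waves/SIMD -> 2 x 4 = 8 waves/CU -> 8/40 = 20% occupancy
mesh kernel:  185 VGPRs -> allocated 188 -> floor(256/188) = 1 wave/SIMD  -> 1 x 4 = 4 waves/CU -> 4/40 = 10% occupancy

so, under those assumptions, getting VGPR usage down to 84 (3 waves/SIMD) or 64 (4 waves/SIMD) would be the next occupancy steps.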
However, the part I don't understand is why the kernel needs such a large number of registers in the first place. If I compile the same voxel kernel on NVIDIA devices with the
-cl-nv-verbose
flag, the build log reports only 64 registers. I am surprised that the AMD compiler requires so many (VGPR=101, SGPR=103) for the same kernel.
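For reference, this is roughly how I pull that number out on the NVIDIA side (a minimal sketch rather than the actual MCXCL host code; the program/device handles are assumed to already exist and error checking is omitted):

#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

/* Build with -cl-nv-verbose and print the build log; on NVIDIA's OpenCL the log
   contains the ptxas resource-usage lines (register count, lmem/cmem usage). */
void print_nv_build_log(cl_program program, cl_device_id device)
{
    size_t logsize = 0;
    char *log;

    clBuildProgram(program, 1, &device, "-cl-nv-verbose", NULL, NULL);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &logsize);
    log = (char *)malloc(logsize + 1);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, logsize, log, NULL);
    log[logsize] = '\0';
    printf("%s\n", log);
    free(log);
}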
Also, by simply inspecting the kernel visually, I cannot account for that many registers -
https://github.com/fangq/mcxcl/blob/master/src/mcx_core.cl#L878-L897
Where do all these registers come from? Does the AMD compiler count the shared (local) memory buffer and the constant memory buffer toward the register total?
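In case it helps narrow this down, one thing I can query through the standard OpenCL API is the kernel's reported private (scratch) and local memory usage, to see whether the compiler is spilling registers. This is only a sketch with placeholder kernel/device handles, not code from the MCXCL repo:

#include <CL/cl.h>
#include <stdio.h>

/* Query per-kernel resource usage as reported by the runtime. */
void print_kernel_resources(cl_kernel kernel, cl_device_id device)
{
    cl_ulong privmem = 0, localmem = 0;
    size_t wgsize = 0;

    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                             sizeof(privmem), &privmem, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof(localmem), &localmem, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(wgsize), &wgsize, NULL);

    printf("private mem per work-item : %llu bytes\n", (unsigned long long)privmem);
    printf("local mem per work-group  : %llu bytes\n", (unsigned long long)localmem);
    printf("max work-group size       : %zu\n", wgsize);
}

A nonzero CL_KERNEL_PRIVATE_MEM_SIZE would at least tell me whether register pressure is high enough to cause spilling, as opposed to the LDS/constant buffers somehow being counted against registers.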
If anyone wants to do a test, here are the commands to reproduce the above report:
git clone https://github.com/fangq/mcxcl.git
cd mcxcl/src
make clean
make
cd ../example/qtest
rcprof -o 'benchmarkmcxcl' -w `pwd` -O -p -t -T ../../bin/mcxcl -A -n 1e7 -f qtest.inp
cat benchmarkmcxcl.occupancy
Can anyone take a quick look at my kernel file and tell me where these "extra" VGPRs and SGPRs come from? With that I could probably at least do some manual optimization.
thanks a lot