OS: Kubuntu 12.04 x64
Driver: Catalyst 13.4
I know this sounds crazy, but it is actually true: increasing the number of kernel arguments beyond a certain point causes performance to drop. In one of my kernels, moving from 6 to 7 kernel arguments drops performance by 30%, even though I don't do any computation with the new argument. I can't reduce the number of arguments by assembling them into a struct, because all of them are dynamic arrays of variable length. Is there a way to avoid this problem?
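One workaround that sometimes helps is to concatenate the variable-length arrays into a single buffer on the host and pass a small offsets table, so the kernel sees two arguments instead of one per array. A host-side sketch (`pack_buffers` is a hypothetical helper, not part of any OpenCL API):

```python
# Host-side sketch: flatten several variable-length arrays into one
# buffer plus an offsets table, so the kernel takes 2 arguments
# (data + offsets) instead of one buffer per array.
def pack_buffers(arrays):
    flat, offsets = [], []
    pos = 0
    for a in arrays:
        offsets.append(pos)   # where array i starts inside 'flat'
        flat.extend(a)
        pos += len(a)
    return flat, offsets

flat, offsets = pack_buffers([[1, 2, 3], [4, 5], [6, 7, 8, 9]])
print(offsets)  # [0, 3, 5]
```

Inside the kernel, element j of array i would then be addressed as data[offsets[i] + j]. Note the trade-off: fewer arguments, but one extra indirection per access.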
On GCN, if you're using (not just declaring, but actually using) more than two buffers, the kernel code performs an additional indirect read that fetches the buffer resource constant and the 64-bit base offset from the 'initial' buffer resources the kernel receives from outside when it starts. There are only 3 of those, so if the arguments can't fit in 3, the OpenCL driver packs the required information into small arrays and loads it manually at the start of the kernel (check the ISA output with various numbers of parameters!).
If your test kernel doesn't do much work, the overall slowdown caused by this can be big. If you write a longer kernel in the future, this will be no problem.
If I had to guess, I would say the compiler has decided to spill some of the GPRs to global memory, thus leading to the performance drop.
You can use KernelAnalyzer2 to inspect the ISA for such a scenario (look for ScratchSize > 0):
NumVgprs = 76;
NumSgprs = 72;
FloatMode = 192;
IeeeMode = 0;
ScratchSize = 192;
One key difference I noted is that increasing the number of arguments increases the VGPR count. For instance, in my case the VGPR count was 127 with 6 args but 135 with 7 args. I think this is what makes the difference: the number of in-flight wavefronts gets halved in the second case.
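That halving can be checked with back-of-the-envelope arithmetic. A sketch, assuming first-generation GCN figures (256 VGPRs per SIMD, at most 10 wavefronts in flight per SIMD, and every in-flight wave needing its own copy of the kernel's VGPRs):

```python
# Rough GCN occupancy estimate from VGPR usage alone (a sketch,
# not a replacement for a profiler).
def waves_per_simd(vgprs_per_workitem, total_vgprs=256, max_waves=10):
    if vgprs_per_workitem == 0:
        return max_waves
    return min(max_waves, total_vgprs // vgprs_per_workitem)

print(waves_per_simd(127))  # 2 waves in flight
print(waves_per_simd(135))  # 1 wave: occupancy halved by 8 extra VGPRs
```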
In the above attached kernel you can add dummy arguments to see the effect of increasing kernel arguments with KernelAnalyzer2. To the kernel md5_self_test(), which has 3 kernel args, I added a few (5 extra) dummy kernel arguments, which increased VGPR usage from 65 to 105. Is KernelAnalyzer2 reporting this correctly?
Also, these files are required to run KernelAnalyzer:
I could not see any VGPR usage increase after adding some dummy arguments. I always got 43 VGPRs for hd7xxx compilation, using KernelAnalyzer 2.1.880.
BTW, opencl_rawmd5_fmt was missing, so I copied it directly from the github repo.
In CUDA we have a compiler option that lets you specify the number of registers per work-item for the compiler to allocate. It uses scratch registers only when the number of registers needed exceeds the specified one, and it uses its own heuristics if the option is not given. Hence the programmer can decide when to sacrifice the number of waves in flight for more registers per work-item (and avoid the big slowdown of scratch registers, which is even worse than a drop in the number of waves in flight).
Can we have the same possibility in the OpenCL compiler, via some compiler option or an environment variable read by the OpenCL runtime?
There is no such compiler option provided by AMD. However I am unable to see how such an option could be helpful.
The maximum registers per work-item is limited by the hardware, and the compiler option -maxrregcount can specify a count lower than this hardware limit. Let us assume the hardware limit is NMax, the compiler option is -maxrregcount=N, and the kernel actually uses M registers per work-item. If M < N, the wavefronts (warps) per CU (or SM) will be the same with or without the compiler option. If M > N, occupancy (wavefronts per CU) is higher, but at the expense of pushing the excess data to global memory (or L1 in the best case). Even with increased occupancy, since that data is now being fetched from global memory, a performance gain cannot be expected.
On the other hand, if we want to limit registers per work-item, the kernel could be rewritten to use at most N registers and push the rest of the data to shared/global memory. Such code explicitly states our intent to increase occupancy, instead of specifying it obliquely through a compiler option, which is in danger of being overlooked during any future modification of the kernel code.
Sometimes one needs to decrease occupancy to have more registers available. That's when this option will help.
From my recent experiments I found that the AMD compiler tries to keep 8 waves in flight even if some scratch registers will be used.
Specifying a big register count in such an option would tell the compiler not to use scratch registers and to sacrifice occupancy for more registers per work-item. Sometimes that is needed.
Right now there is no way to tell the compiler this.
And, as I wrote in the other thread, even artificially limiting the number of waves to 4 (a single workgroup of 256 work-items) by using the whole LDS per workgroup will not change the compiler's register allocation. It still generates register-spilling code as if 8 waves could be used.
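The mismatch described can be sketched as a min-over-limits calculation. Assumed GCN-like figures (4 SIMDs per CU, 64 work-items per wave, 256 VGPRs per SIMD, 64 KB LDS per CU); this is a simplification, since real hardware has further limits:

```python
# Occupancy per CU is the minimum over independent resource limits;
# here only VGPRs and LDS are modeled.
def waves_per_cu(vgprs_per_item, lds_bytes_per_wg, items_per_wg):
    waves_per_wg = -(-items_per_wg // 64)           # ceil(items / wave size)
    by_vgprs = min(10, 256 // vgprs_per_item) * 4   # 4 SIMDs per CU
    wgs_by_lds = 65536 // lds_bytes_per_wg          # workgroups that fit in LDS
    by_lds = wgs_by_lds * waves_per_wg
    return min(by_vgprs, by_lds)

# One 256-item workgroup consuming all LDS: only 4 waves fit on the CU,
# no matter how few registers the kernel uses -- yet the allocator may
# still plan for 8 waves and emit spill code.
print(waves_per_cu(vgprs_per_item=40, lds_bytes_per_wg=65536, items_per_wg=256))
```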