(Windows 8.1 x64, HD7770, driver: 13.251-131206a-165817C-ATI)
Can anybody explain to me why the following kernel compiles to 140 VGPRs and 32 SGPRs?
From what I can see this kernel should only consume 14 VGPRs. Ok, throw in a few temp registers for calculations and stuff. But how does the compiler manage to spend 140 VGPRs on this code?
This is just one example of many where I'm scratching my head wondering what's going on. Trying to reduce VGPR usage (to increase occupancy) seems to be almost impossible with anything but the most trivial kernels, because the compiler doesn't seem to follow any logic with how it spends VGPRs. At least I don't understand it. If I'm missing something important, please let me know. I'm quite willing to learn and improve my skills.
P.S: Please don't expect this kernel to do something too useful, I've shortened it trying to isolate the problematic code...
__kernel __attribute__((reqd_work_group_size(8, 8, 1)))
void errorDiffusion(__read_only image2d_t srcImg, __write_only image2d_t dstImg)
__local float3 data;
int index = get_global_id(0) * 8 + get_global_id(1);
int posx = get_global_id(0) * 8;
int posy = get_global_id(1) * 8;
for (int i1 = 0; i1 < 16; i1++)
float3 pix, rounded, error;
for (int i2 = 0; i2 < 8; i2++)
data[index][i2] += read_imagef(srcImg, srcSampler, (int2) (posx + i2, posy)).s012 * 255.0f;
pix = data[index];
rounded = round(pix);
error = pix - rounded;
write_imagef(dstImg, (int2) (posx, posy), (float4) (rounded / 255.0f, 0));
pix = data[index] + error * 0.777;
data[index] = error * 0.7861;
data[index] = error * -0.6098;
for (int i2 = 1; i2 < 8; i2++)
rounded = round(pix);
error = pix - rounded;
write_imagef(dstImg, (int2) (posx + i2, posy), (float4) (rounded / 255.0f, 0));
pix = data[index][i2 + 1] + error * 0.777;
data[index][i2 - 1] += error * 0.0090;
data[index][i2 ] += error * 0.7861;
data[index][i2 + 1] = error * -0.6098;
Thanks for the suggestion. I've tried "#pragma disable unroll" and "#pragma unroll 0". Compilation failed for both. Then I tried "#pragma nounroll" and it compiled fine. But it didn't make any difference to VGPR/SGPR usage.
Edit: Or did you mean to enable unroll instead of disabling it? Tried that, too. It reduced the VGPR usage to 102 registers instead of 140. It's a nice decrease, but still *far* from making any sense. With this specific kernel, I would expect a VGPR usage of maybe 20-25.
Also, I don't understand why forcing a loop to unroll would reduce VGPR usage from 140 to 102? Loop counters should be SGPRs. So unrolling loops should decrease SGPR usage but not VGPR usage, or am I wrong? The AMD "OpenCL Dos and Don'ts" document seems to agree. It says:
> Loop unrolling can be used to improve performance
> by removing overhead of branching
The AMD document doesn't suggest in any way that unrolling would reduce VGPR usage.
to disable unroll you must use #rpagma unroll 1 more in OpenCL AMD APP OpenCL programing guide section 5.8.1.
another possibility is that you specify 64 as required work-size. it is possible that compiler doesn't optimize register usage as there is plenty of them. you can try remove it or specify 256 as required work size so compiler is forced to conserve registers.
also why it this problem? as long there are no scratch register it shouldn't matter how many are used.
Ah, I see. I've tried "#pragma unroll 1" now, I've also tried using 256 as required work size, but neither helps. VGPR usage stays the same.
I thought that increasing occupancy would improve performance. Does it not? The reason I'm trying to limit VGPR usage is that CodeXL reports that VGPR limits occupancy to 10%.
had the similar problem: VGPRs as intermediate storage
Compiler likes to use VGPR for every intermediate result like A+B if this A+B used more than once.
This maybe driver dependent (tested on 13.9), but you could try to fool compiler by using different expressions like
error * 0.0090 for the first time
and mad(error, 0.0090, 1) - 1 for the second time
setting required work group size to higher values also helps sometimes.
And sometimes there are no way to predict usage or understand logic.
On one my kernel i packed all my short variables into two short8 vectors and decreased VGPR number by 20.
On another kernel (very similar to the first) it gave +10 VGPRs.
Hmmmm... Interesting! Like you say in your thread, it might not make much sense to try to optimize, because with the current AMD compiler/optimizer, it's like shooting in the dark.
Did you solve this problem? I had the same problem in my sample code:
I am actually looking at a program which uses 255 VGPR according to kernel analyzer and it runs 10 times slower on an AMD GPU compared to an Nvidia GPU. I am also getting 10% occupancy is there any solutions?
Before with 13.9 kernel used 99 VGPRs (and it is high, must be lower)
13.11 and 13.12 won't compile correctly at all.
After upgrade to 14.1 the same kernel uses 135 VGPRs.
This is nice, 2-3 driver upgrades more and empty kernels will generate scratch regs...
May be some magic like placing comands that do nothing will lower VGPR usage. There were such moments with 13.9, but they were totally unpredictable.
Reminds me of a blog post I wrote 2 years ago: "The ‘optimising’ compiler in the AMD APP SDK can essentially be treated as a random function". See at AMD’s OpenCL heaven and hell | Wonderings of a SAT geek. Sadly most of what's there is still true. Except, maybe, that they now also miscompile bitwise operations, which is a new, most welcome addition.