
yurtesen
Miniboss

VGPR usage question (code attached)

I have a simple test kernel which I compile in KernelAnalyzer2. It says that on Tahiti it will use 110 VGPRs. I tried to put #pragma unroll 1 and it has no effect at all. Is there a known way to prevent the compiler from using so many registers? (Keep in mind this is a dummy test kernel, but the same thing seems to affect an actual kernel and reduce occupancy.)

__kernel void test (__global double *distrValues, __global double *distrValuesOut) {
    __private const int id = get_global_id(0);
    __private double den2 = 0.0; /* initialized; the posted snippet left it undefined */
    __private int i;
    for (i = 0; i < 53; i++) {
        den2 += distrValues[i]; /* assuming index [i]; the posted snippet had no index */
    }
    distrValuesOut[id] = den2;
}
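For reference, the #pragma unroll 1 variant I tried places the pragma directly before the loop (a sketch; it made no difference to the reported VGPR count):

    #pragma unroll 1
    for (i = 0; i < 53; i++) {
        den2 += distrValues[i];
    }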

Thanks!

0 Likes
20 Replies
dmeiser
Elite

You could pass the loop count 53 in as a parameter to the kernel. That should prevent automatic unrolling. If you unroll by hand, you should be able to balance loop overhead and register usage in a fairly controlled way.
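Something along these lines (a sketch, not tested; the parameter name n is made up here):

__kernel void test (__global double *distrValues, __global double *distrValuesOut, const int n) {
    const int id = get_global_id(0);
    double den2 = 0.0;
    int i;
    /* the trip count is unknown at compile time, so the compiler cannot fully unroll */
    for (i = 0; i < n; i++) {
        den2 += distrValues[i];
    }
    distrValuesOut[id] = den2;
}

On the host side the count would be set with clSetKernelArg(kernel, 2, sizeof(cl_int), &n).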

Actually, that is what I tried to do. It seemed to work in the kernel analyzer, but only for the test program. In the real program, VGPRs went from 82 to 124 in Kernel Analyzer, and CodeXL says actual runs use 196 VGPRs after the change. I lost performance because occupancy went down to 10% due to the increased VGPR usage.

I passed the int as "__global int *myint", so I can switch between *myint and 19 in the loop and immediately see the difference.

0 Likes

Did you try passing the loop count in as "int"?

Have you checked the generated ISA to see whether the loop is still unrolled even with a loop count that is unknown at compile time?
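For completeness, the compiled binary can also be pulled out of the runtime with core OpenCL calls (on AMD's stack the returned blob is an ELF that embeds the IL and ISA sections; KernelAnalyzer2 shows the same disassembly without any host code). A sketch for a single-device program:

    /* after clBuildProgram() has succeeded */
    size_t binSize = 0;
    clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(binSize), &binSize, NULL);
    unsigned char *bin = (unsigned char *)malloc(binSize);
    clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(bin), &bin, NULL);
    FILE *f = fopen("kernel.elf", "wb");
    fwrite(bin, 1, binSize, f);
    fclose(f);
    free(bin);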

0 Likes

The IL code appears NOT to be unrolled in any of the cases (I compared against setting the unroll pragma to a larger value), but the ISA code seems to load a pile of values into registers and then start adding them. The actually-unrolled ISA code also looks a little different.

It stops using so many registers once the loop count is over 53 (in the test case); all of a sudden the number of registers used drops to around 20 (easy to see in Kernel Analyzer as well).

I will try int, but I am not sure what difference it would make. Is this a bug? It seems quite strange that the compiler uses so many registers, especially when I specifically tell it not to unroll.

Thanks,

Evren

0 Likes

yurtesen wrote:

I will try int, but I am not sure what difference it would make. Is this a bug? It seems quite strange that the compiler uses so many registers, especially when I specifically tell it not to unroll.

Agreed. I too have found the compiler to be way too aggressive with loop unrolling on several occasions. It would be nice if there were an easier (and more explicit) way to control the level of unrolling.

0 Likes

I changed the kernel argument "__global int *myint" to "__private int myint" and changed *myint to myint in the for loop (obviously), and register usage went from 124 to 132 (per Kernel Analyzer).

I removed it as a kernel argument and tried setting const int myint = get_global_id(0); (also without const), and register usage went up to 134 (per Kernel Analyzer).

If I change #pragma unroll 1 to #pragma unroll 2, it goes back from 134 to 124 registers... (strange, since more unrolling should presumably increase register usage?)

Strangely, on Cypress the reported register usage floats around 28-35 VGPRs.

I don't know what to do about this. Whom should we inform?

0 Likes

yurtesen wrote:

I don't know what to do about this. Whom should we inform?

Micah Villmow does a fair bit of development on the OpenCL compiler. I'm sure he could shed some light on why the compiler behaves the way it does.

0 Likes

Hi yurtesen,

It may be asking too much, but could you attach a small test case?

Anyway, if you disable the kernel optimizations, you might see more consistent behaviour in your experiments. I will also ask someone with more expertise to help you.
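The standard knob for that is the -cl-opt-disable build option (a core OpenCL option; KernelAnalyzer2 should accept the same string in its build-options field):

    /* sketch: rebuild the program with all optimizations off */
    cl_int err = clBuildProgram(program, 1, &device, "-cl-opt-disable", NULL, NULL);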

0 Likes

Hi Himanshu,

Isn't the kernel code I provided simple enough? You can simply compile it with KernelAnalyzer2 for Tahiti. The problem is in the generated ISA code, so no execution should be necessary for the investigation.

If I compile the code in the first thread for Tahiti, it uses 112 VGPRs. If I change the loop limit to 54, it uses just 20 VGPRs. Don't you think something is strange here? If I change the code as follows, it uses only 8 VGPRs!!! (Although this last one does not seem to have the same effect in my actual program, for some reason.)

__kernel void test (__global double *distrValues, __global double *distrValuesOut, __global int *j) {
    __private const int id = get_global_id(0);
    __private double den2 = 0.0; /* initialized; the posted snippet left it undefined */
    __private int i;
    for (i = 0; i < *j; i++) {
        den2 += distrValues[i]; /* assuming index [i]; the posted snippet had no index */
    }
    distrValuesOut[id] = den2;
}

Thanks,

Evren

0 Likes

Hi yurtesen,

yurtesen wrote:

If I compile the code in the first thread for Tahiti, it uses 112 VGPRs. If I change the loop limit to 54, it uses just 20 VGPRs. Don't you think something is strange here? If I change the code as follows, it uses only 8 VGPRs!!! (Although this last one does not seem to have the same effect in my actual program, for some reason.)

You are right; I can see the same behaviour. I will write a small host code and check the actual performance of this kernel. I will also forward it to the appropriate team.

0 Likes

himanshu.gautam wrote:

You are right; I can see the same behaviour. I will write a small host code and check the actual performance of this kernel. I will also forward it to the appropriate team.

Independent of performance, I would like the compiler to obey unroll pragmas (and I believe that is a reasonable request?). I don't care much what it does when there is no pragma; of course, if that leads to worse performance due to compiler aggressiveness, it should be fixed too.

0 Likes

Bumping - the compiler does indeed get too register-hungry, and we need to be able to limit it to achieve better occupancy. Yes, better occupancy does not necessarily mean better performance, but it is an important knob... especially when the compiler is bad at VGPR allocation.

I noticed exactly the same issues with my kernel, which contains loops with trip counts known at compile time.

The compiler does a horrible job of balancing unrolling and register use (register usage explodes even for kernels below the recommended code size).

I would prefer better heuristics that make the compiler smarter about when and how aggressively to unroll, but for the time being I could live with the unroll 1 pragma not being ignored...

Br, Clemens

0 Likes
yurtesen
Miniboss

I am not sure AMD is interested in fixing these problems. I would like to see a way to tell the compiler to limit the number of available registers, so I can balance register spills against occupancy myself (since the compiler seems to be doing a terrible job of that...).

Hi,

In a previous post you mentioned:

"I am actually looking at a program which uses 255 VGPRs according to Kernel Analyzer, and it runs 10 times slower on an AMD GPU compared to an NVIDIA GPU. I am also getting 10% occupancy; are there any solutions?"

Did you ever manage to figure out the problem? I have a very similar kernel with 231 VGPRs, and the occupancy is 10%. I am just curious to know whether you found a solution.

0 Likes

@pbani there are some strategies that can make a difference, but you can only influence the compiler's heuristics, not the VGPR usage directly. I have plenty of routines, new and old, that are likewise limited because the compiler keeps too many intermediate calculations around where it should restrict itself for better occupancy - which, for this problem, will usually result in faster kernel invocations.

0 Likes

Thank you @jason. Could you provide some examples, please?

0 Likes

Here's a thread that touches on a few of these: http://devgurus.amd.com/thread/169103

I have another question.

I realized that when I set the local work-group size to 16, the runtime is half of what it is at 64 or any other number. What could that mean? It's a Tahiti.

0 Likes

@pbani - no idea. Try using the reqd_work_group_size attribute and see whether it has any influence at 64. If it doesn't, analyze the situation with AMD's CodeXL GPU profiler.
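The attribute goes on the kernel definition, e.g. (a sketch; the size 16 just mirrors the case that happened to be fast for you):

__kernel __attribute__((reqd_work_group_size(16, 1, 1)))
void test (__global double *in, __global double *out) {
    /* with the work-group size fixed at compile time, the compiler can
       sometimes make better scheduling and register-allocation decisions */
    const int id = get_global_id(0);
    out[id] = in[id];
}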

0 Likes