20 Replies Latest reply on May 6, 2015 4:39 AM by jason

    VGPR usage question (code attached)

    yurtesen

      I have a simple test kernel which I compile in kernelanalyzer2. It says on Tahiti it will use 110 VGPR. I tried to put #pragma unroll 1 and it does not have any effect at all. Is there a known way to avoid compiler from using so many registers? ( keep in mind this is a dummy test kernel, but this seems to effect an actual kernel and reduce occupancy)

       

      __kernel void test (__global double *distrValues, __global double *distrValuesOut) {
          __private const int id = get_global_id(0);
          __private double den2;
            
        __private int i;
      
         for (i=0;i<53;i++) {
             den2 += distrValues[i];
         }
          distrValuesOut[id]=den2;
      }
      

       

       

      Thanks!

        • Re: VGPR usage question (code attached)
          dmeiser

          You could pass in the loop count 53 as a parameter to the kernel. That should prevent automatic unrolling. If you unroll by hand you should be able to balance loop overhead and register usage in a pretty controlled way.

          1 of 1 people found this helpful
            • Re: VGPR usage question (code attached)
              yurtesen

              Actually that was what I tried to do. It seemed to work in the kernel analyzer in the test program only. However in the real program VGPRs went from 82 to 124 in kernel analyzer, and CodeXL says actual runs use 196 VGPRS after the change. I lost performance because occupancy went down to 10% due to increased VGPR usage.

               

              I passed the int as "__global int *myint" so I am able to switch the value *myint and 19 in the loop and immediately see the difference.

                • Re: VGPR usage question (code attached)
                  dmeiser

                  Did you try passing the loop count in as "int".

                   

                  Have you checked the generated isa to see if the loop is still unrolled even with a loop count that is unknown at compile time?

                    • Re: VGPR usage question (code attached)
                      yurtesen

                      The IL code appears to be NOT unrolled in any case (I compared with setting unroll pragma to a larger value), but ISA code seem to load a pile of values to registers then start adding them. But actual unrolled ISA code also looks little different.

                       

                      It stops using so many registers if the loop counter is over 53 (in the test case), then it does use so much less registers (can see it in Kernel Analyzer easily also). All of a sudden the amount of registers used goes down to 20 something...

                       

                      I will try int, but I am not sure what difference it may make. Is this a bug? It seems quite strange that compiler is trying to use so many registers, especially when I tell it to specifically not unroll.

                       

                      Thanks,

                      Evren

                        • Re: VGPR usage question (code attached)
                          dmeiser

                          yurtesen wrote:

                           

                          I will try int, but I am not sure what difference it may make. Is this a bug? It seems quite strange that compiler is trying to use so many registers, especially when I tell it to specifically not unroll.

                           

                          Agreed. I too have found the compiler to be way too aggressive when it comes to loop unrolling on several occasions. It would be nice if there was an easier (and more explicit) way to control the level of unrolling.

                            • Re: VGPR usage question (code attached)
                              yurtesen

                              I changed kernel argument "__global int *myint" to "__private int myint" and changed *myint to myint in for loop (obviously). and register usage went from 124 to 132. (from Kernel Analyzer)

                               

                              I removed it from being a kernel argument and I tried setting const int myint = get_global_id(); (also without const) and register usage went up to 134 (from Kernel Analyzer)

                               

                              If I change from #pragma unroll 1 to #pragma unroll 2 then it goes back to 124 registers from 134... (strange since unrolling should probably increase register usage?)

                               

                              Strangely on Cypress it says register usage floats around 28 - 35 VGPR.

                               

                              I dont know what to do about this who should we inform?

                                • Re: VGPR usage question (code attached)
                                  dmeiser

                                  yurtesen wrote:

                                   

                                  I dont know what to do about this who should we inform?

                                  Micah Villmov does a fair bit of development on the opencl compiler. I'm sure he could shed some light on why the compiler behaves the way it does.

                                  • Re: VGPR usage question (code attached)
                                    himanshu.gautam

                                    Hi yurtsen,

                                    It may be asking too much, but can you attach a small testcase.

                                     

                                    Anyways maybe if you disable the kernel optimizations, you might be able to see more consistent behaviour in the experiments you are doing. I will ask a some with more expertise, to help you too.

                                      • Re: VGPR usage question (code attached)
                                        yurtesen

                                        Hi Himanshu,

                                         

                                        Isnt the kernel code I provided simple enough? You can simply compile it using KernelAnalyzer2 for Tahiti. The problem is the generated ISA code, so no execution should be necessary for investigation.

                                         

                                        If I compile the code in first thread for Tahiti it uses 112 VGPRs. If I change loop limit to 54, it simply uses 20 VGPRs. Dont you think something is strange here? If I change the code as follows, it uses only 8 VGPRs !!! (although this last one does not seem to have the same effect in my actual program for some reason)

                                         

                                        __kernel void test (__global double *distrValues, __global double *distrValuesOut, __global int *j) {
                                               __private const int id = get_global_id(0);
                                               __private double den2;
                                        
                                               __private int i;
                                        
                                               for (i=0;i<*j;i++) {
                                                   den2 += distrValues[i];
                                               }
                                               distrValuesOut[id]=den2;
                                        }
                                        

                                         

                                        Thanks,

                                        Evren

                                          • Re: VGPR usage question (code attached)
                                            himanshu.gautam

                                            Hi yurtsen,

                                             

                                             

                                            If I compile the code in first thread for Tahiti it uses 112 VGPRs. If I change loop limit to 54, it simply uses 20 VGPRs. Dont you think something is strange here? If I change the code as follows, it uses only 8 VGPRs !!! (although this last one does not seem to have the same effect in my actual program for some reason)

                                             

                                            you are right. I can also see these observations. I will write a small host code and check actual performance on running this kernel. Also I will forward it to a appropriate team.

                                              • Re: VGPR usage question (code attached)
                                                yurtesen

                                                himanshu.gautam wrote:

                                                you are right. I can also see these observations. I will write a small host code and check actual performance on running this kernel. Also I will forward it to a appropriate team.

                                                Independent of performance, I would like the compiler to obey unroll pragmas (and I believe this is a reasonable request?). I dont care much what it does if there is no pragma of course if this gives worse performance due to compiler aggressiveness, it should be fixed.

                                                • Re: VGPR usage question (code attached)
                                                  jason

                                                  bumping - the compiler does indeed get too register hungry and we need to be able to limit it to achieve better occupancy.  Yes better occupancy does not necessarily mean better performance but it is an important knob... especially if the compiler sucks VGR allocation.

                                                    • Re: VGPR usage question (code attached)
                                                      Linuxhippy

                                                      I noticed exacly the same issues with my kernel - which contains loops with trip counts known at compile time.

                                                      The compiler does a horribly job balancing unrolling and register use (it even explodes below the recommended code size).

                                                       

                                                      I would prefer better heuristics to make the compiler smarter when and how aggressive to unroll, but for the time being I could live with the unroll-1 pragma not being ignored...

                                                       

                                                      Br, Clemens

                                  • Re: VGPR usage question (code attached)
                                    yurtesen

                                    I am not sure if AMD is interested in fixing these problems. I would like to see possibility for telling compiler to limit number of available registers, so I can balance between register spills and occupancy. (since compiler seems to be doing a terrible job at that...)