9 Replies Latest reply on Feb 2, 2014 7:23 AM by msoos

    AMD OpenCL compiler wasting VGPRs like crazy


      (Windows 8.1 x64, HD7770, driver: 13.251-131206a-165817C-ATI)


      Can anybody explain to me why the following kernel compiles to 140 VGPRs and 32 SGPRs?


      From what I can see this kernel should only consume 14 VGPRs. Ok, throw in a few temp registers for calculations and stuff. But how does the compiler manage to spend 140 VGPRs on this code?


      This is just one example of many where I'm scratching my head wondering what's going on. Trying to reduce VGPR usage (to increase occupancy) seems to be almost impossible with anything but the most trivial kernels, because the compiler doesn't seem to follow any logic with how it spends VGPRs. At least I don't understand it. If I'm missing something important, please let me know. I'm quite willing to learn and improve my skills.




      P.S: Please don't expect this kernel to do something too useful, I've shortened it trying to isolate the problematic code...



      __kernel __attribute__((reqd_work_group_size(8, 8, 1)))

      void errorDiffusion(__read_only image2d_t srcImg, __write_only image2d_t dstImg)


        __local float3 data[64][8];

        int index = get_global_id(0) * 8 + get_global_id(1);

        int posx = get_global_id(0) * 8;

        int posy = get_global_id(1) * 8;


        for (int i1 = 0; i1 < 16; i1++)


          float3 pix, rounded, error;

          for (int i2 = 0; i2 < 8; i2++)

            data[index][i2] += read_imagef(srcImg, srcSampler, (int2) (posx + i2, posy)).s012 * 255.0f;

          pix = data[index][0];

          rounded = round(pix);

          error = pix - rounded;

          write_imagef(dstImg, (int2) (posx, posy), (float4) (rounded / 255.0f, 0));

          pix = data[index][1] + error * 0.777;

          data[index][0] = error *  0.7861;

          data[index][1] = error * -0.6098;

          for (int i2 = 1; i2 < 8; i2++)


            rounded = round(pix);

            error = pix - rounded;

            write_imagef(dstImg, (int2) (posx + i2, posy), (float4) (rounded / 255.0f, 0));

            pix = data[index][i2 + 1] + error * 0.777;

            data[index][i2 - 1] += error *  0.0090;

            data[index][i2    ] += error *  0.7861;

            data[index][i2 + 1]  = error * -0.6098;





        • Re: AMD OpenCL compiler wasting VGPRs like crazy

          try disable unroll. with #pragma

            • Re: AMD OpenCL compiler wasting VGPRs like crazy

              Thanks for the suggestion. I've tried "#pragma disable unroll" and "#pragma unroll 0". Compilation failed for both. Then I tried "#pragma nounroll" and it compiled fine. But it didn't make any difference to VGPR/SGPR usage.


              Edit: Or did you mean to enable unroll instead of disabling it? Tried that, too. It reduced the VGPR usage to 102 registers instead of 140. It's a nice decrease, but still *far* from making any sense. With this specific kernel, I would expect a VGPR usage of maybe 20-25.


              Also, I don't understand why forcing a loop to unroll would reduce VGPR usage from 140 to 102? Loop counters should be SGPRs. So unrolling loops should decrease SGPR usage but not VGPR usage, or am I wrong? The AMD "OpenCL Dos and Don'ts" document seems to agree. It says:


              > Loop unrolling can be used to improve performance

              > by removing overhead of branching


              The AMD document doesn't suggest in any way that unrolling would reduce VGPR usage.

                • Re: AMD OpenCL compiler wasting VGPRs like crazy

                  to disable unroll you must use #rpagma unroll 1 more in OpenCL AMD APP OpenCL programing guide section 5.8.1.


                  another possibility is that you specify 64 as required work-size. it is possible that compiler doesn't optimize register usage as there is plenty of them. you can try remove it or specify 256 as required work size so compiler is forced to conserve registers.


                  also why it this problem? as long there are no scratch register it shouldn't matter how many are used.

                    • Re: AMD OpenCL compiler wasting VGPRs like crazy

                      Ah, I see. I've tried "#pragma unroll 1" now, I've also tried using 256 as required work size, but neither helps. VGPR usage stays the same.


                      I thought that increasing occupancy would improve performance. Does it not? The reason I'm trying to limit VGPR usage is that CodeXL reports that VGPR limits occupancy to 10%.

                • Re: AMD OpenCL compiler wasting VGPRs like crazy

                  had the similar problem: VGPRs as intermediate storage


                  Compiler likes to use VGPR for every intermediate result like A+B if this A+B used more than once.

                  This maybe driver dependent (tested on 13.9), but you could try to fool compiler by using different expressions like

                  error *  0.0090 for the first time

                  and mad(error, 0.0090, 1) - 1 for the second time

                  setting required work group size to higher values also helps sometimes.

                  And sometimes there are no way to predict usage or understand logic.

                  On one my kernel i packed all my short variables into two short8 vectors and decreased VGPR number by 20.

                  On another kernel (very similar to the first) it gave +10 VGPRs.