4 Replies Latest reply on Aug 28, 2015 6:59 PM by boxerab

    Reducing VGPRS usage


      Are there guidelines for reducing VGPRS usage on AMD hardware?  CodeXL is telling me that VGPRS usage is limiting concurrency for my kernels, but I am not sure what actions to take.

        • Re: Reducing VGPRS usage

          Yes, use black magic!


          Otherwise it depends on the situation. Here's an example:

          The kernel reads a lot of values from RAM, does calculations on them, and finally writes the results back to the same addresses.

          In this case the compiler is likely to reserve additional registers to hold those addresses, so that they only need to be calculated once. But during the long calculations those registers just take up register space for nothing.

          You can work around this by adding a kernel parameter, set to 0 by the host, that is added to the output addresses. The compiler can then no longer assume that the input and output addresses are the same, so it will not keep them cached in registers.
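
          A minimal sketch of the offset trick described above, assuming a made-up kernel (square_plus_one) and a made-up parameter name (out_offset); neither comes from the original post. The #ifndef block is only a shim so the same source can be built and checked with an ordinary C compiler; it is not part of the trick. Whether this actually lowers the VGPR count depends on the compiler and driver version, so check the result in CodeXL.

```c
/* Host-side shim (not part of the trick): lets this file also build with a
   plain C compiler so the logic can be checked on the CPU. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define kernel
#define global
static size_t fake_gid = 0; /* hypothetical stand-in for the work-item id */
static size_t get_global_id(unsigned int dim) { (void)dim; return fake_gid; }
#endif

/* out_offset is a hypothetical extra argument; the host always passes 0. */
kernel void square_plus_one(global float *data, unsigned int out_offset)
{
    size_t gid = get_global_id(0);
    float v = data[gid];         /* load from the input address */
    v = v * v + 1.0f;            /* stand-in for the long calculation */
    data[gid + out_offset] = v;  /* store address now depends on out_offset */
}
```

          On the host the extra argument would be set to zero with clSetKernelArg like any other scalar argument, so the results are unchanged; only the compiler's view of the store address differs.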


          Another way is to play with #pragma unroll on loops.

          You can also wrap blocks of code in while loops that iterate only once.
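
          A rough sketch of both loop tricks, assuming a made-up kernel (sum_rows) that is not from the original post. The inner loop is marked with "#pragma unroll 1" to ask the compiler not to unroll it (my understanding is that AMD's OpenCL compiler accepts this pragma, but treat that as an assumption), and the block is wrapped in a while loop that runs exactly once. The #ifndef block is only a shim so the source can also be checked with a plain C compiler. Neither change is guaranteed to help; compare the VGPR count before and after.

```c
/* Host-side shim (not part of the trick): lets this file also build with a
   plain C compiler so the logic can be checked on the CPU. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define kernel
#define global
static size_t fake_gid = 0; /* hypothetical stand-in for the work-item id */
static size_t get_global_id(unsigned int dim) { (void)dim; return fake_gid; }
#endif

/* Each work-item sums one row of a row-major matrix with 'cols' columns. */
kernel void sum_rows(global const float *in, global float *out,
                     unsigned int cols)
{
    size_t gid = get_global_id(0);
    float acc = 0.0f;

    int once = 1;
    while (once) {               /* one-trip loop wrapped around the block */
        once = 0;
#ifdef __OPENCL_VERSION__
#pragma unroll 1                 /* request no unrolling of the next loop */
#endif
        for (unsigned int c = 0; c < cols; ++c)
            acc += in[gid * cols + c];
    }

    out[gid] = acc;
}
```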

          Sometimes the smallest change makes the biggest difference in GPR usage.

            • Re: Reducing VGPRS usage

              Thanks for the advice. It sounds frustrating to have to out-guess the compiler.  Did you see improved performance when you made these hacks?


              It would be nice to hear from someone at AMD about the state of their compiler: can we expect better usage of VGPRs in the future?  For example, NVIDIA's CUDA compiler has a --maxrregcount flag that caps the number of registers the compiler may use; it would be useful to have an equivalent attribute for OpenCL kernels.

                • Re: Reducing VGPRS usage

                  It's a complicated optimizer that compiles amd_il to low-level ISA, and the higher-level compilers (opencl->llvmir->amd_il) have limited ability to control the amd_il->isa optimizer.

                  Also, different Catalyst versions can optimize the same code differently.


                  I don't know how HSA behaves in this respect, though. Maybe it is totally different and gives you much more control. (If anyone knows, I'd be interested too.)


                  There's a third way: you can get absolute, driver-independent freedom of optimization by writing everything in asm. But that's more work, and it gives you a lot more opportunities to introduce stupid bugs.

                  Some other guys and I are making little assemblers. In case you're interested, just search for "gcn asm".