5 Replies Latest reply on Feb 14, 2014 5:41 AM by aazmp

    VGPRs as intermediate storage

    aazmp

      how to prevent usage of VGPRs for storing intermediate data, that compiler find to be useful in future?

       

      example:

       

      In my program each kernels stores 8x8 pixel(8bit) data to Local memory; pixels are packed into uchar4 by 4.

      position of quadropixel is   array_in_LDS[get_local_size(0)*qpixel_num + get_local_id(0)]

       

      At some time i place them in local memory and after some time load from local memory.

      between these two times compiler stores each index in VGPR => 16 vgprs are used in this example

       

      There is a workaround:

      if at store time calculate index differently, for example mad24(get_local_size(0), qpixel_num, get_local_id(0)),

      then no additional VGPRs are used

       

      Calculating index is ~5 ticks or even 1 with mad24, while VGPRs are always expensive.

      Are there any solutions to prevent storing such easy calculated data without fooling compiler? Maybe some directive or some option to clBuildProgram?

        • Re: VGPRs as intermediate storage
          realhet

          I also had to use this trick and I'm quiet sure that this behaviour is uncontrollable at the OpenCL level. Because all these register allocations and optimizations are handled by the AMD_IL compiler and there are no way to say it in AMD_IL that a particular intermediate calculation can be stored in regs or must be recalculated every time.

          The mul24 trick is cool. In my problem I had to 'OR' the data with a 0 which was on a constant buffer.

            • Re: VGPRs as intermediate storage
              aazmp

              Thanks!

               

              P.S.

              VGPR usage is strange for every piece of code and jumps up and down unpredictably.

               

              with this fragment in code i get 104 VGPRs used

              x = x0/4; y = y0/4;

              //some code between lines using x,y

              x = (x0 + 16)/4; y = y0/4;

              x = x0/4; y = (y0 + 16)/4;

              x = (x0 + 16)/4; y = (y0 + 16)/4;

               

              with this, which is simpler in every way - 110 VGPRS

              x = x0/4; y = y0/4;

              x += 4;

              x -= 4; y += 4;

              x += 4;

               

              and with mix - 102 VGPR

              x = x0/4; y = y0/4;

              x += 4;

              x = x0/4; y = (y0 + 16)/4;

              x = (x0 + 16)/4; y = (y0 + 16)/4;

               

              I think there is no point to even try to optimize

            • Re: VGPRs as intermediate storage
              aazmp

              I)

              Just a story.

              Found in my kernel place where i could store data to __local memory to avoid reading it from __global later;

              Started to impelement it by small step.

               

              Kernel launch options: GlobalWorkSize={14592;1;1} WorkGroupSize={256,1,1} HD7850

              before optimization 123 VGPRs 34 SGPRs 0 Scratch

               

              step 1) added  a local storage for 6x4 bytes without using it:

              #define GR_SIZE 256

              __local uchar4 prefetched[6*GR_SIZE];

              VGPR usage went from 123 to 121

              (adding more unused variables and allocation done nothing)

               

              step 2) added one store line (but never load back)

              prefetched[(int)lid] = (*lap4p2).s0123; //vector passed into function by pointer

              VGPRs went from 121 to 128

              SGPRs from 34 to 57

              Scratch regs from 0 to 1320

              Execution time from 23ms to 208ms

               

              step 3) added all 6 store lines

              prefetched[(int)lid] = (*lap4p2).s0123;

              prefetched[(int)(lsz+lid)] = (*lap4p2).s4567;

              prefetched[(int)mad24(lsz,2,lid)] = (*l).s0123;

              prefetched[(int)mad24(lsz,3,lid)] = (*l).s4567;

              prefetched[(int)mad24(lsz,4,lid)] = (*l).s89AB;

              prefetched[(int)mad24(lsz,5,lid)] = (*l).sCDEF;

               

              VGPRs: 128 -> 256

              SGPRs: 57 -> 44

              Scratched: 1320 -> 384

              time: 208ms -> 28ms

               

              ...

               

              II)

              Also observation:

              for(i = 0; i < 3; i+=1)

              {

                   if (i%2==0) very_large_func_1(i);

                   else very_large_func_2(i)

              }

              works faster (by 10% in my case) and produce (twice) less VGPRs than

               

              for (i = 0; i < 3; i+=2)

              {

                   j = i;

                   very_large_func_1(j);

                   j = i + 1;

                   very_large_func_2(j)

              }

               

              it seems that compiler tries to parallelize code by variable renaming and you can't force serial code by using

              a = b; do1(a);

              a = b+1; do2(a);

              and have to fight it with IFs FORs and PRAGMA_UNROLLs

               

              III)

              and another example

              had a code like this

              do(a[0]); do(a[1]); do(a[2]); do(a[3]); do(a[4]); do(a[5]);

              it used 137 VGPRs

              tried this

              for(i = 0; i < 6; ++i) do(a[i]);

              it produced wrong result (i could make a mistake somewhere)

               

              but then just combined two parts (unlooped rewrite looped results)

              for(i = 0; i < 6; ++i) do(a[i]);

              do(a[0]); do(a[1]); do(a[2]); do(a[3]); do(a[4]); do(a[5]);

              output became correct and VGPR usage dropped from 137 to 114!!!

                • Re: VGPRs as intermediate storage
                  realhet

                  Hi,

                   

                  II) Maybe I see it wrong but those two loops would be equal only if: for(i = 0; i < 4; i+=1)

                  The first version which is 10% faster at you only issues 3 very_large_func while the second version issues 4.

                  It should be 33% faster, though.

                   

                  Just some thoughts: Your card has 1024 streams, so global work size is not so good:

                  If you go for 128 VGPRS, then you should set global work size to a multiple of 8*1024 in order to fill the CUes with tasks all the time.

                  256 VGPRS -> multiple of 4096

                  So unless your program is memory bound or heavily divergent, you could simply raise global work size to 16K with no cost.

                   

                  Low VGPRS usage is only important if you have to hide memory IO latency: to fill all the 16 CUes with not 4 but 8 or even 10 wavefronts so it will be able to give those 4 vector ALUs work all the time while another 4 or 6 wavefronts are waiting for IO.

                    • Re: VGPRs as intermediate storage
                      aazmp

                      Thanks.

                       

                      With 3/4 my mistake in message. In real kernel loop does 4 steps.

                      Tried setting global size to multiple of 1024, made no effect. Originally it was multiple of work group size(256).

                       

                      P.S

                      In the end managed to decrease VGPRs to 103 and gain speed by 20% (but this mostly to LDS usage i think).

                      Also found these actions to lower VGPR usage

                      1) pack data to preffered size vectors (not always)

                      2) using scalar operations even on vectors

                      a.x +b.x ... a.z + b.z  instead of a+b

                      3) not using manually unlooped code

                      even if in loop have to write something like

                      loop (i = 0..3) {

                           do something with D

                           A = (i==0) ? D : A;

                           B = (i==1) ? D : B;

                           C = (i==2) ? D : C

                      }

                      it works a little faster (and uses less vgprs)

                      than

                      do something with A

                      do something with B

                      do something with C

                      do something with D