4 Replies Latest reply on Oct 29, 2009 8:32 AM by Jawed

    Why so different GPR requirements ?

    Raistmer
      Two kernels are listed below.

      According to SKA, the first of the attached kernels requires only 7 registers, while the second requires 18!

      But the number of directly declared registers is unchanged.
      Why does the second one require so many registers?

      kernel void GPU_coadd_and_compare_kernel54t_s8(float4 src[][], int thresholds[][], int level,
              out float4 dest<>, out float4 dest1<>, out int4 output<>, out int4 output1<>)
      {
          int threadID = instance().x;
          float threshold = (float)thresholds[level][threadID];
          int was_signal = 0;
          int bin = 0;
          int4 s;
          float4 i1, i2;
          float4 o11;
          float4 o21;
          i1 = src[0][threadID];
          i2 = src[1][threadID];
          if (i2.w >= threshold) { was_signal++; bin = 7; }
          if (i2.z >= threshold) { was_signal++; bin = 6; }
          if (i2.y >= threshold) { was_signal++; bin = 5; }
          if (i2.x >= threshold) { was_signal++; bin = 4; }
          if (i1.w >= threshold) { was_signal++; bin = 3; }
          if (i1.z >= threshold) { was_signal++; bin = 2; }
          if (i1.y >= threshold) { was_signal++; bin = 1; }
          if (i1.x >= threshold) { was_signal++; bin = 0; }
          s.x = was_signal;
          s.y = bin;
          o11.xy = i1.xz + i1.yw;
          o11.zw = i2.xz + i2.yw;
          dest = o11;
          was_signal = 0;
          threshold = (float)thresholds[level+1][threadID];
          if (o11.w >= threshold) { was_signal++; bin = 3; }
          if (o11.z >= threshold) { was_signal++; bin = 2; }
          if (o11.y >= threshold) { was_signal++; bin = 1; }
          if (o11.x >= threshold) { was_signal++; bin = 0; }
          threshold = (float)thresholds[level+2][threadID];
          s.z = was_signal;
          s.w = bin;
          output = s;
          o21.xy = o11.xz + o11.yw;
          dest1 = o21;
          was_signal = 0;
          if (o21.y >= threshold) { was_signal++; bin = 1; }
          if (o21.x >= threshold) { was_signal++; bin = 0; }
          s.x = was_signal;
          s.y = bin;
          output1 = s;
      }

      kernel void GPU_strideadd_and_compare_kernel54t_s8(float4 src[][], int thresholds[][], int level,
              out float4 dest<>/*s4*/, out float4 dest1<>/*s2*/,
              out float4 d2<>/*s4folded*/, out float4 d3<>/*s2afters4fold*/,
              out int4 output<>/*s8&s4*/, out int4 output1<>/*s2&s4folded*/,
              out int4 output2<>/*s2after fold*/)
      {
          int threadID = instance().x;
          float4 threshold;
          int was_signal = 0;
          int bin = 0;
          int4 s;
          float4 i1, i2;
          float4 o11;
          float4 o21;
          threshold.x = (float)thresholds[level][threadID];
          threshold.y = (float)thresholds[level+1][threadID];
          threshold.z = (float)thresholds[level+2][threadID];
          i1 = src[0][threadID];
          i2 = src[1][threadID];
          if (i2.w >= threshold.x) { was_signal++; bin = 7; }
          if (i2.z >= threshold.x) { was_signal++; bin = 6; }
          if (i2.y >= threshold.x) { was_signal++; bin = 5; }
          if (i2.x >= threshold.x) { was_signal++; bin = 4; }
          if (i1.w >= threshold.x) { was_signal++; bin = 3; }
          if (i1.z >= threshold.x) { was_signal++; bin = 2; }
          if (i1.y >= threshold.x) { was_signal++; bin = 1; }
          if (i1.x >= threshold.x) { was_signal++; bin = 0; }
          s.x = was_signal;
          s.y = bin;
          o11.xy = i1.xz + i1.yw;
          o11.zw = i2.xz + i2.yw;
          dest = o11;
          was_signal = 0;
          if (o11.w >= threshold.y) { was_signal++; bin = 3; }
          if (o11.z >= threshold.y) { was_signal++; bin = 2; }
          if (o11.y >= threshold.y) { was_signal++; bin = 1; }
          if (o11.x >= threshold.y) { was_signal++; bin = 0; }
          s.z = was_signal;
          s.w = bin;
          output = s;
          o21.xy = o11.xz + o11.yw;
          dest1 = o21;
          was_signal = 0;
          if (o21.y >= threshold.z) { was_signal++; bin = 1; }
          if (o21.x >= threshold.z) { was_signal++; bin = 0; }
          s.x = was_signal;
          s.y = bin;
          //S8 coadd finished, now do stride add and repeat coadd for S4
          was_signal = 0;
          i1.xyzw = i1.xyzw + i2.xyzw; //fold S8 to S4
          d2 = i1;
          if (i1.w >= threshold.y) { was_signal++; bin = 3; }
          if (i1.z >= threshold.y) { was_signal++; bin = 2; }
          if (i1.y >= threshold.y) { was_signal++; bin = 1; }
          if (i1.x >= threshold.y) { was_signal++; bin = 0; }
          s.z = was_signal;
          s.w = bin;
          output1 = s;
          //now coadd
          was_signal = 0;
          o11.xy = i1.xz + i1.yw;
          d3 = o11;
          if (o11.y >= threshold.z) { was_signal++; bin = 1; }
          if (o11.x >= threshold.z) { was_signal++; bin = 0; }
          s.x = was_signal;
          s.y = bin;
          output2 = s;
      }

        • Why so different GPR requirements ?
          Jawed

          The second kernel is computing more output data. Instead of computing each output one after the other, it can do parts of each computation in parallel for several of the outputs. Doing so increases the count of registers used.

          Jawed

            • Why so different GPR requirements ?
              Raistmer
              And do you think the compiler is doing the right thing here?

              AFAIK, excessive register use reduces the number of active threads that a SIMD can schedule. Fewer threads means worse memory access latency hiding, and so slower overall kernel execution.
              Or am I missing something here?
                • Why so different GPR requirements ?
                  riza.guntur

                  I don't think so.

                  ATI hardware is less prone to register pressure.

                    • Why so different GPR requirements ?
                      Jawed

                      Since your kernel has no loop, it's very easy to read off the statistics reported by SKA. On HD4870 SKA reports that both kernels have the same bottleneck, "Global Write". In theory this means that the register allocation has no effect.

                      Also, SKA reports ALU:Fetch. This is 2 for the first kernel and 3.5 for the second. In general the higher this number the less latency hiding is needed, so higher register allocation is OK. Fetch latency isn't the only kind of latency a kernel can experience, but it is often the most important.

                      So ALU:Fetch of 3.5 for the second kernel means there's little chance of fetch latencies causing a performance problem. And, as I said earlier, the fact the kernel is bottlenecked by the outputs means this chance is low. The 7 outputs in the second kernel mean that your kernel can have up to 70 ALU instructions without getting any slower (each output can have 10 ALU instructions).

                      AMD recommends a minimum of 5 hardware threads and suggests that 3 is the worst case that will still be able to hide latencies. 5 hardware threads corresponds to about 51 GPRs per thread, and 3 hardware threads to about 85 GPRs.
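The relation between GPR count and hardware threads can be sketched as a simple division. This is only an illustration: the budget of 256 GPRs is an assumption inferred from the figures above (5 x 51 and 3 x 85 both come to 255), not a number quoted in this thread, and the function name is made up for the example. Real hardware may cap the thread count through other limits.

```python
# Assumption: 256 GPRs are shared among the hardware threads.
# This budget is inferred from the 51-GPR / 85-GPR figures in the post,
# not stated there explicitly.
ASSUMED_GPR_BUDGET = 256


def max_hardware_threads(gprs_per_thread, budget=ASSUMED_GPR_BUDGET):
    """Upper bound on hardware threads when each one needs gprs_per_thread GPRs."""
    if gprs_per_thread <= 0:
        raise ValueError("gprs_per_thread must be positive")
    return budget // gprs_per_thread


if __name__ == "__main__":
    # 7 and 18 are the GPR counts SKA reported for the two kernels.
    for gprs in (7, 18, 51, 85):
        print(gprs, "GPRs ->", max_hardware_threads(gprs), "hardware threads")
```

Under that assumption, both kernels (7 and 18 GPRs) stay well above the recommended minimum of 5 hardware threads.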

                      If the kernel has one or more loops in it, then you have to identify the loop that consumes the most time, and then count the number of ALU instructions and count the number of fetch instructions yourself, to determine the ALU:Fetch ratio. You have to account for the hardware configuration of the ALUs - HD4870 can execute 4 ALU instructions for every single fetch instruction. So when you count the ALU instructions yourself, divide by 4, and then divide by the number of fetches to get the ALU:Fetch number SKA reports.

                      If a kernel is Global Write bottlenecked, due to the outputs, then ALU:Fetch is calculated on the basis that there are 10 ALU instructions per write (on HD4870, this number varies depending on the card). So the first kernel has 4 outputs, which means 40 ALU instructions. Dividing by 4 results in 10, and then dividing by the 5 fetch instructions results in ALU:Fetch = 2, which is what you see reported in SKA.
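The arithmetic above can be written out as a small sketch. The constants 4 (ALU instructions per fetch) and 10 (ALU instructions per write) are the HD4870 figures from this post; the function names are hypothetical, and the fetch count of 5 for the second kernel is a count taken from the listing (three threshold reads plus two src reads), not a value reported by SKA.

```python
# HD4870 figures quoted in the post; other cards differ.
ALU_PER_FETCH = 4    # ALU instructions executed per fetch instruction
ALU_PER_WRITE = 10   # ALU instructions that fit behind one output write


def alu_fetch_ratio(alu_instructions, fetch_instructions):
    """ALU:Fetch from raw counts: divide ALU by 4, then by the number of fetches."""
    return alu_instructions / ALU_PER_FETCH / fetch_instructions


def write_bound_alu_fetch_ratio(outputs, fetch_instructions):
    """ALU:Fetch for a Global Write bound kernel: each output counts as 10 ALU."""
    return alu_fetch_ratio(outputs * ALU_PER_WRITE, fetch_instructions)


if __name__ == "__main__":
    # First kernel: 4 outputs, 5 fetches.
    print(write_bound_alu_fetch_ratio(4, 5))
    # Second kernel: 7 outputs, 5 fetches (fetch count taken from the listing).
    print(write_bound_alu_fetch_ratio(7, 5))
```

With these inputs the sketch gives 2.0 and 3.5, matching the ALU:Fetch values mentioned earlier for the two kernels. For a kernel with loops, `alu_fetch_ratio` can be fed the hand-counted ALU and fetch instructions of the hottest loop instead.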

                      Jawed