5 Replies Latest reply on Nov 1, 2010 2:29 AM by aj_guillon

    Where's My Bottleneck?

    aj_guillon

      I have profiled my code, and one of my kernels has the characteristics measured below.

       


      Counter            Value
      Method             foobar_05490B68
      ExecutionOrder     37
      GlobalWorkSize     { 335104 1 1}
      GroupWorkSize      NULL
      Time               24.29411
      LDSSize            0
      DataTransferSize
      GPRs               90
      ScratchRegs        0
      FCStacks           3
      Wavefronts         5236.00
      ALUInsts           2220.68
      FetchInsts         41.03
      WriteInsts         14.05
      LDSFetchInsts      0.00
      LDSWriteInsts      0.00
      ALUBusy            13.26
      ALUFetchRatio      54.12
      ALUPacking         64.30
      FetchSize          211850.63
      CacheHit           0.00
      FetchUnitBusy      5.31
      FetchUnitStalled   3.27
      WriteUnitStalled   0.30
      FastPath           71791.13
      CompletePath       0.00
      PathUtilization    100.00
      ALUStalledByLDS    0.00
      LDSBankConflict    0.00

       



        • Where's My Bottleneck?
          aj_guillon

          Okay... that posted a bit before I wanted it to. Anyhow, I have a couple of questions about the bottleneck.  The profile shows a very high ALUFetchRatio, but the ALU is only busy 13% of the time, so I'm not sure what the performance bottleneck is in this code.  I also don't understand the cache measurement, because the guide is a bit ambiguous about it.  My understanding is that the cache hit counter only pertains to images, not global 1D arrays?  Does anyone have an idea of where I should focus my optimization efforts on this code?
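          As a quick sanity check on the posted counters, the sketch below recomputes two of them from the others. It assumes a wavefront width of 64 work-items (consistent with the GlobalWorkSize and Wavefronts values above) and assumes that ALUInsts and FetchInsts are per-work-item averages; the function names are made up for illustration and are not part of any profiler API.

```c
#include <assert.h>

/* Assumed wavefront width: 64 work-items. This matches the posted
   numbers (335104 / 64 = 5236) but is an assumption, not a profiler value. */
#define WAVEFRONT_SIZE 64UL

/* Number of wavefronts needed to cover a 1D global work size (rounded up). */
unsigned long wavefront_count(unsigned long global_work_size)
{
    return (global_work_size + WAVEFRONT_SIZE - 1UL) / WAVEFRONT_SIZE;
}

/* Ratio of ALU instructions to fetch instructions, taken from the
   ALUInsts and FetchInsts counters. */
double alu_fetch_ratio(double alu_insts, double fetch_insts)
{
    assert(fetch_insts > 0.0);
    return alu_insts / fetch_insts;
}
```

          With the values from the table, `wavefront_count(335104)` gives 5236 and `alu_fetch_ratio(2220.68, 41.03)` gives roughly 54.12, so the reported ALUFetchRatio simply restates the instruction mix. It says the kernel issues many ALU instructions per fetch; it does not by itself say the ALUs are kept busy, which is what ALUBusy measures.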

            • Where's My Bottleneck?
              LeeHowes

              Your ALU packing isn't particularly high, and I would guess that you have fairly short clauses, so the hardware is constantly executing in the sequencer. Look at the code in the kernel analyzer. You want to see few control flow clause instructions and long ALU clauses with high x, y, z, w, t occupancy, which keeps most of the work inside the SIMD unit rather than in the scheduling hardware.

              Normally that would mean doing a bit of manual loop unrolling. Sometimes just unrolling once will make a very large improvement (on all GPU hardware).
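              To illustrate the kind of manual unrolling meant here, the following is a minimal sketch in plain C rather than OpenCL C, using a hypothetical array sum as the workload. The unrolled version keeps four independent accumulators, which is one common way to give a compiler independent operations that it could place in separate x, y, z, w, t slots; whether a particular compiler actually does so is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: every add depends on the previous one, so there is
   little independent work to pack side by side. */
float sum_rolled(const float *v, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += v[i];
    return acc;
}

/* Unrolled by four with independent accumulators. The four adds in
   the loop body do not depend on each other. */
float sum_unrolled4(const float *v, size_t n)
{
    assert(v != NULL || n == 0);
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += v[i];
        a1 += v[i + 1];
        a2 += v[i + 2];
        a3 += v[i + 3];
    }
    /* Remainder loop for lengths that are not a multiple of four. */
    for (; i < n; ++i)
        a0 += v[i];
    return (a0 + a1) + (a2 + a3);
}
```

              Note that the unrolled version adds the elements in a different order, so for general floating-point data the two results can differ in the last bits.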

                • Where's My Bottleneck?
                  aj_guillon

                  My kernel has been fully unrolled, and I use predication everywhere (select() in OpenCL).  There is no divergence at any point in the kernel.  What should I look at to determine the size of the clauses?
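                  For readers unfamiliar with this style, here is a small sketch of select-based predication written in plain C. `select_i32` mimics the scalar form of OpenCL's `select(a, b, c)`, which yields `b` when `c` is nonzero and `a` otherwise; the clamp functions are hypothetical examples and are not taken from the kernel discussed in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar analogue of OpenCL select(a, b, c): b if c is nonzero, else a.
   Implemented with a bit mask so that no branch is required. */
int32_t select_i32(int32_t a, int32_t b, int32_t c)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(c != 0); /* all ones or all zeros */
    return (int32_t)(((uint32_t)a & ~mask) | ((uint32_t)b & mask));
}

/* Branching version, for comparison. */
int32_t clamp_branch(int32_t x, int32_t lo, int32_t hi)
{
    if (x < lo)
        return lo;
    if (x > hi)
        return hi;
    return x;
}

/* Predicated version: both comparisons are always evaluated and the
   results are chosen with select, so control flow never diverges. */
int32_t clamp_select(int32_t x, int32_t lo, int32_t hi)
{
    assert(lo <= hi);
    int32_t t = select_i32(x, lo, x < lo);
    return select_i32(t, hi, t > hi);
}
```

                  The trade-off is that predication computes both alternatives every time, so it removes divergence at the cost of extra ALU instructions.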

                    • Where's My Bottleneck?
                      aj_guillon

                      Here is some info from the ISA... it looks a lot like this throughout the file... would these be considered short clauses?

                       

                          2061  x: ADD_INT     ____,  PV2060.w,  T1.z
                                y: CNDE_INT    R123.y,  R2.y,  0.0f,  PV2060.y
                                z: AND_INT     R3.z,  R0.z,  (0x7F800000, 0.inff).x
                                w: AND_INT     T1.w,  PS2060,  (0x7F800000, 0.inff).x
                                t: AND_INT     T1.z,  R0.z,  (0x807FFFFF, -1.175494211e-38f).y
                          2062  x: CNDE_INT    R123.x,  PV2061.y,  0.0f,  -1
                                y: SETGE_INT   ____,  PV2061.x,  (0x000000FF, 3.573311084e-43f).x
                                z: AND_INT     R1.z,  T3.y,  (0x807FFFFF, -1.175494211e-38f).y
                                w: SETGE_INT   ____,  0.0f,  PV2061.x
                                t: SETE_INT    T2.x,  PV2061.z,  (0x7F800000, 0.inff).z
                          2063  x: CNDE_INT    R123.x,  T3.z,  PV2062.y,  0.0f
                                y: SETE_INT    T1.y,  T1.w,  (0x7F800000, 0.inff).x
                                z: AND_INT     ____,  PV2062.x,  (0x80000000, -0.0f).y
                                w: CNDE_INT    R123.w,  PV2062.w,  T3.x,  T1.x
                                t: SETE_INT    T3.z,  R3.z,  0.0f
                          2064  x: CNDE_INT    R4.x,  PV2063.z,  (0xBF800000, -1.0f).x,  T0.x
                                y: CNDE_INT    R1.y,  PV2063.z,  (0xBF800000, -1.0f).x,  T2.z      VEC_021
                                z: CNDE_INT    R2.z,  PV2063.x,  PV2063.w,  T0.z      VEC_120
                                w: SETE_INT    T2.w,  T1.w,  0.0f
                                t: AND_INT     T0.x,  R0.z,  (0x80000000, -0.0f).y
                          2065  x: MULADD_e    R123.x,  PV2064.z,  T0.w,  R7.x
                                y: MULADD_e    R2.y,  R3.y,  PV2064.z,  R7.x
                                z: OR_INT      ____,  T1.z,  (0x3F800000, 1.0f).x
                                w: AND_INT     ____,  T3.y,  (0x80000000, -0.0f).y      VEC_120
                                t: OR_INT      ____,  R1.z,  (0x3F800000, 1.0f).x
                          2066  x: CNDE_INT    R123.x,  T3.z,  PV2065.z,  T0.x
                                y: CNDE_INT    R123.y,  T2.w,  PS2065,  PV2065.w
                                z: OR_INT      ____,  T2.x,  T1.y
                                w: ADD         ____, -R36.x,  PV2065.x      VEC_120
                                t: OR_INT      ____,  T3.z,  T2.w
                          2067  x: OR_INT      R0.x,  PV2066.z,  PS2066
                                y: CNDE_INT    R123.y,  T1.y,  PV2066.y,  T3.y
                                z: CNDE_INT    T3.z,  T2.x,  PV2066.x,  R0.z      VEC_021
                                w: SETGT_DX10  ____,  0.0f,  PV2066.w
                                t: SUB_INT     ____,  R3.z,  T1.w
                          2068  x: CNDE_INT    T2.x,  PV2067.x,  PS2067,  0.0f