
aj_guillon
Adept I

Where's My Bottleneck?

I have profiled my code, and one of my kernels has the characteristics as measured below.

 


Method              foobar_05490B68
ExecutionOrder      37
GlobalWorkSize      { 335104 1 1}
GroupWorkSize       NULL
Time                24.29411
LDSSize             0
DataTransferSize
GPRs                90
ScratchRegs         0
FCStacks            3
Wavefronts          5236.00
ALUInsts            2220.68
FetchInsts          41.03
WriteInsts          14.05
LDSFetchInsts       0.00
LDSWriteInsts       0.00
ALUBusy             13.26
ALUFetchRatio       54.12
ALUPacking          64.30
FetchSize           211850.63
CacheHit            0.00
FetchUnitBusy       5.31
FetchUnitStalled    3.27
WriteUnitStalled    0.30
FastPath            71791.13
CompletePath        0.00
PathUtilization     100.00
ALUStalledByLDS     0.00
LDSBankConflict     0.00

 



0 Likes
5 Replies
aj_guillon
Adept I

Okay... that posted a bit before I wanted it to... anyhow, I have a couple of questions about the bottleneck.  It shows here that I have a very high ALUFetchRatio, but the ALU is only busy 13% of the time.  I'm not sure what the performance bottleneck is in this code.  I also don't understand the cache measurement, because the guide is a bit ambiguous about it.  My understanding is that the cache hit counter only pertains to images, not global 1D arrays?  Does anyone have an idea of where I should focus my optimization efforts on this code?
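As a quick sanity check on the counters, here is a minimal Python sketch that recomputes two of them from the others. The values are copied from the table in the first post; the function name `derived_counters` and the dictionary keys are made up for illustration, and the reading of the result is only my interpretation of the numbers, not something taken from the profiler guide.

```python
# Values copied from the profiler table in the first post.
GLOBAL_WORK_SIZE = 335104
WAVEFRONTS = 5236.00
ALU_INSTS = 2220.68
FETCH_INSTS = 41.03


def derived_counters():
    """Recompute two figures from the raw counters (hypothetical helper)."""
    return {
        # Work-items divided by wavefronts gives the wavefront width.
        "work_items_per_wavefront": GLOBAL_WORK_SIZE / WAVEFRONTS,
        # ALU instructions divided by fetch instructions.
        "alu_fetch_ratio": ALU_INSTS / FETCH_INSTS,
    }


derived = derived_counters()
for name, value in derived.items():
    print(f"{name}: {value:.2f}")
```

The first figure comes out at 64 work-items per wavefront, and the second matches the reported ALUFetchRatio of 54.12 to two decimals. So the ratio only says how many ALU instructions are issued per fetch; it says nothing by itself about how busy the ALU is, which would be consistent with a high ALUFetchRatio sitting next to an ALUBusy of 13.26.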

0 Likes

Your ALU packing isn't particularly high, and I would guess that you have fairly short clauses, so the hardware is constantly executing in the sequencer. Look at the code in the Kernel Analyzer. You want to see few control flow clause instructions and long ALU clauses with high x, y, z, w, t occupancy, which keeps most of the work inside the SIMD unit rather than in the scheduling hardware.

Normally that would mean doing a bit of manual loop unrolling. Sometimes just unrolling once will make a very large improvement (on all GPU hardware).

0 Likes

My kernel has been fully unrolled, and I use predication everywhere (select() in OpenCL).  There is no divergence at any point in the kernel.  What should I look at to determine the size of the clauses?

0 Likes

Here is some info from the ISA... it looks a lot like this throughout the file... would these be considered short clauses?

 

    2061  x: ADD_INT     ____,  PV2060.w,  T1.z
         y: CNDE_INT    R123.y,  R2.y,  0.0f,  PV2060.y
         z: AND_INT     R3.z,  R0.z,  (0x7F800000, 0.inff).x
         w: AND_INT     T1.w,  PS2060,  (0x7F800000, 0.inff).x
         t: AND_INT     T1.z,  R0.z,  (0x807FFFFF, -1.175494211e-38f).y
    2062  x: CNDE_INT    R123.x,  PV2061.y,  0.0f,  -1
         y: SETGE_INT   ____,  PV2061.x,  (0x000000FF, 3.573311084e-43f).x
         z: AND_INT     R1.z,  T3.y,  (0x807FFFFF, -1.175494211e-38f).y
         w: SETGE_INT   ____,  0.0f,  PV2061.x
         t: SETE_INT    T2.x,  PV2061.z,  (0x7F800000, 0.inff).z
    2063  x: CNDE_INT    R123.x,  T3.z,  PV2062.y,  0.0f
         y: SETE_INT    T1.y,  T1.w,  (0x7F800000, 0.inff).x
         z: AND_INT     ____,  PV2062.x,  (0x80000000, -0.0f).y
         w: CNDE_INT    R123.w,  PV2062.w,  T3.x,  T1.x
         t: SETE_INT    T3.z,  R3.z,  0.0f
    2064  x: CNDE_INT    R4.x,  PV2063.z,  (0xBF800000, -1.0f).x,  T0.x
         y: CNDE_INT    R1.y,  PV2063.z,  (0xBF800000, -1.0f).x,  T2.z      VEC_021
         z: CNDE_INT    R2.z,  PV2063.x,  PV2063.w,  T0.z      VEC_120
         w: SETE_INT    T2.w,  T1.w,  0.0f
         t: AND_INT     T0.x,  R0.z,  (0x80000000, -0.0f).y
    2065  x: MULADD_e    R123.x,  PV2064.z,  T0.w,  R7.x
         y: MULADD_e    R2.y,  R3.y,  PV2064.z,  R7.x
         z: OR_INT      ____,  T1.z,  (0x3F800000, 1.0f).x
         w: AND_INT     ____,  T3.y,  (0x80000000, -0.0f).y      VEC_120
         t: OR_INT      ____,  R1.z,  (0x3F800000, 1.0f).x
    2066  x: CNDE_INT    R123.x,  T3.z,  PV2065.z,  T0.x
         y: CNDE_INT    R123.y,  T2.w,  PS2065,  PV2065.w
         z: OR_INT      ____,  T2.x,  T1.y
         w: ADD         ____, -R36.x,  PV2065.x      VEC_120
         t: OR_INT      ____,  T3.z,  T2.w
    2067  x: OR_INT      R0.x,  PV2066.z,  PS2066
         y: CNDE_INT    R123.y,  T1.y,  PV2066.y,  T3.y
         z: CNDE_INT    T3.z,  T2.x,  PV2066.x,  R0.z      VEC_021
         w: SETGT_DX10  ____,  0.0f,  PV2066.w
         t: SUB_INT     ____,  R3.z,  T1.w
    2068  x: CNDE_INT    T2.x,  PV2067.x,  PS2067,  0.0f
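
The x, y, z, w, t occupancy mentioned earlier in the thread can be tallied from a dump like the one above. This is only a rough Python sketch: it assumes every instruction group starts with a number followed by a slot letter, exactly as in the pasted excerpt, and the names `slot_occupancy` and `packing_percent` are hypothetical helpers rather than part of any AMD tool. Whether the result corresponds exactly to the profiler's ALUPacking counter is also an assumption on my part.

```python
import re

# One slot line: optional instruction-group number, slot letter, opcode.
SLOT_LINE = re.compile(r"^\s*(?:(\d+)\s+)?([xyzwt]):\s+(\S+)")


def slot_occupancy(isa_text):
    """Map each instruction-group number to the list of slots it fills."""
    groups = {}
    current = None
    for line in isa_text.splitlines():
        match = SLOT_LINE.match(line)
        if not match:
            continue  # not a slot line (blank line, header, etc.)
        number, slot, _opcode = match.groups()
        if number is not None:
            current = int(number)  # a numbered line opens a new group
            groups[current] = []
        if current is not None:
            groups[current].append(slot)
    return groups


def packing_percent(groups, slots_per_group=5):
    """Filled slots as a percentage of all available slots."""
    if not groups:
        return 0.0
    filled = sum(len(slots) for slots in groups.values())
    return 100.0 * filled / (slots_per_group * len(groups))


# Last two groups of the excerpt; group 2068 is cut off in the paste.
sample = """
    2067  x: OR_INT      R0.x,  PV2066.z,  PS2066
         y: CNDE_INT    R123.y,  T1.y,  PV2066.y,  T3.y
         z: CNDE_INT    T3.z,  T2.x,  PV2066.x,  R0.z      VEC_021
         w: SETGT_DX10  ____,  0.0f,  PV2066.w
         t: SUB_INT     ____,  R3.z,  T1.w
    2068  x: CNDE_INT    T2.x,  PV2067.x,  PS2067,  0.0f
"""

groups = slot_occupancy(sample)
for number, slots in groups.items():
    print(number, "".join(slots))
print(f"packing: {packing_percent(groups):.1f}%")
```

Run over the full ISA file instead of the two-group sample, this gives a per-group view of which slots go unused. Note that groups 2061 to 2067 in the excerpt each fill all five slots, so the pasted region alone would not explain an overall packing figure of 64.30.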

0 Likes

Okay, sorry, I just realized these seem to all be part of the same clause.  I'll read up on the ISA tomorrow so that I can take a peek at what has been generated.

0 Likes