Archives Discussions

aj_guillon · ‎10-31-2010

I have profiled my code, and one of my kernels has the characteristics measured below.

Method	foobar_05490B68
ExecutionOrder	37
GlobalWorkSize	{ 335104 1 1}
GroupWorkSize	NULL
Time	24.29411
LDSSize	0
DataTransferSize
GPRs	90
ScratchRegs	0
FCStacks	3
Wavefronts	5236.00
ALUInsts	2220.68
FetchInsts	41.03
WriteInsts	14.05
LDSFetchInsts	0.00
LDSWriteInsts	0.00
ALUBusy	13.26
ALUFetchRatio	54.12
ALUPacking	64.30
FetchSize	211850.63
CacheHit	0.00
FetchUnitBusy	5.31
FetchUnitStalled	3.27
WriteUnitStalled	0.30
FastPath	71791.13
CompletePath	0.00
PathUtilization	100.00
ALUStalledByLDS	0.00
LDSBankConflict	0.00

aj_guillon · ‎10-31-2010

Okay... that posted a bit before I wanted it to... anyhow, I have a couple of questions about the bottleneck. It shows here that I have a very high ALUFetchRatio, but the ALU is only busy 13% of the time. I'm not sure what the performance bottleneck is in this code. I also don't understand the cache measurement, because the guide is a bit ambiguous about it. My understanding is that CacheHit only pertains to images, not global 1D arrays? Does anyone have an idea of where I should focus my optimization efforts on this code?
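As a sanity check on how the counters in the first post relate to each other, here is a minimal sketch. It assumes a wavefront size of 64 work-items and that ALUFetchRatio is simply ALUInsts divided by FetchInsts; neither is stated in this thread, although both are consistent with the reported numbers.

```python
# Counter values copied from the profile in the first post.
global_work_size = 335104
alu_insts = 2220.68
fetch_insts = 41.03

# Assumption: 64 work-items per wavefront (not stated in the thread).
WAVEFRONT_SIZE = 64

# Number of wavefronts launched for the 1D global work size.
wavefronts = global_work_size / WAVEFRONT_SIZE

# Assumption: ALUFetchRatio = ALUInsts / FetchInsts.
alu_fetch_ratio = alu_insts / fetch_insts

print(f"wavefronts      = {wavefronts:.2f}")
print(f"ALU/fetch ratio = {alu_fetch_ratio:.2f}")
```

Under those assumptions the results reproduce the Wavefronts (5236.00) and ALUFetchRatio (54.12) rows of the profile, so a high ratio here only says the kernel issues many ALU instructions per fetch; it does not by itself say the ALUs are kept busy.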

LeeHowes · ‎10-31-2010

Your ALU packing isn't particularly high, and I would guess that you have fairly short clauses, so the hardware is constantly executing in the sequencer. Look at the code in the kernel analyzer. You want to see few control flow clause instructions and long ALU clauses with most of the x, y, z, w and t slots occupied, keeping most of the work inside the SIMD unit rather than in the scheduling hardware.

Normally that would mean doing a bit of manual loop unrolling. Sometimes just unrolling once will make a very large improvement (on all GPU hardware).

aj_guillon · ‎11-01-2010

My kernel has been fully unrolled, and I use predication everywhere (select() in OpenCL). There is no divergence at any point in the kernel. What should I look at to determine the size of the clauses?

aj_guillon · ‎11-01-2010

Here is some info from the ISA... it looks a lot like this throughout the file... would these be considered short clauses?

    2061 x: ADD_INT     ____, PV2060.w, T1.z
         y: CNDE_INT    R123.y, R2.y, 0.0f, PV2060.y
         z: AND_INT     R3.z, R0.z, (0x7F800000, 0.inff).x
         w: AND_INT     T1.w, PS2060, (0x7F800000, 0.inff).x
         t: AND_INT     T1.z, R0.z, (0x807FFFFF, -1.175494211e-38f).y
    2062 x: CNDE_INT    R123.x, PV2061.y, 0.0f, -1
         y: SETGE_INT   ____, PV2061.x, (0x000000FF, 3.573311084e-43f).x
         z: AND_INT     R1.z, T3.y, (0x807FFFFF, -1.175494211e-38f).y
         w: SETGE_INT   ____, 0.0f, PV2061.x
         t: SETE_INT    T2.x, PV2061.z, (0x7F800000, 0.inff).z
    2063 x: CNDE_INT    R123.x, T3.z, PV2062.y, 0.0f
         y: SETE_INT    T1.y, T1.w, (0x7F800000, 0.inff).x
         z: AND_INT     ____, PV2062.x, (0x80000000, -0.0f).y
         w: CNDE_INT    R123.w, PV2062.w, T3.x, T1.x
         t: SETE_INT    T3.z, R3.z, 0.0f
    2064 x: CNDE_INT    R4.x, PV2063.z, (0xBF800000, -1.0f).x, T0.x
         y: CNDE_INT    R1.y, PV2063.z, (0xBF800000, -1.0f).x, T2.z      VEC_021
         z: CNDE_INT    R2.z, PV2063.x, PV2063.w, T0.z      VEC_120
         w: SETE_INT    T2.w, T1.w, 0.0f
         t: AND_INT     T0.x, R0.z, (0x80000000, -0.0f).y
    2065 x: MULADD_e    R123.x, PV2064.z, T0.w, R7.x
         y: MULADD_e    R2.y, R3.y, PV2064.z, R7.x
         z: OR_INT      ____, T1.z, (0x3F800000, 1.0f).x
         w: AND_INT     ____, T3.y, (0x80000000, -0.0f).y      VEC_120
         t: OR_INT      ____, R1.z, (0x3F800000, 1.0f).x
    2066 x: CNDE_INT    R123.x, T3.z, PV2065.z, T0.x
         y: CNDE_INT    R123.y, T2.w, PS2065, PV2065.w
         z: OR_INT      ____, T2.x, T1.y
         w: ADD         ____, -R36.x, PV2065.x      VEC_120
         t: OR_INT      ____, T3.z, T2.w
    2067 x: OR_INT      R0.x, PV2066.z, PS2066
         y: CNDE_INT    R123.y, T1.y, PV2066.y, T3.y
         z: CNDE_INT    T3.z, T2.x, PV2066.x, R0.z      VEC_021
         w: SETGT_DX10 ____, 0.0f, PV2066.w
         t: SUB_INT     ____, R3.z, T1.w
    2068 x: CNDE_INT    T2.x, PV2067.x, PS2067, 0.0f
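One rough way to quantify the x, y, z, w, t slot occupancy in a listing like the one above is to count how many slots each numbered instruction group fills. The sketch below is only an illustration: the `slot_occupancy` helper and its regular expression are hypothetical, written against the text layout of this excerpt rather than any official dump format, and the sample input is two complete groups copied from the listing.

```python
import re

# Two complete instruction groups copied from the ISA excerpt above.
ISA = """\
    2066 x: CNDE_INT    R123.x, T3.z, PV2065.z, T0.x
         y: CNDE_INT    R123.y, T2.w, PS2065, PV2065.w
         z: OR_INT      ____, T2.x, T1.y
         w: ADD         ____, -R36.x, PV2065.x      VEC_120
         t: OR_INT      ____, T3.z, T2.w
    2067 x: OR_INT      R0.x, PV2066.z, PS2066
         y: CNDE_INT    R123.y, T1.y, PV2066.y, T3.y
         z: CNDE_INT    T3.z, T2.x, PV2066.x, R0.z      VEC_021
         w: SETGT_DX10 ____, 0.0f, PV2066.w
         t: SUB_INT     ____, R3.z, T1.w
"""

# Optional group number, then a slot letter, then the opcode.
SLOT_RE = re.compile(r"^\s*(?:(\d+)\s+)?([xyzwt]):\s+(\S+)")
SLOTS_PER_GROUP = 5  # x, y, z, w, t


def slot_occupancy(listing):
    """Map each instruction group number to the list of slots it uses."""
    groups = {}
    current = None
    for line in listing.splitlines():
        match = SLOT_RE.match(line)
        if match is None:
            continue  # not an ALU slot line
        if match.group(1) is not None:
            # A leading number starts a new instruction group.
            current = int(match.group(1))
            groups[current] = []
        if current is not None:
            groups[current].append(match.group(2))
    return groups


groups = slot_occupancy(ISA)
used = sum(len(slots) for slots in groups.values())
packing = 100.0 * used / (SLOTS_PER_GROUP * len(groups))
print(f"{len(groups)} groups, {used} slots used, packing {packing:.1f}%")
```

Every complete group in this excerpt fills all five slots, so the excerpt on its own is fully packed; the 64.30 ALUPacking figure in the profile is measured over the whole kernel, which suggests other parts of the ISA are less densely packed than this stretch.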

aj_guillon · ‎11-01-2010

Okay, sorry, I just realized these seem to all be part of the same clause. I'll read up on the ISA tomorrow so that I can take a peek at what has been generated.

Where's My Bottleneck?