I have profiled my code, and one of my kernels has the characteristics measured below.
Method | foobar_05490B68 |
ExecutionOrder | 37 |
GlobalWorkSize | { 335104 1 1} |
GroupWorkSize | NULL |
Time | 24.29411 |
LDSSize | 0 |
DataTransferSize | |
GPRs | 90 |
ScratchRegs | 0 |
FCStacks | 3 |
Wavefronts | 5236.00 |
ALUInsts | 2220.68 |
FetchInsts | 41.03 |
WriteInsts | 14.05 |
LDSFetchInsts | 0.00 |
LDSWriteInsts | 0.00 |
ALUBusy | 13.26 |
ALUFetchRatio | 54.12 |
ALUPacking | 64.30 |
FetchSize | 211850.63 |
CacheHit | 0.00 |
FetchUnitBusy | 5.31 |
FetchUnitStalled | 3.27 |
WriteUnitStalled | 0.30 |
FastPath | 71791.13 |
CompletePath | 0.00 |
PathUtilization | 100.00 |
ALUStalledByLDS | 0.00 |
LDSBankConflict | 0.00 |
Okay... that posted a bit before I wanted it to. Anyhow, I have a couple of questions about the bottleneck. The profile shows a very high ALUFetchRatio, but the ALU is busy only 13% of the time, so I'm not sure what the performance bottleneck is in this code. I also don't understand the cache measurement, because the guide is a bit ambiguous about it. My understanding is that the cache hit figure pertains only to images, not to global 1D arrays? Does anyone have an idea of where I should focus my optimization efforts on this code?
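As a sanity check on the counters above, here is a minimal C sketch that recomputes two of them from the others. It assumes that ALUFetchRatio is simply ALUInsts divided by FetchInsts and that a wavefront holds 64 work-items; both assumptions are consistent with the numbers in the table but are not taken from the profiler documentation. The function names are hypothetical.

```c
/* Ratio of ALU instructions to fetch instructions.
   Assumption: this is how the ALUFetchRatio column is derived. */
double alu_fetch_ratio(double alu_insts, double fetch_insts) {
    return alu_insts / fetch_insts;
}

/* Number of wavefronts needed to cover a 1D global work size,
   rounding up. Assumption: wavefront_size is 64 on this device. */
unsigned long wavefront_count(unsigned long global_size,
                              unsigned long wavefront_size) {
    return (global_size + wavefront_size - 1) / wavefront_size;
}
```

With the values from the table, `alu_fetch_ratio(2220.68, 41.03)` comes out at roughly 54.12 and `wavefront_count(335104, 64)` at 5236, which matches the reported ALUFetchRatio and Wavefronts rows.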
Your ALU packing isn't particularly high, and I would guess that you have fairly short clauses, so the hardware is constantly executing in the sequencer. Look at the code in the Kernel Analyzer. You want to see few control-flow clause instructions and long ALU clauses with high x, y, z, w, t occupancy, which keeps most of the work inside the SIMD unit rather than in the scheduling hardware.
Normally that would mean doing a bit of manual loop unrolling. Sometimes just unrolling once will make a very large improvement (on all GPU hardware).
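To illustrate what manual unrolling looks like, here is a small sketch in plain C rather than OpenCL C. The function names are hypothetical and the reduction is only a stand-in for a real kernel loop. The unrolled version keeps four independent accumulators, which gives a compiler more independent operations to pack per instruction group; whether it actually helps on a given device has to be measured.

```c
#include <stddef.h>

/* Straightforward loop: one add per iteration, each add depends
   on the previous one. */
float sum_simple(const float *x, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        s += x[i];
    }
    return s;
}

/* Manually unrolled by four with independent accumulators, so the
   four adds in the loop body do not depend on each other. */
float sum_unrolled4(const float *x, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    float s = (s0 + s1) + (s2 + s3);
    /* Handle the leftover elements when n is not a multiple of 4. */
    for (; i < n; ++i) {
        s += x[i];
    }
    return s;
}
```

Note that the unrolled version adds the floats in a different order, so for general inputs the result can differ from the simple loop in the last bits.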
My kernel has been fully unrolled, and I use predication everywhere (select() in OpenCL). There is no divergence at any point in the kernel. What should I look at to determine the size of the clauses?
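For readers unfamiliar with the technique, predication with `select()` can be sketched in plain C as follows. This only emulates the scalar form of OpenCL's `select(a, b, c)`, which yields `b` when `c` is nonzero and `a` otherwise; the helper names are hypothetical and this is not the kernel from this thread.

```c
#include <stdint.h>

/* Branch-free emulation of scalar select(a, b, c):
   returns b if c is nonzero, otherwise a. */
int32_t select_i32(int32_t a, int32_t b, int32_t c) {
    int32_t mask = -(int32_t)(c != 0);   /* all ones if c != 0, else 0 */
    return (a & ~mask) | (b & mask);
}

/* Example use: replace "if (v < 0) v = 0;" with a predicated form. */
int32_t clamp_nonneg(int32_t v) {
    return select_i32(v, 0, v < 0);
}
```

Both candidate values are always computed and one is chosen at the end, so there is no divergent control flow, at the cost of doing the work for both sides.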
Here is some info from the ISA... it looks a lot like this throughout the file... would these be considered short clauses?
2061 x: ADD_INT ____, PV2060.w, T1.z
y: CNDE_INT R123.y, R2.y, 0.0f, PV2060.y
z: AND_INT R3.z, R0.z, (0x7F800000, 0.inff).x
w: AND_INT T1.w, PS2060, (0x7F800000, 0.inff).x
t: AND_INT T1.z, R0.z, (0x807FFFFF, -1.175494211e-38f).y
2062 x: CNDE_INT R123.x, PV2061.y, 0.0f, -1
y: SETGE_INT ____, PV2061.x, (0x000000FF, 3.573311084e-43f).x
z: AND_INT R1.z, T3.y, (0x807FFFFF, -1.175494211e-38f).y
w: SETGE_INT ____, 0.0f, PV2061.x
t: SETE_INT T2.x, PV2061.z, (0x7F800000, 0.inff).z
2063 x: CNDE_INT R123.x, T3.z, PV2062.y, 0.0f
y: SETE_INT T1.y, T1.w, (0x7F800000, 0.inff).x
z: AND_INT ____, PV2062.x, (0x80000000, -0.0f).y
w: CNDE_INT R123.w, PV2062.w, T3.x, T1.x
t: SETE_INT T3.z, R3.z, 0.0f
2064 x: CNDE_INT R4.x, PV2063.z, (0xBF800000, -1.0f).x, T0.x
y: CNDE_INT R1.y, PV2063.z, (0xBF800000, -1.0f).x, T2.z VEC_021
z: CNDE_INT R2.z, PV2063.x, PV2063.w, T0.z VEC_120
w: SETE_INT T2.w, T1.w, 0.0f
t: AND_INT T0.x, R0.z, (0x80000000, -0.0f).y
2065 x: MULADD_e R123.x, PV2064.z, T0.w, R7.x
y: MULADD_e R2.y, R3.y, PV2064.z, R7.x
z: OR_INT ____, T1.z, (0x3F800000, 1.0f).x
w: AND_INT ____, T3.y, (0x80000000, -0.0f).y VEC_120
t: OR_INT ____, R1.z, (0x3F800000, 1.0f).x
2066 x: CNDE_INT R123.x, T3.z, PV2065.z, T0.x
y: CNDE_INT R123.y, T2.w, PS2065, PV2065.w
z: OR_INT ____, T2.x, T1.y
w: ADD ____, -R36.x, PV2065.x VEC_120
t: OR_INT ____, T3.z, T2.w
2067 x: OR_INT R0.x, PV2066.z, PS2066
y: CNDE_INT R123.y, T1.y, PV2066.y, T3.y
z: CNDE_INT T3.z, T2.x, PV2066.x, R0.z VEC_021
w: SETGT_DX10 ____, 0.0f, PV2066.w
t: SUB_INT ____, R3.z, T1.w
2068 x: CNDE_INT T2.x, PV2067.x, PS2067, 0.0f
Okay, sorry, I just realized these seem to all be part of the same clause. I'll read up on the ISA tomorrow so that I can take a peek at what has been generated.