I had previously reported this but here I provide a test case for the examination of workgroup reduction function. Kernels perform a workgroup reduction in 3 ways:
1) The classical one with shared memory (OpenCL 1.2)
2) Shared memory plus sub-group reduction function on the final stage
3) Workgroup reduction function (no shared memory at all)
I tested it on a R7-260X and the latter two kernels prove to be significantly slower than reduction in shared memory. The last one especially is more than 5 times slower than using pure shared memory. This fact eliminates the value of these new functions in OpenCL 2.0. AFAIK, GCN GPUs feature swizzle operations which would could potentially make workgroup functions quite efficient. This is not the case however.
In addition the CodeXL 1.6 static kernel analyser does not support OpenCL 2.0 kernels. Therefore, I cannot investigate the disassembled kernel code.
Code on Github: https://github.com/ekondis/cl2-reduce-bench
More details on blog: http://parallelplusplus.blogspot.gr/2014/12/workgroup-reduction-function-evaluation.html
Thanks for sharing the testcase. We'll check and get back to you.
Regards,
I usually don't trust any measurement taking less than 4-5 ms. On my system, those are normal fluctuations! I assume using profile mode takes care of that...
... I'm pretty sure CL profile mode has been invented to be used, I don't trust it. As most APIs have historically been easygoing on the profile data, I wouldn't take for granted the results are coherent...
besides I cannot exclude my app had a small overhead with profile on.
While the problem of reductions might be useful for some cases, I don't see value in a kernel doing only this, it's not relevant to me. In general, I've noticed using advanced functionality makes simple kernels slower (I have been told they take more time to "set up") while making complex kernels usually faster. This kernel is super simple, its only memory access is in writing out the result. There's no other workload and this severely hampers HW ability to do things optimally.
The results seem to be quite stable. No significant fluctuations between executions have been observed. Even if profiling contributes extra latency it should be mostly identical on all kernel executions. And it's a comparison between the three executions after all.
The kernel is deliberately simple as I want to expose the performance of the specific workgroup function and not to include other factors that might affect performance. The three kernels differ only in the reduction part since it is the focus of the comparison.
Just to mention that I've seen in a few known GPU benchmarks kernel executions that take just a couple of milliseconds (e.g. Rodinia).
When someone argues that your test methodology is wrong, the answer is to run you test millions of times, and show the distribution of the running times.
...or the one who argues could supply results (millions?) in order to support his argument. The truth is that I don't have time to prove the obvious.
One thousand times would certainly enough by far, I was kidding...It would only takes seconds...
Unfortunately, I still haven't got any reply on the issue.
However, I got a reminder note about saving temporary files with the "-save-temps" compilation option, so I was able to get the disassembly code of all 3 kernels.
The results are quite revealing of the computational cost each kernel. I'll focus on the last kernel which makes use of the workgroup reduction function. The kernel's code is quite simple:
__kernel void reductionWkgrp(__global uint *result) {
const uint id = get_global_id(0);
const uint lid = get_local_id(0);
uint res = id;
// workgroup reduction (introduced in OpenCL 2.0)
res = work_group_reduce_add(res);
// atomic reduce in global memory
if(lid==0)
atomic_add(result, res);
}
The generated disassembly code on a Bonaire GPU is enormous:
ShaderType = IL_SHADER_COMPUTE
TargetChip = t
; ------------- SC_SRCSHADER Dump ------------------
SC_SHADERSTATE: u32NumIntVSConst = 0
SC_SHADERSTATE: u32NumIntPSConst = 0
SC_SHADERSTATE: u32NumIntGSConst = 0
SC_SHADERSTATE: u32NumBoolVSConst = 0
SC_SHADERSTATE: u32NumBoolPSConst = 0
SC_SHADERSTATE: u32NumBoolGSConst = 0
SC_SHADERSTATE: u32NumFloatVSConst = 0
SC_SHADERSTATE: u32NumFloatPSConst = 0
SC_SHADERSTATE: u32NumFloatGSConst = 0
u32SCOptions[0] = 0x00680000 SCOption_IGNORE_SAMPLE_L_BUG SCOption_FLOAT_DO_NOT_DIST SCOption_FLOAT_DO_NOT_REASSOC
u32SCOptions[1] = 0x00000000
u32SCOptions[2] = 0x08000000 SCOption_R1000_READLANE_SMRD_WORKAROUND_BUG343479
u32SCOptions[3] = 0x00001284 SCOption_R1000_BARRIER_WORKAROUND_BUG405404 SCOption_R1000_IFORK_BACKJUMP_WORKAROUND_BUG368004 SCOption_R1000R1100_VCCZ_CLOBBER_WORKAROUND_BUG457939 SCOption_R1100_FLAT_SCRATCH_OOB_WORKAROUND_BUG379895
u32SCOptions[4] = 0x00000000
; -------- Disassembly --------------------
shader main
asic(CI)
type(CS)
s_load_dwordx4 s[12:15], s[4:5], 0x01 // 000000000000: C0860501
s_mov_b32 m0, 0x00010000 // 000000000004: BEFC03FF 00010000
s_waitcnt lgkmcnt(0) // 00000000000C: BF8C007F
s_bfe_u32 s11, s12, 0x00100000 // 000000000010: 938BFF0C 00100000
s_bfe_u32 s12, s12, 0x00100010 // 000000000018: 938CFF0C 00100010
s_bfe_u32 s13, s13, 0x00100000 // 000000000020: 938DFF0D 00100000
v_mul_lo_u32 v2, v2, s12 // 000000000028: D2D20002 00001902
v_add_i32 v1, vcc, v2, v1 // 000000000030: 4A020302
v_mul_lo_u32 v1, v1, s11 // 000000000034: D2D20001 00001701
v_add_i32 v1, vcc, v1, v0 // 00000000003C: 4A020101
s_cmp_eq_u32 0, s11 // 000000000040: BF060B80
s_cselect_b32 s0, -1, 0 // 000000000044: 850080C1
s_cmp_eq_u32 0, s11 // 000000000048: BF060B80
s_cbranch_scc1 label_0030 // 00000000004C: BF85001C
s_cmp_gt_u32 s11, s14 // 000000000050: BF080E0B
s_cbranch_scc1 label_0030 // 000000000054: BF85001A
s_flbit_i32_b32 s1, s11 // 000000000058: BE81150B
s_flbit_i32_b32 s2, s14 // 00000000005C: BE82150E
s_sub_i32 s20, s1, s2 // 000000000060: 81940201
s_lshl_b32 s1, s11, s20 // 000000000064: 8F01140B
s_sub_u32 s17, s14, s1 // 000000000068: 8091010E
s_cselect_b32 s21, s14, s17 // 00000000006C: 8515110E
s_cselect_b32 s18, 0, 1 // 000000000070: 85128180
s_add_u32 s3, s0, s18 // 000000000074: 80031200
s_cmp_eq_u32 0, s20 // 000000000078: BF061480
s_cbranch_scc0 label_0022 // 00000000007C: BF840002
s_mov_b32 s0, s3 // 000000000080: BE800303
s_branch label_0030 // 000000000084: BF82000E
label_0022:
s_mov_b32 s2, s3 // 000000000088: BE820303
label_0023:
s_add_u32 s2, s2, s2 // 00000000008C: 80020202
s_lshr_b32 s1, s1, 1 // 000000000090: 90018101
s_sub_u32 s18, s21, s1 // 000000000094: 80920115
s_cselect_b32 s17, s21, s18 // 000000000098: 85111215
s_cselect_b32 s19, 0, 1 // 00000000009C: 85138180
s_add_u32 s16, s2, s19 // 0000000000A0: 80101302
s_addk_i32 s20, 0xffff // 0000000000A4: B794FFFF
s_cmp_le_u32 s20, 0 // 0000000000A8: BF0B8014
s_cbranch_scc1 label_002F // 0000000000AC: BF850003
s_mov_b32 s2, s16 // 0000000000B0: BE820310
s_mov_b32 s21, s17 // 0000000000B4: BE950311
s_branch label_0023 // 0000000000B8: BF82FFF4
label_002F:
s_mov_b32 s0, s16 // 0000000000BC: BE800310
label_0030:
s_mul_i32 s1, s0, s11 // 0000000000C0: 93010B00
s_sub_i32 s1, s14, s1 // 0000000000C4: 8181010E
s_cmp_eq_u32 0, s12 // 0000000000C8: BF060C80
s_cselect_b32 s2, -1, 0 // 0000000000CC: 850280C1
s_cmp_eq_u32 0, s12 // 0000000000D0: BF060C80
s_cbranch_scc1 label_0052 // 0000000000D4: BF85001C
s_cmp_gt_u32 s12, s15 // 0000000000D8: BF080F0C
s_cbranch_scc1 label_0052 // 0000000000DC: BF85001A
s_flbit_i32_b32 s3, s12 // 0000000000E0: BE83150C
s_flbit_i32_b32 s14, s15 // 0000000000E4: BE8E150F
s_sub_i32 s21, s3, s14 // 0000000000E8: 81950E03
s_lshl_b32 s3, s12, s21 // 0000000000EC: 8F03150C
s_sub_u32 s18, s15, s3 // 0000000000F0: 8092030F
s_cselect_b32 s22, s15, s18 // 0000000000F4: 8516120F
s_cselect_b32 s19, 0, 1 // 0000000000F8: 85138180
s_add_u32 s16, s2, s19 // 0000000000FC: 80101302
s_cmp_eq_u32 0, s21 // 000000000100: BF061580
s_cbranch_scc0 label_0044 // 000000000104: BF840002
s_mov_b32 s2, s16 // 000000000108: BE820310
s_branch label_0052 // 00000000010C: BF82000E
label_0044:
s_mov_b32 s14, s16 // 000000000110: BE8E0310
label_0045:
s_add_u32 s14, s14, s14 // 000000000114: 800E0E0E
s_lshr_b32 s3, s3, 1 // 000000000118: 90038103
s_sub_u32 s19, s22, s3 // 00000000011C: 80930316
s_cselect_b32 s18, s22, s19 // 000000000120: 85121316
s_cselect_b32 s20, 0, 1 // 000000000124: 85148180
s_add_u32 s17, s14, s20 // 000000000128: 8011140E
s_addk_i32 s21, 0xffff // 00000000012C: B795FFFF
s_cmp_le_u32 s21, 0 // 000000000130: BF0B8015
s_cbranch_scc1 label_0051 // 000000000134: BF850003
s_mov_b32 s14, s17 // 000000000138: BE8E0311
s_mov_b32 s22, s18 // 00000000013C: BE960312
s_branch label_0045 // 000000000140: BF82FFF4
label_0051:
s_mov_b32 s2, s17 // 000000000144: BE820311
label_0052:
s_mul_i32 s3, s2, s12 // 000000000148: 93030C02
s_sub_i32 s3, s15, s3 // 00000000014C: 8183030F
s_cmp_eq_u32 0, s13 // 000000000150: BF060D80
s_cselect_b32 s17, -1, 0 // 000000000154: 851180C1
s_load_dword s4, s[4:5], 0x05 // 000000000158: C0020505
s_cmp_eq_u32 0, s13 // 00000000015C: BF060D80
s_cbranch_scc1 label_0075 // 000000000160: BF85001C
s_waitcnt lgkmcnt(0) // 000000000164: BF8C007F
s_cmp_gt_u32 s13, s4 // 000000000168: BF08040D
s_cbranch_scc1 label_0075 // 00000000016C: BF850019
s_flbit_i32_b32 s5, s13 // 000000000170: BE85150D
s_flbit_i32_b32 s15, s4 // 000000000174: BE8F1504
s_sub_i32 s5, s5, s15 // 000000000178: 81850F05
s_lshl_b32 s21, s13, s5 // 00000000017C: 8F15050D
s_sub_u32 s18, s4, s21 // 000000000180: 80921504
s_cselect_b32 s22, s4, s18 // 000000000184: 85161204
s_cselect_b32 s19, 0, 1 // 000000000188: 85138180
s_add_u32 s16, s17, s19 // 00000000018C: 80101311
s_cmp_eq_u32 0, s5 // 000000000190: BF060580
s_cbranch_scc0 label_0068 // 000000000194: BF840002
s_mov_b32 s17, s16 // 000000000198: BE910310
s_branch label_0075 // 00000000019C: BF82000D
label_0068:
s_mov_b32 s15, s16 // 0000000001A0: BE8F0310
label_0069:
s_add_u32 s15, s15, s15 // 0000000001A4: 800F0F0F
s_lshr_b32 s21, s21, 1 // 0000000001A8: 90158115
s_sub_u32 s19, s22, s21 // 0000000001AC: 80931516
s_cselect_b32 s18, s22, s19 // 0000000001B0: 85121316
s_cselect_b32 s20, 0, 1 // 0000000001B4: 85148180
s_add_u32 s17, s15, s20 // 0000000001B8: 8011140F
s_addk_i32 s5, 0xffff // 0000000001BC: B785FFFF
s_cmp_le_u32 s5, 0 // 0000000001C0: BF0B8005
s_cbranch_scc1 label_0075 // 0000000001C4: BF850003
s_mov_b32 s15, s17 // 0000000001C8: BE8F0311
s_mov_b32 s22, s18 // 0000000001CC: BE960312
s_branch label_0069 // 0000000001D0: BF82FFF4
label_0075:
s_mul_i32 s14, s17, s13 // 0000000001D4: 930E0D11
s_waitcnt lgkmcnt(0) // 0000000001D8: BF8C007F
s_sub_i32 s4, s4, s14 // 0000000001DC: 81840E04
s_cmp_lt_u32 s8, s0 // 0000000001E0: BF0A0008
s_cselect_b32 s0, s11, s1 // 0000000001E4: 8500010B
s_cmp_lt_u32 s9, s2 // 0000000001E8: BF0A0209
s_cselect_b32 s1, s12, s3 // 0000000001EC: 8501030C
s_cmp_lt_u32 s10, s17 // 0000000001F0: BF0A110A
s_cselect_b32 s2, s13, s4 // 0000000001F4: 8502040D
s_mul_i32 s3, s11, s8 // 0000000001F8: 9303080B
s_load_dwordx2 s[4:5], s[6:7], 0x00 // 0000000001FC: C0420700
v_cmp_ne_u32 s[8:9], 1, 0 // 000000000200: D18A0008 00010081
s_mul_i32 s0, s1, s0 // 000000000208: 93000001
s_mul_i32 s0, s0, s2 // 00000000020C: 93000200
s_waitcnt lgkmcnt(0) // 000000000210: BF8C007F
s_add_u32 s1, s3, s4 // 000000000214: 80010403
s_add_u32 s0, s0, 63 // 000000000218: 8000BF00
s_bcnt1_i32_b64 s2, s[8:9] // 00000000021C: BE821008
v_add_i32 v2, vcc, s1, v0 // 000000000220: 4A040001
s_lshr_b32 s0, s0, 6 // 000000000224: 90008600
v_mbcnt_lo_u32_b32 v3, -1, 0 // 000000000228: D2460003 000100C1
v_mbcnt_hi_u32_b32 v3, -1, v3 vop3 // 000000000230: D2480003 000206C1
v_bcnt_u32_b32 v4, s2, 0 // 000000000238: D2440004 00010002
v_mov_b32 v5, 1 // 000000000240: 7E0A0281
v_cmp_eq_i32 vcc, v4, v5 // 000000000244: 7D040B04
s_cbranch_vccz label_011E // 000000000248: BF86008B
v_xor_b32 v5, 1, v3 // 00000000024C: 3A0A0681
v_lshl_b64 v[6:7], 1, v5 // 000000000250: D2C20006 00020A81
v_and_b32 v6, exec_lo, v6 // 000000000258: 360C0C7E
v_and_b32 v7, exec_hi, v7 // 00000000025C: 360E0E7F
v_cmp_ne_u64 s[4:5], 0, v[6:7] // 000000000260: D1CA0004 00020C80
s_mov_b64 s[8:9], exec // 000000000268: BE88047E
s_mov_b64 s[10:11], exec // 00000000026C: BE8A047E
label_009C:
v_readfirstlane_b32 s1, v5 // 000000000270: 7E020505
v_cmp_eq_u32 vcc, s1, v5 // 000000000274: 7D840A01
s_and_saveexec_b64 s[12:13], vcc // 000000000278: BE8C246A
v_readlane_b32 s1, v2, s1 // 00000000027C: 02020302
s_cbranch_execz label_00A4 // 000000000280: BF880003
v_mov_b32 v5, s1 // 000000000284: 7E0A0201
s_andn2_b64 s[10:11], s[10:11], exec // 000000000288: 8A8A7E0A
s_cbranch_scc0 label_00A6 // 00000000028C: BF840002
label_00A4:
s_and_b64 exec, s[12:13], s[10:11] // 000000000290: 87FE0A0C
s_branch label_009C // 000000000294: BF82FFF6
label_00A6:
s_mov_b64 exec, s[8:9] // 000000000298: BEFE0408
v_cndmask_b32 v5, 0, v5, s[4:5] // 00000000029C: D2000005 00120A80
v_add_i32 v2, vcc, v5, v2 // 0000000002A4: 4A040505
v_xor_b32 v5, 2, v3 // 0000000002A8: 3A0A0682
v_lshl_b64 v[6:7], 1, v5 // 0000000002AC: D2C20006 00020A81
v_and_b32 v6, exec_lo, v6 // 0000000002B4: 360C0C7E
v_and_b32 v7, exec_hi, v7 // 0000000002B8: 360E0E7F
v_cmp_ne_u64 s[4:5], 0, v[6:7] // 0000000002BC: D1CA0004 00020C80
s_mov_b64 s[8:9], exec // 0000000002C4: BE88047E
s_mov_b64 s[10:11], exec // 0000000002C8: BE8A047E
label_00B3:
v_readfirstlane_b32 s1, v5 // 0000000002CC: 7E020505
v_cmp_eq_u32 vcc, s1, v5 // 0000000002D0: 7D840A01
s_and_saveexec_b64 s[12:13], vcc // 0000000002D4: BE8C246A
v_readlane_b32 s1, v2, s1 // 0000000002D8: 02020302
s_cbranch_execz label_00BB // 0000000002DC: BF880003
v_mov_b32 v5, s1 // 0000000002E0: 7E0A0201
s_andn2_b64 s[10:11], s[10:11], exec // 0000000002E4: 8A8A7E0A
s_cbranch_scc0 label_00BD // 0000000002E8: BF840002
label_00BB:
s_and_b64 exec, s[12:13], s[10:11] // 0000000002EC: 87FE0A0C
s_branch label_00B3 // 0000000002F0: BF82FFF6
label_00BD:
s_mov_b64 exec, s[8:9] // 0000000002F4: BEFE0408
v_cndmask_b32 v5, 0, v5, s[4:5] // 0000000002F8: D2000005 00120A80
v_add_i32 v2, vcc, v5, v2 // 000000000300: 4A040505
v_xor_b32 v5, 4, v3 // 000000000304: 3A0A0684
v_lshl_b64 v[6:7], 1, v5 // 000000000308: D2C20006 00020A81
v_and_b32 v6, exec_lo, v6 // 000000000310: 360C0C7E
v_and_b32 v7, exec_hi, v7 // 000000000314: 360E0E7F
v_cmp_ne_u64 s[4:5], 0, v[6:7] // 000000000318: D1CA0004 00020C80
s_mov_b64 s[8:9], exec // 000000000320: BE88047E
s_mov_b64 s[10:11], exec // 000000000324: BE8A047E
label_00CA:
v_readfirstlane_b32 s1, v5 // 000000000328: 7E020505
v_cmp_eq_u32 vcc, s1, v5 // 00000000032C: 7D840A01
s_and_saveexec_b64 s[12:13], vcc // 000000000330: BE8C246A
v_readlane_b32 s1, v2, s1 // 000000000334: 02020302
s_cbranch_execz label_00D2 // 000000000338: BF880003
v_mov_b32 v5, s1 // 00000000033C: 7E0A0201
s_andn2_b64 s[10:11], s[10:11], exec // 000000000340: 8A8A7E0A
s_cbranch_scc0 label_00D4 // 000000000344: BF840002
label_00D2:
s_and_b64 exec, s[12:13], s[10:11] // 000000000348: 87FE0A0C
s_branch label_00CA // 00000000034C: BF82FFF6
label_00D4:
s_mov_b64 exec, s[8:9] // 000000000350: BEFE0408
v_cndmask_b32 v5, 0, v5, s[4:5] // 000000000354: D2000005 00120A80
v_add_i32 v2, vcc, v5, v2 // 00000000035C: 4A040505
v_xor_b32 v5, 32, v3 // 000000000360: 3A0A06A0
v_xor_b32 v6, 16, v3 // 000000000364: 3A0C0690
v_xor_b32 v7, 8, v3 // 000000000368: 3A0E0688
v_lshl_b64 v[8:9], 1, v7 // 00000000036C: D2C20008 00020E81
v_and_b32 v8, exec_lo, v8 // 000000000374: 3610107E
v_and_b32 v9, exec_hi, v9 // 000000000378: 3612127F
v_cmp_ne_u64 s[4:5], 0, v[8:9] // 00000000037C: D1CA0004 00021080
s_mov_b64 s[8:9], exec // 000000000384: BE88047E
s_mov_b64 s[10:11], exec // 000000000388: BE8A047E
label_00E3:
v_readfirstlane_b32 s1, v7 // 00000000038C: 7E020507
v_cmp_eq_u32 vcc, s1, v7 // 000000000390: 7D840E01
s_and_saveexec_b64 s[12:13], vcc // 000000000394: BE8C246A
v_readlane_b32 s1, v2, s1 // 000000000398: 02020302
s_cbranch_execz label_00EB // 00000000039C: BF880003
v_mov_b32 v7, s1 // 0000000003A0: 7E0E0201
s_andn2_b64 s[10:11], s[10:11], exec // 0000000003A4: 8A8A7E0A
s_cbranch_scc0 label_00ED // 0000000003A8: BF840002
label_00EB:
s_and_b64 exec, s[12:13], s[10:11] // 0000000003AC: 87FE0A0C
s_branch label_00E3 // 0000000003B0: BF82FFF6
label_00ED:
s_mov_b64 exec, s[8:9] // 0000000003B4: BEFE0408
v_cndmask_b32 v7, 0, v7, s[4:5] // 0000000003B8: D2000007 00120E80
v_add_i32 v2, vcc, v7, v2 // 0000000003C0: 4A040507
v_lshl_b64 v[7:8], 1, v6 // 0000000003C4: D2C20007 00020C81
v_and_b32 v7, exec_lo, v7 // 0000000003CC: 360E0E7E
v_and_b32 v8, exec_hi, v8 // 0000000003D0: 3610107F
v_cmp_ne_u64 s[4:5], 0, v[7:8] // 0000000003D4: D1CA0004 00020E80
s_mov_b64 s[8:9], exec // 0000000003DC: BE88047E
s_mov_b64 s[10:11], exec // 0000000003E0: BE8A047E
label_00F9:
v_readfirstlane_b32 s1, v6 // 0000000003E4: 7E020506
v_cmp_eq_u32 vcc, s1, v6 // 0000000003E8: 7D840C01
s_and_saveexec_b64 s[12:13], vcc // 0000000003EC: BE8C246A
v_readlane_b32 s1, v2, s1 // 0000000003F0: 02020302
s_cbranch_execz label_0101 // 0000000003F4: BF880003
v_mov_b32 v6, s1 // 0000000003F8: 7E0C0201
s_andn2_b64 s[10:11], s[10:11], exec // 0000000003FC: 8A8A7E0A
s_cbranch_scc0 label_0103 // 000000000400: BF840002
label_0101:
s_and_b64 exec, s[12:13], s[10:11] // 000000000404: 87FE0A0C
s_branch label_00F9 // 000000000408: BF82FFF6
label_0103:
s_mov_b64 exec, s[8:9] // 00000000040C: BEFE0408
v_cndmask_b32 v6, 0, v6, s[4:5] // 000000000410: D2000006 00120C80
v_add_i32 v2, vcc, v6, v2 // 000000000418: 4A040506
v_lshl_b64 v[6:7], 1, v5 // 00000000041C: D2C20006 00020A81
v_and_b32 v6, exec_lo, v6 // 000000000424: 360C0C7E
v_and_b32 v7, exec_hi, v7 // 000000000428: 360E0E7F
v_cmp_ne_u64 s[4:5], 0, v[6:7] // 00000000042C: D1CA0004 00020C80
s_mov_b64 s[8:9], exec // 000000000434: BE88047E
s_mov_b64 s[10:11], exec // 000000000438: BE8A047E
label_010F:
v_readfirstlane_b32 s1, v5 // 00000000043C: 7E020505
v_cmp_eq_u32 vcc, s1, v5 // 000000000440: 7D840A01
s_and_saveexec_b64 s[12:13], vcc // 000000000444: BE8C246A
v_readlane_b32 s1, v2, s1 // 000000000448: 02020302
s_cbranch_execz label_0117 // 00000000044C: BF880003
v_mov_b32 v5, s1 // 000000000450: 7E0A0201
s_andn2_b64 s[10:11], s[10:11], exec // 000000000454: 8A8A7E0A
s_cbranch_scc0 label_0119 // 000000000458: BF840002
label_0117:
s_and_b64 exec, s[12:13], s[10:11] // 00000000045C: 87FE0A0C
s_branch label_010F // 000000000460: BF82FFF6
label_0119:
s_mov_b64 exec, s[8:9] // 000000000464: BEFE0408
v_cndmask_b32 v5, 0, v5, s[4:5] // 000000000468: D2000005 00120A80
v_add_i32 v7, vcc, v5, v2 // 000000000470: 4A0E0505
s_branch label_01C4 // 000000000474: BF8200A6
label_011E:
v_cmp_eq_i32 s[4:5], v3, 0 // 000000000478: D1040004 00010103
v_add_i32 v5, vcc, 63, v3 // 000000000480: 4A0A06BF
v_and_b32 v5, 63, v5 // 000000000484: 360A0ABF
v_lshl_b64 v[6:7], 1, v5 // 000000000488: D2C20006 00020A81
v_and_b32 v6, exec_lo, v6 // 000000000490: 360C0C7E
v_and_b32 v7, exec_hi, v7 // 000000000494: 360E0E7F
v_cmp_ne_u64 vcc, 0, v[6:7] // 000000000498: 7DCA0C80
s_orn2_b64 s[4:5], s[4:5], vcc // 00000000049C: 8B846A04
s_mov_b64 s[8:9], exec // 0000000004A0: BE88047E
s_mov_b64 s[10:11], exec // 0000000004A4: BE8A047E
label_012A:
v_readfirstlane_b32 s1, v5 // 0000000004A8: 7E020505
v_cmp_eq_u32 vcc, s1, v5 // 0000000004AC: 7D840A01
s_and_saveexec_b64 s[12:13], vcc // 0000000004B0: BE8C246A
v_readlane_b32 s1, v2, s1 // 0000000004B4: 02020302
s_cbranch_execz label_0132 // 0000000004B8: BF880003
v_mov_b32 v5, s1 // 0000000004BC: 7E0A0201
s_andn2_b64 s[10:11], s[10:11], exec // 0000000004C0: 8A8A7E0A
s_cbranch_scc0 label_0134 // 0000000004C4: BF840002
label_0132:
s_and_b64 exec, s[12:13], s[10:11] // 0000000004C8: 87FE0A0C
s_branch label_012A // 0000000004CC: BF82FFF6
label_0134:
s_mov_b64 exec, s[8:9] // 0000000004D0: BEFE0408
v_cndmask_b32 v5, v5, 0, s[4:5] // 0000000004D4: D2000005 00110105
v_add_i32 v2, vcc, v5, v2 // 0000000004DC: 4A040505
v_cmp_lt_u32 s[4:5], v3, 2 // 0000000004E0: D1820004 00010503
v_add_i32 v5, vcc, 62, v3 // 0000000004E8: 4A0A06BE
v_and_b32 v5, 63, v5 // 0000000004EC: 360A0ABF
v_lshl_b64 v[6:7], 1, v5 // 0000000004F0: D2C20006 00020A81
v_and_b32 v6, exec_lo, v6 // 0000000004F8: 360C0C7E
v_and_b32 v7, exec_hi, v7 // 0000000004FC: 360E0E7F
v_cmp_ne_u64 vcc, 0, v[6:7] // 000000000500: 7DCA0C80
s_orn2_b64 s[4:5], s[4:5], vcc // 000000000504: 8B846A04
s_mov_b64 s[8:9], exec // 000000000508: BE88047E
s_mov_b64 s[10:11], exec // 00000000050C: BE8A047E
label_0144:
v_readfirstlane_b32 s1, v5 // 000000000510: 7E020505
v_cmp_eq_u32 vcc, s1, v5 // 000000000514: 7D840A01
s_and_saveexec_b64 s[12:13], vcc // 000000000518: BE8C246A
v_readlane_b32 s1, v2, s1 // 00000000051C: 02020302
s_cbranch_execz label_014C // 000000000520: BF880003
v_mov_b32 v5, s1 // 000000000524: 7E0A0201
s_andn2_b64 s[10:11], s[10:11], exec // 000000000528: 8A8A7E0A
s_cbranch_scc0 label_014E // 00000000052C: BF840002
label_014C:
s_and_b64 exec, s[12:13], s[10:11] // 000000000530: 87FE0A0C
s_branch label_0144 // 000000000534: BF82FFF6
label_014E:
s_mov_b64 exec, s[8:9] // 000000000538: BEFE0408
v_cndmask_b32 v5, v5, 0, s[4:5] // 00000000053C: D2000005 00110105
v_add_i32 v2, vcc, v5, v2 // 000000000544: 4A040505
v_cmp_lt_u32 s[4:5], v3, 4 // 000000000548: D1820004 00010903
v_add_i32 v5, vcc, 60, v3 // 000000000550: 4A0A06BC
v_and_b32 v5, 63, v5 // 000000000554: 360A0ABF
v_lshl_b64 v[6:7], 1, v5 // 000000000558: D2C20006 00020A81
v_and_b32 v6, exec_lo, v6 // 000000000560: 360C0C7E
v_and_b32 v7, exec_hi, v7 // 000000000564: 360E0E7F
v_cmp_ne_u64 vcc, 0, v[6:7] // 000000000568: 7DCA0C80
s_orn2_b64 s[4:5], s[4:5], vcc // 00000000056C: 8B846A04
s_mov_b64 s[8:9], exec // 000000000570: BE88047E
s_mov_b64 s[10:11], exec // 000000000574: BE8A047E
label_015E:
v_readfirstlane_b32 s1, v5 // 000000000578: 7E020505
v_cmp_eq_u32 vcc, s1, v5 // 00000000057C: 7D840A01
s_and_saveexec_b64 s[12:13], vcc // 000000000580: BE8C246A
v_readlane_b32 s1, v2, s1 // 000000000584: 02020302
s_cbranch_execz label_0166 // 000000000588: BF880003
v_mov_b32 v5, s1 // 00000000058C: 7E0A0201
s_andn2_b64 s[10:11], s[10:11], exec // 000000000590: 8A8A7E0A
s_cbranch_scc0 label_0168 // 000000000594: BF840002
label_0166:
s_and_b64 exec, s[12:13], s[10:11] // 000000000598: 87FE0A0C
s_branch label_015E // 00000000059C: BF82FFF6
label_0168:
s_mov_b64 exec, s[8:9] // 0000000005A0: BEFE0408
v_cndmask_b32 v5, v5, 0, s[4:5] // 0000000005A4: D2000005 00110105
v_add_i32 v6, vcc, 56, v3 // 0000000005AC: 4A0C06B8
v_add_i32 v2, vcc, v5, v2 // 0000000005B0: 4A040505
v_cmp_lt_u32 s[4:5], v3, 8 // 0000000005B4: D1820004 00011103
v_and_b32 v5, 63, v6 // 0000000005BC: 360A0CBF
v_lshl_b64 v[6:7], 1, v5 // 0000000005C0: D2C20006 00020A81
v_and_b32 v6, exec_lo, v6 // 0000000005C8: 360C0C7E
v_and_b32 v7, exec_hi, v7 // 0000000005CC: 360E0E7F
v_cmp_ne_u64 vcc, 0, v[6:7] // 0000000005D0: 7DCA0C80
s_orn2_b64 s[4:5], s[4:5], vcc // 0000000005D4: 8B846A04
s_mov_b64 s[8:9], exec // 0000000005D8: BE88047E
s_mov_b64 s[10:11], exec // 0000000005DC: BE8A047E
label_0178:
v_readfirstlane_b32 s1, v5 // 0000000005E0: 7E020505
v_cmp_eq_u32 vcc, s1, v5 // 0000000005E4: 7D840A01
s_and_saveexec_b64 s[12:13], vcc // 0000000005E8: BE8C246A
v_readlane_b32 s1, v2, s1 // 0000000005EC: 02020302
s_cbranch_execz label_0180 // 0000000005F0: BF880003
v_mov_b32 v5, s1 // 0000000005F4: 7E0A0201
s_andn2_b64 s[10:11], s[10:11], exec // 0000000005F8: 8A8A7E0A
s_cbranch_scc0 label_0182 // 0000000005FC: BF840002
label_0180:
s_and_b64 exec, s[12:13], s[10:11] // 000000000600: 87FE0A0C
s_branch label_0178 // 000000000604: BF82FFF6
label_0182:
s_mov_b64 exec, s[8:9] // 000000000608: BEFE0408
v_cndmask_b32 v5, v5, 0, s[4:5] // 00000000060C: D2000005 00110105
v_add_i32 v6, vcc, 48, v3 // 000000000614: 4A0C06B0
v_and_b32 v6, 63, v6 // 000000000618: 360C0CBF
v_add_i32 v2, vcc, v5, v2 // 00000000061C: 4A040505
v_cmp_lt_u32 s[4:5], v3, 16 // 000000000620: D1820004 00012103
v_lshl_b64 v[7:8], 1, v6 // 000000000628: D2C20007 00020C81
v_and_b32 v7, exec_lo, v7 // 000000000630: 360E0E7E
v_and_b32 v8, exec_hi, v8 // 000000000634: 3610107F
v_cmp_ne_u64 vcc, 0, v[7:8] // 000000000638: 7DCA0E80
s_orn2_b64 s[4:5], s[4:5], vcc // 00000000063C: 8B846A04
s_mov_b64 s[8:9], exec // 000000000640: BE88047E
s_mov_b64 s[10:11], exec // 000000000644: BE8A047E
label_0192:
v_readfirstlane_b32 s1, v6 // 000000000648: 7E020506
v_cmp_eq_u32 vcc, s1, v6 // 00000000064C: 7D840C01
s_and_saveexec_b64 s[12:13], vcc // 000000000650: BE8C246A
v_readlane_b32 s1, v2, s1 // 000000000654: 02020302
s_cbranch_execz label_019A // 000000000658: BF880003
v_mov_b32 v5, s1 // 00000000065C: 7E0A0201
s_andn2_b64 s[10:11], s[10:11], exec // 000000000660: 8A8A7E0A
s_cbranch_scc0 label_019C // 000000000664: BF840002
label_019A:
s_and_b64 exec, s[12:13], s[10:11] // 000000000668: 87FE0A0C
s_branch label_0192 // 00000000066C: BF82FFF6
label_019C:
s_mov_b64 exec, s[8:9] // 000000000670: BEFE0408
v_cndmask_b32 v5, v5, 0, s[4:5] // 000000000674: D2000005 00110105
v_add_i32 v6, vcc, 32, v3 // 00000000067C: 4A0C06A0
v_and_b32 v6, 63, v6 // 000000000680: 360C0CBF
v_add_i32 v2, vcc, v5, v2 // 000000000684: 4A040505
v_cmp_lt_u32 s[4:5], v3, 32 // 000000000688: D1820004 00014103
v_lshl_b64 v[7:8], 1, v6 // 000000000690: D2C20007 00020C81
v_and_b32 v7, exec_lo, v7 // 000000000698: 360E0E7E
v_and_b32 v8, exec_hi, v8 // 00000000069C: 3610107F
v_cmp_ne_u64 vcc, 0, v[7:8] // 0000000006A0: 7DCA0E80
s_orn2_b64 s[4:5], s[4:5], vcc // 0000000006A4: 8B846A04
s_mov_b64 s[8:9], exec // 0000000006A8: BE88047E
s_mov_b64 s[10:11], exec // 0000000006AC: BE8A047E
label_01AC:
v_readfirstlane_b32 s1, v6 // 0000000006B0: 7E020506
v_cmp_eq_u32 vcc, s1, v6 // 0000000006B4: 7D840C01
s_and_saveexec_b64 s[12:13], vcc // 0000000006B8: BE8C246A
v_readlane_b32 s1, v2, s1 // 0000000006BC: 02020302
s_cbranch_execz label_01B4 // 0000000006C0: BF880003
v_mov_b32 v5, s1 // 0000000006C4: 7E0A0201
s_andn2_b64 s[10:11], s[10:11], exec // 0000000006C8: 8A8A7E0A
s_cbranch_scc0 label_01B6 // 0000000006CC: BF840002
label_01B4:
s_and_b64 exec, s[12:13], s[10:11] // 0000000006D0: 87FE0A0C
s_branch label_01AC // 0000000006D4: BF82FFF6
label_01B6:
s_mov_b64 exec, s[8:9] // 0000000006D8: BEFE0408
v_cndmask_b32 v5, v5, 0, s[4:5] // 0000000006DC: D2000005 00110105
s_add_u32 s1, s2, -1 // 0000000006E4: 8001C102
v_add_i32 v2, vcc, v5, v2 // 0000000006E8: 4A040505
v_mov_b32 v5, s1 // 0000000006EC: 7E0A0201
v_lshl_b64 v[5:6], 1, v5 // 0000000006F0: D2C20005 00020A81
v_and_b32 v5, exec_lo, v5 // 0000000006F8: 360A0A7E
v_and_b32 v6, exec_hi, v6 // 0000000006FC: 360C0C7F
v_cmp_ne_u64 vcc, 0, v[5:6] // 000000000700: 7DCA0A80
v_readlane_b32 s1, v2, s1 // 000000000704: 02020302
v_mov_b32 v2, s1 // 000000000708: 7E040201
v_cndmask_b32 v7, 0, v2, vcc // 00000000070C: 000E0480
label_01C4:
s_cmp_eq_i32 s0, 1 // 000000000710: BF008100
s_cbranch_scc1 label_0320 // 000000000714: BF85015A
v_lshrrev_b32 v5, 6, v1 // 000000000718: 2C0A0286
v_and_b32 v1, 63, v1 // 00000000071C: 360202BF
v_cmp_ne_i32 s[4:5], v1, 0 // 000000000720: D10A0004 00010101
s_mov_b64 s[8:9], exec // 000000000728: BE88047E
s_andn2_b64 exec, s[8:9], s[4:5] // 00000000072C: 8AFE0408
v_lshlrev_b32 v6, 2, v5 // 000000000730: 340C0A82
ds_write_b32 v6, v7 // 000000000734: D8340000 00000706
s_mov_b64 exec, s[8:9] // 00000000073C: BEFE0408
s_waitcnt lgkmcnt(0) // 000000000740: BF8C007F
s_barrier // 000000000744: BF8A0000
v_cmp_eq_i32 vcc, 0, v5 // 000000000748: 7D040A80
s_and_b64 exec, s[8:9], vcc // 00000000074C: 87FE6A08
v_cmp_gt_u32 vcc, s0, v1 // 000000000750: 7D880200
s_cbranch_execz label_0318 // 000000000754: BF880142
s_and_saveexec_b64 s[0:1], vcc // 000000000758: BE80246A
v_lshlrev_b32 v1, 2, v1 // 00000000075C: 34020282
ds_read_b32 v1, v1 // 000000000760: D8D80000 01000001
s_andn2_b64 exec, s[0:1], exec // 000000000768: 8AFE7E00
v_mov_b32 v1, 0 // 00000000076C: 7E020280
s_mov_b64 exec, s[0:1] // 000000000770: BEFE0400
v_mov_b32 v2, 1 // 000000000774: 7E040281
v_cmp_eq_i32 vcc, v4, v2 // 000000000778: 7D040504
s_cbranch_vccz label_026C // 00000000077C: BF86008C
v_xor_b32 v2, 1, v3 // 000000000780: 3A040681
v_lshl_b64 v[4:5], 1, v2 // 000000000784: D2C20004 00020481
v_and_b32 v4, exec_lo, v4 // 00000000078C: 3608087E
v_and_b32 v5, exec_hi, v5 // 000000000790: 360A0A7F
v_cmp_ne_u64 s[0:1], 0, v[4:5] // 000000000794: D1CA0000 00020880
s_mov_b64 s[2:3], exec // 00000000079C: BE82047E
s_mov_b64 s[10:11], exec // 0000000007A0: BE8A047E
label_01E9:
v_readfirstlane_b32 s12, v2 // 0000000007A4: 7E180502
v_cmp_eq_u32 vcc, s12, v2 // 0000000007A8: 7D84040C
s_and_saveexec_b64 s[14:15], vcc // 0000000007AC: BE8E246A
s_waitcnt lgkmcnt(0) // 0000000007B0: BF8C007F
v_readlane_b32 s12, v1, s12 // 0000000007B4: 02181901
s_cbranch_execz label_01F2 // 0000000007B8: BF880003
v_mov_b32 v2, s12 // 0000000007BC: 7E04020C
s_andn2_b64 s[10:11], s[10:11], exec // 0000000007C0: 8A8A7E0A
s_cbranch_scc0 label_01F4 // 0000000007C4: BF840002
label_01F2:
s_mov_b64 exec, s[10:11] // 0000000007C8: BEFE040A
s_branch label_01E9 // 0000000007CC: BF82FFF5
label_01F4:
s_mov_b64 exec, s[2:3] // 0000000007D0: BEFE0402
v_cndmask_b32 v2, 0, v2, s[0:1] // 0000000007D4: D2000002 00020480
v_add_i32 v1, vcc, v2, v1 // 0000000007DC: 4A020302
v_xor_b32 v2, 2, v3 // 0000000007E0: 3A040682
v_lshl_b64 v[4:5], 1, v2 // 0000000007E4: D2C20004 00020481
v_and_b32 v4, exec_lo, v4 // 0000000007EC: 3608087E
v_and_b32 v5, exec_hi, v5 // 0000000007F0: 360A0A7F
v_cmp_ne_u64 s[0:1], 0, v[4:5] // 0000000007F4: D1CA0000 00020880
s_mov_b64 s[2:3], exec // 0000000007FC: BE82047E
s_mov_b64 s[10:11], exec // 000000000800: BE8A047E
label_0201:
v_readfirstlane_b32 s12, v2 // 000000000804: 7E180502
v_cmp_eq_u32 vcc, s12, v2 // 000000000808: 7D84040C
s_and_saveexec_b64 s[14:15], vcc // 00000000080C: BE8E246A
v_readlane_b32 s12, v1, s12 // 000000000810: 02181901
s_cbranch_execz label_0209 // 000000000814: BF880003
v_mov_b32 v2, s12 // 000000000818: 7E04020C
s_andn2_b64 s[10:11], s[10:11], exec // 00000000081C: 8A8A7E0A
s_cbranch_scc0 label_020B // 000000000820: BF840002
label_0209:
s_mov_b64 exec, s[10:11] // 000000000824: BEFE040A
s_branch label_0201 // 000000000828: BF82FFF6
label_020B:
s_mov_b64 exec, s[2:3] // 00000000082C: BEFE0402
v_cndmask_b32 v2, 0, v2, s[0:1] // 000000000830: D2000002 00020480
v_add_i32 v1, vcc, v2, v1 // 000000000838: 4A020302
v_xor_b32 v2, 4, v3 // 00000000083C: 3A040684
v_lshl_b64 v[4:5], 1, v2 // 000000000840: D2C20004 00020481
v_and_b32 v4, exec_lo, v4 // 000000000848: 3608087E
v_and_b32 v5, exec_hi, v5 // 00000000084C: 360A0A7F
v_cmp_ne_u64 s[0:1], 0, v[4:5] // 000000000850: D1CA0000 00020880
s_mov_b64 s[2:3], exec // 000000000858: BE82047E
s_mov_b64 s[10:11], exec // 00000000085C: BE8A047E
label_0218:
v_readfirstlane_b32 s12, v2 // 000000000860: 7E180502
v_cmp_eq_u32 vcc, s12, v2 // 000000000864: 7D84040C
s_and_saveexec_b64 s[14:15], vcc // 000000000868: BE8E246A
v_readlane_b32 s12, v1, s12 // 00000000086C: 02181901
s_cbranch_execz label_0220 // 000000000870: BF880003
v_mov_b32 v2, s12 // 000000000874: 7E04020C
s_andn2_b64 s[10:11], s[10:11], exec // 000000000878: 8A8A7E0A
s_cbranch_scc0 label_0222 // 00000000087C: BF840002
label_0220:
s_mov_b64 exec, s[10:11] // 000000000880: BEFE040A
s_branch label_0218 // 000000000884: BF82FFF6
label_0222:
s_mov_b64 exec, s[2:3] // 000000000888: BEFE0402
v_cndmask_b32 v2, 0, v2, s[0:1] // 00000000088C: D2000002 00020480
v_add_i32 v1, vcc, v2, v1 // 000000000894: 4A020302
v_xor_b32 v2, 32, v3 // 000000000898: 3A0406A0
v_xor_b32 v4, 16, v3 // 00000000089C: 3A080690
v_xor_b32 v3, 8, v3 // 0000000008A0: 3A060688
v_lshl_b64 v[5:6], 1, v3 // 0000000008A4: D2C20005 00020681
v_and_b32 v5, exec_lo, v5 // 0000000008AC: 360A0A7E
v_and_b32 v6, exec_hi, v6 // 0000000008B0: 360C0C7F
v_cmp_ne_u64 s[0:1], 0, v[5:6] // 0000000008B4: D1CA0000 00020A80
s_mov_b64 s[2:3], exec // 0000000008BC: BE82047E
s_mov_b64 s[10:11], exec // 0000000008C0: BE8A047E
label_0231:
v_readfirstlane_b32 s12, v3 // 0000000008C4: 7E180503
v_cmp_eq_u32 vcc, s12, v3 // 0000000008C8: 7D84060C
s_and_saveexec_b64 s[14:15], vcc // 0000000008CC: BE8E246A
v_readlane_b32 s12, v1, s12 // 0000000008D0: 02181901
s_cbranch_execz label_0239 // 0000000008D4: BF880003
v_mov_b32 v3, s12 // 0000000008D8: 7E06020C
s_andn2_b64 s[10:11], s[10:11], exec // 0000000008DC: 8A8A7E0A
s_cbranch_scc0 label_023B // 0000000008E0: BF840002
label_0239:
s_mov_b64 exec, s[10:11] // 0000000008E4: BEFE040A
s_branch label_0231 // 0000000008E8: BF82FFF6
label_023B:
s_mov_b64 exec, s[2:3] // 0000000008EC: BEFE0402
v_cndmask_b32 v3, 0, v3, s[0:1] // 0000000008F0: D2000003 00020680
v_add_i32 v1, vcc, v3, v1 // 0000000008F8: 4A020303
v_lshl_b64 v[5:6], 1, v4 // 0000000008FC: D2C20005 00020881
v_and_b32 v5, exec_lo, v5 // 000000000904: 360A0A7E
v_and_b32 v6, exec_hi, v6 // 000000000908: 360C0C7F
v_cmp_ne_u64 s[0:1], 0, v[5:6] // 00000000090C: D1CA0000 00020A80
s_mov_b64 s[2:3], exec // 000000000914: BE82047E
s_mov_b64 s[10:11], exec // 000000000918: BE8A047E
label_0247:
v_readfirstlane_b32 s12, v4 // 00000000091C: 7E180504
v_cmp_eq_u32 vcc, s12, v4 // 000000000920: 7D84080C
s_and_saveexec_b64 s[14:15], vcc // 000000000924: BE8E246A
v_readlane_b32 s12, v1, s12 // 000000000928: 02181901
s_cbranch_execz label_024F // 00000000092C: BF880003
v_mov_b32 v3, s12 // 000000000930: 7E06020C
s_andn2_b64 s[10:11], s[10:11], exec // 000000000934: 8A8A7E0A
s_cbranch_scc0 label_0251 // 000000000938: BF840002
label_024F:
s_mov_b64 exec, s[10:11] // 00000000093C: BEFE040A
s_branch label_0247 // 000000000940: BF82FFF6
label_0251:
s_mov_b64 exec, s[2:3] // 000000000944: BEFE0402
v_cndmask_b32 v3, 0, v3, s[0:1] // 000000000948: D2000003 00020680
v_add_i32 v1, vcc, v3, v1 // 000000000950: 4A020303
v_lshl_b64 v[3:4], 1, v2 // 000000000954: D2C20003 00020481
v_and_b32 v3, exec_lo, v3 // 00000000095C: 3606067E
v_and_b32 v4, exec_hi, v4 // 000000000960: 3608087F
v_cmp_ne_u64 s[0:1], 0, v[3:4] // 000000000964: D1CA0000 00020680
s_mov_b64 s[2:3], exec // 00000000096C: BE82047E
s_mov_b64 s[10:11], exec // 000000000970: BE8A047E
label_025D:
v_readfirstlane_b32 s12, v2 // 000000000974: 7E180502
v_cmp_eq_u32 vcc, s12, v2 // 000000000978: 7D84040C
s_and_saveexec_b64 s[14:15], vcc // 00000000097C: BE8E246A
v_readlane_b32 s12, v1, s12 // 000000000980: 02181901
s_cbranch_execz label_0265 // 000000000984: BF880003
v_mov_b32 v2, s12 // 000000000988: 7E04020C
s_andn2_b64 s[10:11], s[10:11], exec // 00000000098C: 8A8A7E0A
s_cbranch_scc0 label_0267 // 000000000990: BF840002
label_0265:
s_mov_b64 exec, s[10:11] // 000000000994: BEFE040A
s_branch label_025D // 000000000998: BF82FFF6
label_0267:
s_mov_b64 exec, s[2:3] // 00000000099C: BEFE0402
v_cndmask_b32 v2, 0, v2, s[0:1] // 0000000009A0: D2000002 00020480
v_add_i32 v1, vcc, v2, v1 // 0000000009A8: 4A020302
s_branch label_0313 // 0000000009AC: BF8200A7
label_026C:
v_cmp_eq_i32 s[0:1], v3, 0 // 0000000009B0: D1040000 00010103
v_add_i32 v2, vcc, 63, v3 // 0000000009B8: 4A0406BF
v_and_b32 v2, 63, v2 // 0000000009BC: 360404BF
v_lshl_b64 v[4:5], 1, v2 // 0000000009C0: D2C20004 00020481
v_and_b32 v4, exec_lo, v4 // 0000000009C8: 3608087E
v_and_b32 v5, exec_hi, v5 // 0000000009CC: 360A0A7F
v_cmp_ne_u64 vcc, 0, v[4:5] // 0000000009D0: 7DCA0880
s_orn2_b64 s[0:1], s[0:1], vcc // 0000000009D4: 8B806A00
s_mov_b64 s[10:11], exec // 0000000009D8: BE8A047E
s_mov_b64 s[12:13], exec // 0000000009DC: BE8C047E
label_0278:
v_readfirstlane_b32 s3, v2 // 0000000009E0: 7E060502
v_cmp_eq_u32 vcc, s3, v2 // 0000000009E4: 7D840403
s_and_saveexec_b64 s[14:15], vcc // 0000000009E8: BE8E246A
s_waitcnt lgkmcnt(0) // 0000000009EC: BF8C007F
v_readlane_b32 s3, v1, s3 // 0000000009F0: 02060701
s_cbranch_execz label_0281 // 0000000009F4: BF880003
v_mov_b32 v2, s3 // 0000000009F8: 7E040203
s_andn2_b64 s[12:13], s[12:13], exec // 0000000009FC: 8A8C7E0C
s_cbranch_scc0 label_0283 // 000000000A00: BF840002
label_0281:
s_mov_b64 exec, s[12:13] // 000000000A04: BEFE040C
s_branch label_0278 // 000000000A08: BF82FFF5
label_0283:
s_mov_b64 exec, s[10:11] // 000000000A0C: BEFE040A
v_cndmask_b32 v2, v2, 0, s[0:1] // 000000000A10: D2000002 00010102
v_add_i32 v1, vcc, v2, v1 // 000000000A18: 4A020302
v_cmp_lt_u32 s[0:1], v3, 2 // 000000000A1C: D1820000 00010503
v_add_i32 v2, vcc, 62, v3 // 000000000A24: 4A0406BE
v_and_b32 v2, 63, v2 // 000000000A28: 360404BF
v_lshl_b64 v[4:5], 1, v2 // 000000000A2C: D2C20004 00020481
v_and_b32 v4, exec_lo, v4 // 000000000A34: 3608087E
v_and_b32 v5, exec_hi, v5 // 000000000A38: 360A0A7F
v_cmp_ne_u64 vcc, 0, v[4:5] // 000000000A3C: 7DCA0880
s_orn2_b64 s[0:1], s[0:1], vcc // 000000000A40: 8B806A00
s_mov_b64 s[10:11], exec // 000000000A44: BE8A047E
s_mov_b64 s[12:13], exec // 000000000A48: BE8C047E
label_0293:
v_readfirstlane_b32 s3, v2 // 000000000A4C: 7E060502
v_cmp_eq_u32 vcc, s3, v2 // 000000000A50: 7D840403
s_and_saveexec_b64 s[14:15], vcc // 000000000A54: BE8E246A
v_readlane_b32 s3, v1, s3 // 000000000A58: 02060701
s_cbranch_execz label_029B // 000000000A5C: BF880003
v_mov_b32 v2, s3 // 000000000A60: 7E040203
s_andn2_b64 s[12:13], s[12:13], exec // 000000000A64: 8A8C7E0C
s_cbranch_scc0 label_029D // 000000000A68: BF840002
label_029B:
s_mov_b64 exec, s[12:13] // 000000000A6C: BEFE040C
s_branch label_0293 // 000000000A70: BF82FFF6
label_029D:
s_mov_b64 exec, s[10:11] // 000000000A74: BEFE040A
v_cndmask_b32 v2, v2, 0, s[0:1] // 000000000A78: D2000002 00010102
v_add_i32 v1, vcc, v2, v1 // 000000000A80: 4A020302
v_cmp_lt_u32 s[0:1], v3, 4 // 000000000A84: D1820000 00010903
v_add_i32 v2, vcc, 60, v3 // 000000000A8C: 4A0406BC
v_and_b32 v2, 63, v2 // 000000000A90: 360404BF
v_lshl_b64 v[4:5], 1, v2 // 000000000A94: D2C20004 00020481
v_and_b32 v4, exec_lo, v4 // 000000000A9C: 3608087E
v_and_b32 v5, exec_hi, v5 // 000000000AA0: 360A0A7F
v_cmp_ne_u64 vcc, 0, v[4:5] // 000000000AA4: 7DCA0880
s_orn2_b64 s[0:1], s[0:1], vcc // 000000000AA8: 8B806A00
s_mov_b64 s[10:11], exec // 000000000AAC: BE8A047E
s_mov_b64 s[12:13], exec // 000000000AB0: BE8C047E
label_02AD:
v_readfirstlane_b32 s3, v2 // 000000000AB4: 7E060502
v_cmp_eq_u32 vcc, s3, v2 // 000000000AB8: 7D840403
s_and_saveexec_b64 s[14:15], vcc // 000000000ABC: BE8E246A
v_readlane_b32 s3, v1, s3 // 000000000AC0: 02060701
s_cbranch_execz label_02B5 // 000000000AC4: BF880003
v_mov_b32 v2, s3 // 000000000AC8: 7E040203
s_andn2_b64 s[12:13], s[12:13], exec // 000000000ACC: 8A8C7E0C
s_cbranch_scc0 label_02B7 // 000000000AD0: BF840002
label_02B5:
s_mov_b64 exec, s[12:13] // 000000000AD4: BEFE040C
s_branch label_02AD // 000000000AD8: BF82FFF6
label_02B7:
s_mov_b64 exec, s[10:11] // 000000000ADC: BEFE040A
v_cndmask_b32 v2, v2, 0, s[0:1] // 000000000AE0: D2000002 00010102
v_add_i32 v4, vcc, 56, v3 // 000000000AE8: 4A0806B8
v_add_i32 v1, vcc, v2, v1 // 000000000AEC: 4A020302
v_cmp_lt_u32 s[0:1], v3, 8 // 000000000AF0: D1820000 00011103
v_and_b32 v2, 63, v4 // 000000000AF8: 360408BF
v_lshl_b64 v[4:5], 1, v2 // 000000000AFC: D2C20004 00020481
v_and_b32 v4, exec_lo, v4 // 000000000B04: 3608087E
v_and_b32 v5, exec_hi, v5 // 000000000B08: 360A0A7F
v_cmp_ne_u64 vcc, 0, v[4:5] // 000000000B0C: 7DCA0880
s_orn2_b64 s[0:1], s[0:1], vcc // 000000000B10: 8B806A00
s_mov_b64 s[10:11], exec // 000000000B14: BE8A047E
s_mov_b64 s[12:13], exec // 000000000B18: BE8C047E
label_02C7:
v_readfirstlane_b32 s3, v2 // 000000000B1C: 7E060502
v_cmp_eq_u32 vcc, s3, v2 // 000000000B20: 7D840403
s_and_saveexec_b64 s[14:15], vcc // 000000000B24: BE8E246A
v_readlane_b32 s3, v1, s3 // 000000000B28: 02060701
s_cbranch_execz label_02CF // 000000000B2C: BF880003
v_mov_b32 v2, s3 // 000000000B30: 7E040203
s_andn2_b64 s[12:13], s[12:13], exec // 000000000B34: 8A8C7E0C
s_cbranch_scc0 label_02D1 // 000000000B38: BF840002
label_02CF:
s_mov_b64 exec, s[12:13] // 000000000B3C: BEFE040C
s_branch label_02C7 // 000000000B40: BF82FFF6
label_02D1:
s_mov_b64 exec, s[10:11] // 000000000B44: BEFE040A
v_cndmask_b32 v2, v2, 0, s[0:1] // 000000000B48: D2000002 00010102
v_add_i32 v4, vcc, 48, v3 // 000000000B50: 4A0806B0
v_and_b32 v4, 63, v4 // 000000000B54: 360808BF
v_add_i32 v1, vcc, v2, v1 // 000000000B58: 4A020302
v_cmp_lt_u32 s[0:1], v3, 16 // 000000000B5C: D1820000 00012103
v_lshl_b64 v[5:6], 1, v4 // 000000000B64: D2C20005 00020881
v_and_b32 v5, exec_lo, v5 // 000000000B6C: 360A0A7E
v_and_b32 v6, exec_hi, v6 // 000000000B70: 360C0C7F
v_cmp_ne_u64 vcc, 0, v[5:6] // 000000000B74: 7DCA0A80
s_orn2_b64 s[0:1], s[0:1], vcc // 000000000B78: 8B806A00
s_mov_b64 s[10:11], exec // 000000000B7C: BE8A047E
s_mov_b64 s[12:13], exec // 000000000B80: BE8C047E
label_02E1:
v_readfirstlane_b32 s3, v4 // 000000000B84: 7E060504
v_cmp_eq_u32 vcc, s3, v4 // 000000000B88: 7D840803
s_and_saveexec_b64 s[14:15], vcc // 000000000B8C: BE8E246A
v_readlane_b32 s3, v1, s3 // 000000000B90: 02060701
s_cbranch_execz label_02E9 // 000000000B94: BF880003
v_mov_b32 v2, s3 // 000000000B98: 7E040203
s_andn2_b64 s[12:13], s[12:13], exec // 000000000B9C: 8A8C7E0C
s_cbranch_scc0 label_02EB // 000000000BA0: BF840002
label_02E9:
s_mov_b64 exec, s[12:13] // 000000000BA4: BEFE040C
s_branch label_02E1 // 000000000BA8: BF82FFF6
label_02EB:
s_mov_b64 exec, s[10:11] // 000000000BAC: BEFE040A
v_cndmask_b32 v2, v2, 0, s[0:1] // 000000000BB0: D2000002 00010102
v_add_i32 v4, vcc, 32, v3 // 000000000BB8: 4A0806A0
v_and_b32 v4, 63, v4 // 000000000BBC: 360808BF
v_add_i32 v1, vcc, v2, v1 // 000000000BC0: 4A020302
v_cmp_lt_u32 s[0:1], v3, 32 // 000000000BC4: D1820000 00014103
v_lshl_b64 v[2:3], 1, v4 // 000000000BCC: D2C20002 00020881
v_and_b32 v2, exec_lo, v2 // 000000000BD4: 3604047E
v_and_b32 v3, exec_hi, v3 // 000000000BD8: 3606067F
v_cmp_ne_u64 vcc, 0, v[2:3] // 000000000BDC: 7DCA0480
s_orn2_b64 s[0:1], s[0:1], vcc // 000000000BE0: 8B806A00
s_mov_b64 s[10:11], exec // 000000000BE4: BE8A047E
s_mov_b64 s[12:13], exec // 000000000BE8: BE8C047E
label_02FB:
v_readfirstlane_b32 s3, v4 // 000000000BEC: 7E060504
v_cmp_eq_u32 vcc, s3, v4 // 000000000BF0: 7D840803
s_and_saveexec_b64 s[14:15], vcc // 000000000BF4: BE8E246A
v_readlane_b32 s3, v1, s3 // 000000000BF8: 02060701
s_cbranch_execz label_0303 // 000000000BFC: BF880003
v_mov_b32 v2, s3 // 000000000C00: 7E040203
s_andn2_b64 s[12:13], s[12:13], exec // 000000000C04: 8A8C7E0C
s_cbranch_scc0 label_0305 // 000000000C08: BF840002
label_0303:
s_mov_b64 exec, s[12:13] // 000000000C0C: BEFE040C
s_branch label_02FB // 000000000C10: BF82FFF6
label_0305:
s_mov_b64 exec, s[10:11] // 000000000C14: BEFE040A
v_cndmask_b32 v2, v2, 0, s[0:1] // 000000000C18: D2000002 00010102
s_add_u32 s0, s2, -1 // 000000000C20: 8000C102
v_add_i32 v1, vcc, v2, v1 // 000000000C24: 4A020302
v_mov_b32 v2, s0 // 000000000C28: 7E040200
v_lshl_b64 v[2:3], 1, v2 // 000000000C2C: D2C20002 00020481
v_and_b32 v2, exec_lo, v2 // 000000000C34: 3604047E
v_and_b32 v3, exec_hi, v3 // 000000000C38: 3606067F
v_cmp_ne_u64 vcc, 0, v[2:3] // 000000000C3C: 7DCA0480
v_readlane_b32 s0, v1, s0 // 000000000C40: 02000101
v_mov_b32 v1, s0 // 000000000C44: 7E020200
v_cndmask_b32 v1, 0, v1, vcc // 000000000C48: 00020280
label_0313:
s_mov_b64 s[0:1], exec // 000000000C4C: BE80047E
s_andn2_b64 exec, s[0:1], s[4:5] // 000000000C50: 8AFE0400
v_mov_b32 v2, 0 // 000000000C54: 7E040280
ds_write_b32 v2, v1 // 000000000C58: D8340000 00000102
label_0318:
s_mov_b64 exec, s[8:9] // 000000000C60: BEFE0408
s_waitcnt lgkmcnt(0) // 000000000C64: BF8C007F
s_barrier // 000000000C68: BF8A0000
v_mov_b32 v1, 0 // 000000000C6C: 7E020280
ds_read_b32 v7, v1 // 000000000C70: D8D80000 07000001
s_waitcnt lgkmcnt(0) // 000000000C78: BF8C007F
s_barrier // 000000000C7C: BF8A0000
label_0320:
v_cmp_eq_i32 vcc, 0, v0 // 000000000C80: 7D040080
s_and_saveexec_b64 s[0:1], vcc // 000000000C84: BE80246A
s_cbranch_execz label_0329 // 000000000C88: BF880006
s_load_dwordx2 s[2:3], s[6:7], 0x0c // 000000000C8C: C041070C
s_waitcnt lgkmcnt(0) // 000000000C90: BF8C007F
v_mov_b32 v2, s2 // 000000000C94: 7E040202
v_mov_b32 v3, s3 // 000000000C98: 7E060203
flat_atomic_add v0, v[2:3], v7 // 000000000C9C: DCC80000 00000702
label_0329:
s_endpgm // 000000000CA4: BF810000
end
; ----------------- CS Data ------------------------
codeLenInByte = 3752 bytes;
objectHeaderByteSize = 512 bytes;
; launchModeFlags = 0x00000200
isAql = TRUE;
userElementCount = 0;
extUserElementCount = 0;
NumVgprs = 10;
NumSgprs = 23;
FloatMode = 192;
IeeeMode = 0;
FlatPtr32 = 0;
ScratchSize = 0 dwords/thread;
LDSByteSize = 32 bytes/workgroup (compile time only);
ScratchWaveOffsetReg = s65535;
; texSamplerUsage = 0x00000000
; constBufUsage = 0x00000000
;COMPUTE_PGM_RSRC2 = 0x00009390
COMPUTE_PGM_RSRC2:USER_SGPR = 8
COMPUTE_PGM_RSRC2:TGID_X_EN = 1
COMPUTE_PGM_RSRC2:TGID_Y_EN = 1
COMPUTE_PGM_RSRC2:TGID_Z_EN = 1
COMPUTE_PGM_RSRC2:TIDIG_COMP_CNT = 2
COMPUTE_PGM_RSRC2:LDS_SIZE = 1
; ---------------------------- HSA Code Descriptor ----------------------------
code_type = HSA_EXT_CODE_KERNEL
workgroup_group_segment_byte_size = 32
kernarg_segment_byte_size = 56
code.handle = 256%
kernarg_segment_alignment = 3
group_segment_alignment = 3
wavefront_size = 6
program_call_convention = 0
module.handle = 0
symbol = 3932
hsail_profile = HSA_EXT_BRIG_PROFILE_FULL
hsail_machine_model = HSA_EXT_BRIG_MACHINE_LARGE
debug_information.handle = 0
agent_vendor = AMD
agent_name = gfx7
hsail_version_major = 0
hsail_version_minor = 20140528
; ---------------------------- AMD Code Descriptor ----------------------------
amd_code_version_major = 0
amd_code_version_minor = 0
struct_byte_size = 256
target_chip = gfx7
kernel_code_entry_byte_offset = 256
max_scratch_backing_memory_byte_size = 0
; compute_pgm_rsrc1:
granulated_workitem_vgpr_count = 2
granulated_wavefront_sgpr_count = 3
priority = 0
float_mode_round_32 = ROUND_TO_NEAREST_EVEN
float_mode_round_16_64 = ROUND_TO_NEAREST_EVEN
float_mode_denorm_32 = FLUSH_SOURCE_OUTPUT
float_mode_denorm_16_64 = FLUSH_NONE
; compute_pgm_rsrc2:
user_sgpr_count = 8
enable_sgpr_workgroup_id_x = TRUE
enable_sgpr_workgroup_id_y = TRUE
enable_sgpr_workgroup_id_z = TRUE
enable_vgpr_workitem_id = X, Y, Z
granulated_lds_size = 0
enable_sgpr_private_segment_buffer = TRUE
enable_sgpr_dispatch_ptr = TRUE
enable_sgpr_kernarg_segment_ptr = TRUE
private_element_size = DWORD (4 bytes)
is_ptr64 = TRUE
gds_segment_byte_size = 0
wavefront_sgpr_count = 23
workitem_vgpr_count = 10
kernarg_segment_byte_size = 56
workgroup_group_segment_byte_size = 32
kernarg_segment_alignment = 3
group_segment_alignment = 3
code_alignment = 3
code_type = HSA_EXT_CODE_KERNEL
wavefront_size = 6
optimization_level = 2
hsail_profile = HSA_EXT_BRIG_PROFILE_FULL
hsail_machine_model = HSA_EXT_BRIG_MACHINE_LARGE
hsail_version_major = 0
hsail_version_minor = 20140528
Such long code leads to low performance reductions. What is the actual cause of all these instructions? Does it have to do with the supporting ISA of Bonaire? Does it support the so called "swizzle operations" for exchanging data between lanes in the same wavefront? Or is it just a driver bug/limitation?
Thank you. Just wanted to let you know that we are looking into it and will get back to you.
Regards,
Ravi
Here is an update. A internal bug report has been filed against this issue. Will keep you updated. Thanks for your patience.
Great!
I also tried with the latest driver (15.7). I have to report that though the performance has been improved it does not match the shared memory version. On a Bonaire GPU using partially shared memory and subgroup functions exhibits performance of about 67% of the shared memory only and using the workgroup function shows performance of about a third (34%) of the shared memory only version.
Workgroup and sub-workgroup OpenCL 2.0 function evaluation test case
Platform/Device selection
Total platforms: 2
1. NVIDIA CUDA
1. GeForce GTX 660/NVIDIA Corporation
2. AMD Accelerated Parallel Processing
1. Bonaire/Advanced Micro Devices, Inc.
2. Intel(R) Pentium(R) 4 CPU 3.06GHz/GenuineIntel
Select platform index: 2
Select device index: 1
Device info
Platform: AMD Accelerated Parallel Processing
Device: Bonaire
Driver version: 1800.5 (VM)
OpenCL version: OpenCL 2.0 AMD-APP (1800.5)
Great! OpenCL 2.0 is supported 🙂
Building kernel with options "-cl-std=CL2.0 -cl-uniform-work-group-size -DK3 -DK2 -DWAVEFRONT_SIZE=64"
1. Shared memory only kernel
Executing...Done!
Output: 4294901760 / Time: 0.0706873 msecs (1.85425 billion elements/second)
PASSED!
2. Hybrid kernel via subgroup functions
Executing...Done!
Output: 4294901760 / Time: 0.105353 msecs (1.24413 billion elements/second)
Relative speed-up with respect to kernel 1: 0.67096
PASSED!
3. Workgroup function kernel
Executing...Done!
Output: 4294901760 / Time: 0.203889 msecs (0.64286 billion elements/second)
Relative speed-up with respect to kernel 1: 0.346695
PASSED!
Let me revive this thread. I would have great use of lane shuffle operations exposed, but to most purposes the OpenCL 2.0 workgroup reduction (and some voting ops) would be enough. However, i'd have to be able to trust the implementation do the right thing.
Except that even though the OpenCL 2.0 driver is (officially) out of beta, the workgroup funtions' implementation in the OpenCL 2.0 compiler/runtime seems to still be rather poor. With the OPs microbenchmark, for some workload sizes the "hybrid" reduction can be up to 20% faster on Fiji, but for other sizes it's slower. On Fiji the pure wg version seemed to always be 1.5-2x slower than the shmem-based reduction.
On Hawaii I get very poor performance, the hybrid version is 2-4x slower, the pure WG version 6-10x slower.
I had a hunch and tweaked the hybrid version to let the local memory-based reduction run down to 32 and 16 and do sub_group_reduce_add() across half and quarter vawefront only. Interestingly, this improved the performance on Hawaii; best case I got nearly 2x with sub_group_reduce across a qurter wave on Hawaii with large global work sizes (making Hawaii hierarchical reduction best case 10% slower than local mem-based). On Fiji the same tweaks reduced performance quite a bit.
What's strange is that explicit unrolling of the local mem reduction loop causes the performance to regress. Any ideas why?
All in all, my results show a pretty uncomfortable reality: the advantages of the OCL 2.0 wg functions are greatly diminished by the quality of the compiler / runtime.
According to my experiments on a 3rd generation GCN GPU (Tonga) I confirm that subgroup reductions provided slightly improved performance compared to shared mem reduction. However, workgroup reduction is always slower. Maybe workgroup data sharing is not effective without the use of shared memory. Furthermore, using a 2nd gen GCN GPU (Bonaire) it seems it is always faster to use shared memory reductions and by a wider margin.
OpenCL is about performance and getting slower performance by using new functions practically makes them meaningless.
Thanks for the feedback! I guess you used the default global work size hard-coded in the source, right? For a thorough tests it would make sense to generate performance as a function of work size (I did see a slight dependence). However, as long as the implementation (and/or docs) does not improve, the current feedback should be enough incentive for AMD to take action.
Have you tried the subgroup reduction across half or quarter wave? That improved performance for me on Hawaii.
Both workgroup and NDrange sizes are hardcoded in the source (256 and 256x64x8=131072, respectively). In order to keep it simple I didn't apply any scaling on the NDRange, just a fixed one. At least this performance behavior is not intolerable on GCN 3rd gen devices.
I guess that partial reductions on wave are not valid. In paragraph "9.17.3.4" of the opencl-2.0-extensions manual by Khronos it states:
The OpenCL C programming language implements the following built-in functions that operate on a sub-group level. These built-in functions must be encountered by all work-items in a subgroup executing the kernel.
These functions include subgroup reductions.
This issue has not escaped our attention, but there is a great deal of competition for that attention. Also, we don't have quite the flexibility of implementation in the library as do OpenCL kernel developers. Finally, cross-lane support is not very strong prior to "Volcanic Islands" ASICs.
We are planning to improve our implementation in an upcoming release.
Thank you for your response.
The experimental results I've got are consistent with your statement on VI ASICs.
Any improvement on the efficiency of workgroup/subgroup functions is welcome.
> Finally, cross-lane support is not very strong prior to "Volcanic Islands" ASICs.
Could you elaborate what does that mean, e.g. in terms of bytes/clock shifted across SIMD lanes? I can't find much information except some slides that were floating around on the Internet (can't find them right now) that claimed improved lane shuffle in Fiji, I think.
> We are planning to improve our implementation in an upcoming release.
Improving workgroup reduction to make it at least match the local-memory-based implementations is in my opinion important if not deal-breaker to adoption.
However, in my opinion, you should strongly consider exposing ASAP cross-lane operations through a CL extension. This is a primitive that can allow elegant implementation of many algorithms/parallel primitives. Given how great has been the reception of "warp shuffle" introduced by NV 2 gen ago, I expect that your dev community would applaud and adopt the feature quickly.