cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

ekondis
Adept II

Poor workgroup reduction function performance (OpenCL 2.0)

I had previously reported this but here I provide a test case for the examination of workgroup reduction function. Kernels perform a workgroup reduction in 3 ways:

1) The classical one with shared memory (OpenCL 1.2)

2) Shared memory plus sub-group reduction function on the final stage

3) Workgroup reduction function (no shared memory at all)

I tested it on a R7-260X and the latter two kernels prove to be significantly slower than reduction in shared memory. The last one especially is more than 5 times slower than using pure shared memory. This fact eliminates the value of these new functions in OpenCL 2.0. AFAIK, GCN GPUs feature swizzle operations which would could potentially make workgroup functions quite efficient. This is not the case however.

In addition the CodeXL 1.6 static kernel analyser does not support OpenCL 2.0 kernels. Therefore, I cannot investigate the disassembled kernel code.

Code on Github: https://github.com/ekondis/cl2-reduce-bench

More details on blog: http://parallelplusplus.blogspot.gr/2014/12/workgroup-reduction-function-evaluation.html

0 Likes
16 Replies
dipak
Big Boss

Thanks for sharing the testcase. We'll check and get back to you.

Regards,

0 Likes
ekondis
Adept II


I usually don't trust any measurement taking less than 4-5 ms. On my system, those are normal fluctuations! I assume using profile mode takes care of that...


... I'm pretty sure CL profile mode has been invented to be used, I don't trust it. As most APIs have historically been easygoing on the profile data, I wouldn't take for granted the results are coherent...


besides I cannot exclude my app had a small overhead with profile on.



While the problem of reductions might be useful for some cases, I don't see value in a kernel doing only this, it's not relevant to me. In general, I've noticed using advanced functionality makes simple kernels slower (I have been told they take more time to "set up") while making complex kernels usually faster. This kernel is super simple, its only memory access is in writing out the result. There's no other workload and this severely hampers HW ability to do things optimally.


The results seem to be quite stable. No significant fluctuations between executions have been observed. Even if profiling contributes extra latency it should be mostly identical on all kernel executions. And it's a comparison between the three executions after all.

The kernel is deliberately simple as I want to expose the performance of the specific workgroup function and not to include other factors that might affect performance. The three kernels differ only in the reduction part since it is the focus of the comparison.

Just to mention that I've seen in a few known GPU benchmarks kernel executions that take just a couple of milliseconds (e.g. Rodinia).

0 Likes

When someone argues that your test methodology is wrong, the answer is to run you test millions of times, and show the distribution of the running times.

0 Likes

...or the one who argues could supply results (millions?) in order to support his argument. The truth is that I don't have time to prove the obvious.

0 Likes

One thousand times would certainly enough by far, I was kidding...It would only takes seconds...

0 Likes
ekondis
Adept II

Unfortunately, I still haven't got any reply on the issue.

However, I got a reminder note about saving temporary files with the "-save-temps" compilation option, so I was able to get the disassembly code of all 3 kernels.

The results are quite revealing of the computational cost each kernel. I'll focus on the last kernel which makes use of the workgroup reduction function. The kernel's code is quite simple:


__kernel void reductionWkgrp(__global uint *result) {


  const uint id = get_global_id(0);


  const uint lid = get_local_id(0);


  uint res = id;


  // workgroup reduction (introduced in OpenCL 2.0)


  res = work_group_reduce_add(res);


  // atomic reduce in global memory


  if(lid==0)


    atomic_add(result, res);


}


The generated disassembly code on a Bonaire GPU is enormous:


ShaderType = IL_SHADER_COMPUTE


TargetChip = t


; ------------- SC_SRCSHADER Dump ------------------


SC_SHADERSTATE: u32NumIntVSConst = 0


SC_SHADERSTATE: u32NumIntPSConst = 0


SC_SHADERSTATE: u32NumIntGSConst = 0


SC_SHADERSTATE: u32NumBoolVSConst = 0


SC_SHADERSTATE: u32NumBoolPSConst = 0


SC_SHADERSTATE: u32NumBoolGSConst = 0


SC_SHADERSTATE: u32NumFloatVSConst = 0


SC_SHADERSTATE: u32NumFloatPSConst = 0


SC_SHADERSTATE: u32NumFloatGSConst = 0


u32SCOptions[0] = 0x00680000 SCOption_IGNORE_SAMPLE_L_BUG SCOption_FLOAT_DO_NOT_DIST SCOption_FLOAT_DO_NOT_REASSOC


u32SCOptions[1] = 0x00000000


u32SCOptions[2] = 0x08000000 SCOption_R1000_READLANE_SMRD_WORKAROUND_BUG343479


u32SCOptions[3] = 0x00001284 SCOption_R1000_BARRIER_WORKAROUND_BUG405404 SCOption_R1000_IFORK_BACKJUMP_WORKAROUND_BUG368004 SCOption_R1000R1100_VCCZ_CLOBBER_WORKAROUND_BUG457939 SCOption_R1100_FLAT_SCRATCH_OOB_WORKAROUND_BUG379895


u32SCOptions[4] = 0x00000000


; -------- Disassembly --------------------


shader main


  asic(CI)


  type(CS)




  s_load_dwordx4  s[12:15], s[4:5], 0x01                // 000000000000: C0860501


  s_mov_b32     m0, 0x00010000                          // 000000000004: BEFC03FF 00010000


  s_waitcnt     lgkmcnt(0)                              // 00000000000C: BF8C007F


  s_bfe_u32     s11, s12, 0x00100000                    // 000000000010: 938BFF0C 00100000


  s_bfe_u32     s12, s12, 0x00100010                    // 000000000018: 938CFF0C 00100010


  s_bfe_u32     s13, s13, 0x00100000                    // 000000000020: 938DFF0D 00100000


  v_mul_lo_u32  v2, v2, s12                             // 000000000028: D2D20002 00001902


  v_add_i32     v1, vcc, v2, v1                         // 000000000030: 4A020302


  v_mul_lo_u32  v1, v1, s11                             // 000000000034: D2D20001 00001701


  v_add_i32     v1, vcc, v1, v0                         // 00000000003C: 4A020101


  s_cmp_eq_u32  0, s11                                  // 000000000040: BF060B80


  s_cselect_b32  s0, -1, 0                              // 000000000044: 850080C1


  s_cmp_eq_u32  0, s11                                  // 000000000048: BF060B80


  s_cbranch_scc1  label_0030                            // 00000000004C: BF85001C


  s_cmp_gt_u32  s11, s14                                // 000000000050: BF080E0B


  s_cbranch_scc1  label_0030                            // 000000000054: BF85001A


  s_flbit_i32_b32  s1, s11                              // 000000000058: BE81150B


  s_flbit_i32_b32  s2, s14                              // 00000000005C: BE82150E


  s_sub_i32     s20, s1, s2                             // 000000000060: 81940201


  s_lshl_b32    s1, s11, s20                            // 000000000064: 8F01140B


  s_sub_u32     s17, s14, s1                            // 000000000068: 8091010E


  s_cselect_b32  s21, s14, s17                          // 00000000006C: 8515110E


  s_cselect_b32  s18, 0, 1                              // 000000000070: 85128180


  s_add_u32     s3, s0, s18                             // 000000000074: 80031200


  s_cmp_eq_u32  0, s20                                  // 000000000078: BF061480


  s_cbranch_scc0  label_0022                            // 00000000007C: BF840002


  s_mov_b32     s0, s3                                  // 000000000080: BE800303


  s_branch      label_0030                              // 000000000084: BF82000E


label_0022:


  s_mov_b32     s2, s3                                  // 000000000088: BE820303


label_0023:


  s_add_u32     s2, s2, s2                              // 00000000008C: 80020202


  s_lshr_b32    s1, s1, 1                               // 000000000090: 90018101


  s_sub_u32     s18, s21, s1                            // 000000000094: 80920115


  s_cselect_b32  s17, s21, s18                          // 000000000098: 85111215


  s_cselect_b32  s19, 0, 1                              // 00000000009C: 85138180


  s_add_u32     s16, s2, s19                            // 0000000000A0: 80101302


  s_addk_i32    s20, 0xffff                             // 0000000000A4: B794FFFF


  s_cmp_le_u32  s20, 0                                  // 0000000000A8: BF0B8014


  s_cbranch_scc1  label_002F                            // 0000000000AC: BF850003


  s_mov_b32     s2, s16                                 // 0000000000B0: BE820310


  s_mov_b32     s21, s17                                // 0000000000B4: BE950311


  s_branch      label_0023                              // 0000000000B8: BF82FFF4


label_002F:


  s_mov_b32     s0, s16                                 // 0000000000BC: BE800310


label_0030:


  s_mul_i32     s1, s0, s11                             // 0000000000C0: 93010B00


  s_sub_i32     s1, s14, s1                             // 0000000000C4: 8181010E


  s_cmp_eq_u32  0, s12                                  // 0000000000C8: BF060C80


  s_cselect_b32  s2, -1, 0                              // 0000000000CC: 850280C1


  s_cmp_eq_u32  0, s12                                  // 0000000000D0: BF060C80


  s_cbranch_scc1  label_0052                            // 0000000000D4: BF85001C


  s_cmp_gt_u32  s12, s15                                // 0000000000D8: BF080F0C


  s_cbranch_scc1  label_0052                            // 0000000000DC: BF85001A


  s_flbit_i32_b32  s3, s12                              // 0000000000E0: BE83150C


  s_flbit_i32_b32  s14, s15                             // 0000000000E4: BE8E150F


  s_sub_i32     s21, s3, s14                            // 0000000000E8: 81950E03


  s_lshl_b32    s3, s12, s21                            // 0000000000EC: 8F03150C


  s_sub_u32     s18, s15, s3                            // 0000000000F0: 8092030F


  s_cselect_b32  s22, s15, s18                          // 0000000000F4: 8516120F


  s_cselect_b32  s19, 0, 1                              // 0000000000F8: 85138180


  s_add_u32     s16, s2, s19                            // 0000000000FC: 80101302


  s_cmp_eq_u32  0, s21                                  // 000000000100: BF061580


  s_cbranch_scc0  label_0044                            // 000000000104: BF840002


  s_mov_b32     s2, s16                                 // 000000000108: BE820310


  s_branch      label_0052                              // 00000000010C: BF82000E


label_0044:


  s_mov_b32     s14, s16                                // 000000000110: BE8E0310


label_0045:


  s_add_u32     s14, s14, s14                           // 000000000114: 800E0E0E


  s_lshr_b32    s3, s3, 1                               // 000000000118: 90038103


  s_sub_u32     s19, s22, s3                            // 00000000011C: 80930316


  s_cselect_b32  s18, s22, s19                          // 000000000120: 85121316


  s_cselect_b32  s20, 0, 1                              // 000000000124: 85148180


  s_add_u32     s17, s14, s20                           // 000000000128: 8011140E


  s_addk_i32    s21, 0xffff                             // 00000000012C: B795FFFF


  s_cmp_le_u32  s21, 0                                  // 000000000130: BF0B8015


  s_cbranch_scc1  label_0051                            // 000000000134: BF850003


  s_mov_b32     s14, s17                                // 000000000138: BE8E0311


  s_mov_b32     s22, s18                                // 00000000013C: BE960312


  s_branch      label_0045                              // 000000000140: BF82FFF4


label_0051:


  s_mov_b32     s2, s17                                 // 000000000144: BE820311


label_0052:


  s_mul_i32     s3, s2, s12                             // 000000000148: 93030C02


  s_sub_i32     s3, s15, s3                             // 00000000014C: 8183030F


  s_cmp_eq_u32  0, s13                                  // 000000000150: BF060D80


  s_cselect_b32  s17, -1, 0                             // 000000000154: 851180C1


  s_load_dword  s4, s[4:5], 0x05                        // 000000000158: C0020505


  s_cmp_eq_u32  0, s13                                  // 00000000015C: BF060D80


  s_cbranch_scc1  label_0075                            // 000000000160: BF85001C


  s_waitcnt     lgkmcnt(0)                              // 000000000164: BF8C007F


  s_cmp_gt_u32  s13, s4                                 // 000000000168: BF08040D


  s_cbranch_scc1  label_0075                            // 00000000016C: BF850019


  s_flbit_i32_b32  s5, s13                              // 000000000170: BE85150D


  s_flbit_i32_b32  s15, s4                              // 000000000174: BE8F1504


  s_sub_i32     s5, s5, s15                             // 000000000178: 81850F05


  s_lshl_b32    s21, s13, s5                            // 00000000017C: 8F15050D


  s_sub_u32     s18, s4, s21                            // 000000000180: 80921504


  s_cselect_b32  s22, s4, s18                           // 000000000184: 85161204


  s_cselect_b32  s19, 0, 1                              // 000000000188: 85138180


  s_add_u32     s16, s17, s19                           // 00000000018C: 80101311


  s_cmp_eq_u32  0, s5                                   // 000000000190: BF060580


  s_cbranch_scc0  label_0068                            // 000000000194: BF840002


  s_mov_b32     s17, s16                                // 000000000198: BE910310


  s_branch      label_0075                              // 00000000019C: BF82000D


label_0068:


  s_mov_b32     s15, s16                                // 0000000001A0: BE8F0310


label_0069:


  s_add_u32     s15, s15, s15                           // 0000000001A4: 800F0F0F


  s_lshr_b32    s21, s21, 1                             // 0000000001A8: 90158115


  s_sub_u32     s19, s22, s21                           // 0000000001AC: 80931516


  s_cselect_b32  s18, s22, s19                          // 0000000001B0: 85121316


  s_cselect_b32  s20, 0, 1                              // 0000000001B4: 85148180


  s_add_u32     s17, s15, s20                           // 0000000001B8: 8011140F


  s_addk_i32    s5, 0xffff                              // 0000000001BC: B785FFFF


  s_cmp_le_u32  s5, 0                                   // 0000000001C0: BF0B8005


  s_cbranch_scc1  label_0075                            // 0000000001C4: BF850003


  s_mov_b32     s15, s17                                // 0000000001C8: BE8F0311


  s_mov_b32     s22, s18                                // 0000000001CC: BE960312


  s_branch      label_0069                              // 0000000001D0: BF82FFF4


label_0075:


  s_mul_i32     s14, s17, s13                           // 0000000001D4: 930E0D11


  s_waitcnt     lgkmcnt(0)                              // 0000000001D8: BF8C007F


  s_sub_i32     s4, s4, s14                             // 0000000001DC: 81840E04


  s_cmp_lt_u32  s8, s0                                  // 0000000001E0: BF0A0008


  s_cselect_b32  s0, s11, s1                            // 0000000001E4: 8500010B


  s_cmp_lt_u32  s9, s2                                  // 0000000001E8: BF0A0209


  s_cselect_b32  s1, s12, s3                            // 0000000001EC: 8501030C


  s_cmp_lt_u32  s10, s17                                // 0000000001F0: BF0A110A


  s_cselect_b32  s2, s13, s4                            // 0000000001F4: 8502040D


  s_mul_i32     s3, s11, s8                             // 0000000001F8: 9303080B


  s_load_dwordx2  s[4:5], s[6:7], 0x00                  // 0000000001FC: C0420700


  v_cmp_ne_u32  s[8:9], 1, 0                            // 000000000200: D18A0008 00010081


  s_mul_i32     s0, s1, s0                              // 000000000208: 93000001


  s_mul_i32     s0, s0, s2                              // 00000000020C: 93000200


  s_waitcnt     lgkmcnt(0)                              // 000000000210: BF8C007F


  s_add_u32     s1, s3, s4                              // 000000000214: 80010403


  s_add_u32     s0, s0, 63                              // 000000000218: 8000BF00


  s_bcnt1_i32_b64  s2, s[8:9]                           // 00000000021C: BE821008


  v_add_i32     v2, vcc, s1, v0                         // 000000000220: 4A040001


  s_lshr_b32    s0, s0, 6                               // 000000000224: 90008600


  v_mbcnt_lo_u32_b32  v3, -1, 0                         // 000000000228: D2460003 000100C1


  v_mbcnt_hi_u32_b32  v3, -1, v3 vop3                   // 000000000230: D2480003 000206C1


  v_bcnt_u32_b32  v4, s2, 0                             // 000000000238: D2440004 00010002


  v_mov_b32     v5, 1                                   // 000000000240: 7E0A0281


  v_cmp_eq_i32  vcc, v4, v5                             // 000000000244: 7D040B04


  s_cbranch_vccz  label_011E                            // 000000000248: BF86008B


  v_xor_b32     v5, 1, v3                               // 00000000024C: 3A0A0681


  v_lshl_b64    v[6:7], 1, v5                           // 000000000250: D2C20006 00020A81


  v_and_b32     v6, exec_lo, v6                         // 000000000258: 360C0C7E


  v_and_b32     v7, exec_hi, v7                         // 00000000025C: 360E0E7F


  v_cmp_ne_u64  s[4:5], 0, v[6:7]                       // 000000000260: D1CA0004 00020C80


  s_mov_b64     s[8:9], exec                            // 000000000268: BE88047E


  s_mov_b64     s[10:11], exec                          // 00000000026C: BE8A047E


label_009C:


  v_readfirstlane_b32  s1, v5                           // 000000000270: 7E020505


  v_cmp_eq_u32  vcc, s1, v5                             // 000000000274: 7D840A01


  s_and_saveexec_b64  s[12:13], vcc                     // 000000000278: BE8C246A


  v_readlane_b32  s1, v2, s1                            // 00000000027C: 02020302


  s_cbranch_execz  label_00A4                           // 000000000280: BF880003


  v_mov_b32     v5, s1                                  // 000000000284: 7E0A0201


  s_andn2_b64   s[10:11], s[10:11], exec                // 000000000288: 8A8A7E0A


  s_cbranch_scc0  label_00A6                            // 00000000028C: BF840002


label_00A4:


  s_and_b64     exec, s[12:13], s[10:11]                // 000000000290: 87FE0A0C


  s_branch      label_009C                              // 000000000294: BF82FFF6


label_00A6:


  s_mov_b64     exec, s[8:9]                            // 000000000298: BEFE0408


  v_cndmask_b32  v5, 0, v5, s[4:5]                      // 00000000029C: D2000005 00120A80


  v_add_i32     v2, vcc, v5, v2                         // 0000000002A4: 4A040505


  v_xor_b32     v5, 2, v3                               // 0000000002A8: 3A0A0682


  v_lshl_b64    v[6:7], 1, v5                           // 0000000002AC: D2C20006 00020A81


  v_and_b32     v6, exec_lo, v6                         // 0000000002B4: 360C0C7E


  v_and_b32     v7, exec_hi, v7                         // 0000000002B8: 360E0E7F


  v_cmp_ne_u64  s[4:5], 0, v[6:7]                       // 0000000002BC: D1CA0004 00020C80


  s_mov_b64     s[8:9], exec                            // 0000000002C4: BE88047E


  s_mov_b64     s[10:11], exec                          // 0000000002C8: BE8A047E


label_00B3:


  v_readfirstlane_b32  s1, v5                           // 0000000002CC: 7E020505


  v_cmp_eq_u32  vcc, s1, v5                             // 0000000002D0: 7D840A01


  s_and_saveexec_b64  s[12:13], vcc                     // 0000000002D4: BE8C246A


  v_readlane_b32  s1, v2, s1                            // 0000000002D8: 02020302


  s_cbranch_execz  label_00BB                           // 0000000002DC: BF880003


  v_mov_b32     v5, s1                                  // 0000000002E0: 7E0A0201


  s_andn2_b64   s[10:11], s[10:11], exec                // 0000000002E4: 8A8A7E0A


  s_cbranch_scc0  label_00BD                            // 0000000002E8: BF840002


label_00BB:


  s_and_b64     exec, s[12:13], s[10:11]                // 0000000002EC: 87FE0A0C


  s_branch      label_00B3                              // 0000000002F0: BF82FFF6


label_00BD:


  s_mov_b64     exec, s[8:9]                            // 0000000002F4: BEFE0408


  v_cndmask_b32  v5, 0, v5, s[4:5]                      // 0000000002F8: D2000005 00120A80


  v_add_i32     v2, vcc, v5, v2                         // 000000000300: 4A040505


  v_xor_b32     v5, 4, v3                               // 000000000304: 3A0A0684


  v_lshl_b64    v[6:7], 1, v5                           // 000000000308: D2C20006 00020A81


  v_and_b32     v6, exec_lo, v6                         // 000000000310: 360C0C7E


  v_and_b32     v7, exec_hi, v7                         // 000000000314: 360E0E7F


  v_cmp_ne_u64  s[4:5], 0, v[6:7]                       // 000000000318: D1CA0004 00020C80


  s_mov_b64     s[8:9], exec                            // 000000000320: BE88047E


  s_mov_b64     s[10:11], exec                          // 000000000324: BE8A047E


label_00CA:


  v_readfirstlane_b32  s1, v5                           // 000000000328: 7E020505


  v_cmp_eq_u32  vcc, s1, v5                             // 00000000032C: 7D840A01


  s_and_saveexec_b64  s[12:13], vcc                     // 000000000330: BE8C246A


  v_readlane_b32  s1, v2, s1                            // 000000000334: 02020302


  s_cbranch_execz  label_00D2                           // 000000000338: BF880003


  v_mov_b32     v5, s1                                  // 00000000033C: 7E0A0201


  s_andn2_b64   s[10:11], s[10:11], exec                // 000000000340: 8A8A7E0A


  s_cbranch_scc0  label_00D4                            // 000000000344: BF840002


label_00D2:


  s_and_b64     exec, s[12:13], s[10:11]                // 000000000348: 87FE0A0C


  s_branch      label_00CA                              // 00000000034C: BF82FFF6


label_00D4:


  s_mov_b64     exec, s[8:9]                            // 000000000350: BEFE0408


  v_cndmask_b32  v5, 0, v5, s[4:5]                      // 000000000354: D2000005 00120A80


  v_add_i32     v2, vcc, v5, v2                         // 00000000035C: 4A040505


  v_xor_b32     v5, 32, v3                              // 000000000360: 3A0A06A0


  v_xor_b32     v6, 16, v3                              // 000000000364: 3A0C0690


  v_xor_b32     v7, 8, v3                               // 000000000368: 3A0E0688


  v_lshl_b64    v[8:9], 1, v7                           // 00000000036C: D2C20008 00020E81


  v_and_b32     v8, exec_lo, v8                         // 000000000374: 3610107E


  v_and_b32     v9, exec_hi, v9                         // 000000000378: 3612127F


  v_cmp_ne_u64  s[4:5], 0, v[8:9]                       // 00000000037C: D1CA0004 00021080


  s_mov_b64     s[8:9], exec                            // 000000000384: BE88047E


  s_mov_b64     s[10:11], exec                          // 000000000388: BE8A047E


label_00E3:


  v_readfirstlane_b32  s1, v7                           // 00000000038C: 7E020507


  v_cmp_eq_u32  vcc, s1, v7                             // 000000000390: 7D840E01


  s_and_saveexec_b64  s[12:13], vcc                     // 000000000394: BE8C246A


  v_readlane_b32  s1, v2, s1                            // 000000000398: 02020302


  s_cbranch_execz  label_00EB                           // 00000000039C: BF880003


  v_mov_b32     v7, s1                                  // 0000000003A0: 7E0E0201


  s_andn2_b64   s[10:11], s[10:11], exec                // 0000000003A4: 8A8A7E0A


  s_cbranch_scc0  label_00ED                            // 0000000003A8: BF840002


label_00EB:


  s_and_b64     exec, s[12:13], s[10:11]                // 0000000003AC: 87FE0A0C


  s_branch      label_00E3                              // 0000000003B0: BF82FFF6


label_00ED:


  s_mov_b64     exec, s[8:9]                            // 0000000003B4: BEFE0408


  v_cndmask_b32  v7, 0, v7, s[4:5]                      // 0000000003B8: D2000007 00120E80


  v_add_i32     v2, vcc, v7, v2                         // 0000000003C0: 4A040507


  v_lshl_b64    v[7:8], 1, v6                           // 0000000003C4: D2C20007 00020C81


  v_and_b32     v7, exec_lo, v7                         // 0000000003CC: 360E0E7E


  v_and_b32     v8, exec_hi, v8                         // 0000000003D0: 3610107F


  v_cmp_ne_u64  s[4:5], 0, v[7:8]                       // 0000000003D4: D1CA0004 00020E80


  s_mov_b64     s[8:9], exec                            // 0000000003DC: BE88047E


  s_mov_b64     s[10:11], exec                          // 0000000003E0: BE8A047E


label_00F9:


  v_readfirstlane_b32  s1, v6                           // 0000000003E4: 7E020506


  v_cmp_eq_u32  vcc, s1, v6                             // 0000000003E8: 7D840C01


  s_and_saveexec_b64  s[12:13], vcc                     // 0000000003EC: BE8C246A


  v_readlane_b32  s1, v2, s1                            // 0000000003F0: 02020302


  s_cbranch_execz  label_0101                           // 0000000003F4: BF880003


  v_mov_b32     v6, s1                                  // 0000000003F8: 7E0C0201


  s_andn2_b64   s[10:11], s[10:11], exec                // 0000000003FC: 8A8A7E0A


  s_cbranch_scc0  label_0103                            // 000000000400: BF840002


label_0101:


  s_and_b64     exec, s[12:13], s[10:11]                // 000000000404: 87FE0A0C


  s_branch      label_00F9                              // 000000000408: BF82FFF6


label_0103:


  s_mov_b64     exec, s[8:9]                            // 00000000040C: BEFE0408


  v_cndmask_b32  v6, 0, v6, s[4:5]                      // 000000000410: D2000006 00120C80


  v_add_i32     v2, vcc, v6, v2                         // 000000000418: 4A040506


  v_lshl_b64    v[6:7], 1, v5                           // 00000000041C: D2C20006 00020A81


  v_and_b32     v6, exec_lo, v6                         // 000000000424: 360C0C7E


  v_and_b32     v7, exec_hi, v7                         // 000000000428: 360E0E7F


  v_cmp_ne_u64  s[4:5], 0, v[6:7]                       // 00000000042C: D1CA0004 00020C80


  s_mov_b64     s[8:9], exec                            // 000000000434: BE88047E


  s_mov_b64     s[10:11], exec                          // 000000000438: BE8A047E


label_010F:


  v_readfirstlane_b32  s1, v5                           // 00000000043C: 7E020505


  v_cmp_eq_u32  vcc, s1, v5                             // 000000000440: 7D840A01


  s_and_saveexec_b64  s[12:13], vcc                     // 000000000444: BE8C246A


  v_readlane_b32  s1, v2, s1                            // 000000000448: 02020302


  s_cbranch_execz  label_0117                           // 00000000044C: BF880003


  v_mov_b32     v5, s1                                  // 000000000450: 7E0A0201


  s_andn2_b64   s[10:11], s[10:11], exec                // 000000000454: 8A8A7E0A


  s_cbranch_scc0  label_0119                            // 000000000458: BF840002


label_0117:


  s_and_b64     exec, s[12:13], s[10:11]                // 00000000045C: 87FE0A0C


  s_branch      label_010F                              // 000000000460: BF82FFF6


label_0119:


  s_mov_b64     exec, s[8:9]                            // 000000000464: BEFE0408


  v_cndmask_b32  v5, 0, v5, s[4:5]                      // 000000000468: D2000005 00120A80


  v_add_i32     v7, vcc, v5, v2                         // 000000000470: 4A0E0505


  s_branch      label_01C4                              // 000000000474: BF8200A6


label_011E:


  v_cmp_eq_i32  s[4:5], v3, 0                           // 000000000478: D1040004 00010103


  v_add_i32     v5, vcc, 63, v3                         // 000000000480: 4A0A06BF


  v_and_b32     v5, 63, v5                              // 000000000484: 360A0ABF


  v_lshl_b64    v[6:7], 1, v5                           // 000000000488: D2C20006 00020A81


  v_and_b32     v6, exec_lo, v6                         // 000000000490: 360C0C7E


  v_and_b32     v7, exec_hi, v7                         // 000000000494: 360E0E7F


  v_cmp_ne_u64  vcc, 0, v[6:7]                          // 000000000498: 7DCA0C80


  s_orn2_b64    s[4:5], s[4:5], vcc                     // 00000000049C: 8B846A04


  s_mov_b64     s[8:9], exec                            // 0000000004A0: BE88047E


  s_mov_b64     s[10:11], exec                          // 0000000004A4: BE8A047E


label_012A:


  v_readfirstlane_b32  s1, v5                           // 0000000004A8: 7E020505


  v_cmp_eq_u32  vcc, s1, v5                             // 0000000004AC: 7D840A01


  s_and_saveexec_b64  s[12:13], vcc                     // 0000000004B0: BE8C246A


  v_readlane_b32  s1, v2, s1                            // 0000000004B4: 02020302


  s_cbranch_execz  label_0132                           // 0000000004B8: BF880003


  v_mov_b32     v5, s1                                  // 0000000004BC: 7E0A0201


  s_andn2_b64   s[10:11], s[10:11], exec                // 0000000004C0: 8A8A7E0A


  s_cbranch_scc0  label_0134                            // 0000000004C4: BF840002


label_0132:


  s_and_b64     exec, s[12:13], s[10:11]                // 0000000004C8: 87FE0A0C


  s_branch      label_012A                              // 0000000004CC: BF82FFF6


label_0134:


  s_mov_b64     exec, s[8:9]                            // 0000000004D0: BEFE0408


  v_cndmask_b32  v5, v5, 0, s[4:5]                      // 0000000004D4: D2000005 00110105


  v_add_i32     v2, vcc, v5, v2                         // 0000000004DC: 4A040505


  v_cmp_lt_u32  s[4:5], v3, 2                           // 0000000004E0: D1820004 00010503


  v_add_i32     v5, vcc, 62, v3                         // 0000000004E8: 4A0A06BE


  v_and_b32     v5, 63, v5                              // 0000000004EC: 360A0ABF


  v_lshl_b64    v[6:7], 1, v5                           // 0000000004F0: D2C20006 00020A81


  v_and_b32     v6, exec_lo, v6                         // 0000000004F8: 360C0C7E


  v_and_b32     v7, exec_hi, v7                         // 0000000004FC: 360E0E7F


  v_cmp_ne_u64  vcc, 0, v[6:7]                          // 000000000500: 7DCA0C80


  s_orn2_b64    s[4:5], s[4:5], vcc                     // 000000000504: 8B846A04


  s_mov_b64     s[8:9], exec                            // 000000000508: BE88047E


  s_mov_b64     s[10:11], exec                          // 00000000050C: BE8A047E


label_0144:


  v_readfirstlane_b32  s1, v5                           // 000000000510: 7E020505


  v_cmp_eq_u32  vcc, s1, v5                             // 000000000514: 7D840A01


  s_and_saveexec_b64  s[12:13], vcc                     // 000000000518: BE8C246A


  v_readlane_b32  s1, v2, s1                            // 00000000051C: 02020302


  s_cbranch_execz  label_014C                           // 000000000520: BF880003


  v_mov_b32     v5, s1                                  // 000000000524: 7E0A0201


  s_andn2_b64   s[10:11], s[10:11], exec                // 000000000528: 8A8A7E0A


  s_cbranch_scc0  label_014E                            // 00000000052C: BF840002


label_014C:


  s_and_b64     exec, s[12:13], s[10:11]                // 000000000530: 87FE0A0C


  s_branch      label_0144                              // 000000000534: BF82FFF6


label_014E:


  s_mov_b64     exec, s[8:9]                            // 000000000538: BEFE0408


  v_cndmask_b32  v5, v5, 0, s[4:5]                      // 00000000053C: D2000005 00110105


  v_add_i32     v2, vcc, v5, v2                         // 000000000544: 4A040505


  v_cmp_lt_u32  s[4:5], v3, 4                           // 000000000548: D1820004 00010903


  v_add_i32     v5, vcc, 60, v3                         // 000000000550: 4A0A06BC


  v_and_b32     v5, 63, v5                              // 000000000554: 360A0ABF


  v_lshl_b64    v[6:7], 1, v5                           // 000000000558: D2C20006 00020A81


  v_and_b32     v6, exec_lo, v6                         // 000000000560: 360C0C7E


  v_and_b32     v7, exec_hi, v7                         // 000000000564: 360E0E7F


  v_cmp_ne_u64  vcc, 0, v[6:7]                          // 000000000568: 7DCA0C80


  s_orn2_b64    s[4:5], s[4:5], vcc                     // 00000000056C: 8B846A04


  s_mov_b64     s[8:9], exec                            // 000000000570: BE88047E


  s_mov_b64     s[10:11], exec                          // 000000000574: BE8A047E


label_015E:


  v_readfirstlane_b32  s1, v5                           // 000000000578: 7E020505


  v_cmp_eq_u32  vcc, s1, v5                             // 00000000057C: 7D840A01


  s_and_saveexec_b64  s[12:13], vcc                     // 000000000580: BE8C246A


  v_readlane_b32  s1, v2, s1                            // 000000000584: 02020302


  s_cbranch_execz  label_0166                           // 000000000588: BF880003


  v_mov_b32     v5, s1                                  // 00000000058C: 7E0A0201


  s_andn2_b64   s[10:11], s[10:11], exec                // 000000000590: 8A8A7E0A


  s_cbranch_scc0  label_0168                            // 000000000594: BF840002


label_0166:


  s_and_b64     exec, s[12:13], s[10:11]                // 000000000598: 87FE0A0C


  s_branch      label_015E                              // 00000000059C: BF82FFF6


label_0168:


  s_mov_b64     exec, s[8:9]                            // 0000000005A0: BEFE0408


  v_cndmask_b32  v5, v5, 0, s[4:5]                      // 0000000005A4: D2000005 00110105


  v_add_i32     v6, vcc, 56, v3                         // 0000000005AC: 4A0C06B8


  v_add_i32     v2, vcc, v5, v2                         // 0000000005B0: 4A040505


  v_cmp_lt_u32  s[4:5], v3, 8                           // 0000000005B4: D1820004 00011103


  v_and_b32     v5, 63, v6                              // 0000000005BC: 360A0CBF


  v_lshl_b64    v[6:7], 1, v5                           // 0000000005C0: D2C20006 00020A81


  v_and_b32     v6, exec_lo, v6                         // 0000000005C8: 360C0C7E


  v_and_b32     v7, exec_hi, v7                         // 0000000005CC: 360E0E7F


  v_cmp_ne_u64  vcc, 0, v[6:7]                          // 0000000005D0: 7DCA0C80


  s_orn2_b64    s[4:5], s[4:5], vcc                     // 0000000005D4: 8B846A04


  s_mov_b64     s[8:9], exec                            // 0000000005D8: BE88047E


  s_mov_b64     s[10:11], exec                          // 0000000005DC: BE8A047E


label_0178:


  v_readfirstlane_b32  s1, v5                           // 0000000005E0: 7E020505


  v_cmp_eq_u32  vcc, s1, v5                             // 0000000005E4: 7D840A01


  s_and_saveexec_b64  s[12:13], vcc                     // 0000000005E8: BE8C246A


  v_readlane_b32  s1, v2, s1                            // 0000000005EC: 02020302


  s_cbranch_execz  label_0180                           // 0000000005F0: BF880003


  v_mov_b32     v5, s1                                  // 0000000005F4: 7E0A0201


  s_andn2_b64   s[10:11], s[10:11], exec                // 0000000005F8: 8A8A7E0A


  s_cbranch_scc0  label_0182                            // 0000000005FC: BF840002


label_0180:


  s_and_b64     exec, s[12:13], s[10:11]                // 000000000600: 87FE0A0C


  s_branch      label_0178                              // 000000000604: BF82FFF6


label_0182:


  s_mov_b64     exec, s[8:9]                            // 000000000608: BEFE0408


  v_cndmask_b32  v5, v5, 0, s[4:5]                      // 00000000060C: D2000005 00110105


  v_add_i32     v6, vcc, 48, v3                         // 000000000614: 4A0C06B0


  v_and_b32     v6, 63, v6                              // 000000000618: 360C0CBF


  v_add_i32     v2, vcc, v5, v2                         // 00000000061C: 4A040505


  v_cmp_lt_u32  s[4:5], v3, 16                          // 000000000620: D1820004 00012103


  v_lshl_b64    v[7:8], 1, v6                           // 000000000628: D2C20007 00020C81


  v_and_b32     v7, exec_lo, v7                         // 000000000630: 360E0E7E


  v_and_b32     v8, exec_hi, v8                         // 000000000634: 3610107F


  v_cmp_ne_u64  vcc, 0, v[7:8]                          // 000000000638: 7DCA0E80


  s_orn2_b64    s[4:5], s[4:5], vcc                     // 00000000063C: 8B846A04


  s_mov_b64     s[8:9], exec                            // 000000000640: BE88047E


  s_mov_b64     s[10:11], exec                          // 000000000644: BE8A047E


label_0192:


  v_readfirstlane_b32  s1, v6                           // 000000000648: 7E020506


  v_cmp_eq_u32  vcc, s1, v6                             // 00000000064C: 7D840C01


  s_and_saveexec_b64  s[12:13], vcc                     // 000000000650: BE8C246A


  v_readlane_b32  s1, v2, s1                            // 000000000654: 02020302


  s_cbranch_execz  label_019A                           // 000000000658: BF880003


  v_mov_b32     v5, s1                                  // 00000000065C: 7E0A0201


  s_andn2_b64   s[10:11], s[10:11], exec                // 000000000660: 8A8A7E0A


  s_cbranch_scc0  label_019C                            // 000000000664: BF840002


label_019A:


  s_and_b64     exec, s[12:13], s[10:11]                // 000000000668: 87FE0A0C


  s_branch      label_0192                              // 00000000066C: BF82FFF6


label_019C:


  s_mov_b64     exec, s[8:9]                            // 000000000670: BEFE0408


  v_cndmask_b32  v5, v5, 0, s[4:5]                      // 000000000674: D2000005 00110105


  v_add_i32     v6, vcc, 32, v3                         // 00000000067C: 4A0C06A0


  v_and_b32     v6, 63, v6                              // 000000000680: 360C0CBF


  v_add_i32     v2, vcc, v5, v2                         // 000000000684: 4A040505


  v_cmp_lt_u32  s[4:5], v3, 32                          // 000000000688: D1820004 00014103


  v_lshl_b64    v[7:8], 1, v6                           // 000000000690: D2C20007 00020C81


  v_and_b32     v7, exec_lo, v7                         // 000000000698: 360E0E7E


  v_and_b32     v8, exec_hi, v8                         // 00000000069C: 3610107F


  v_cmp_ne_u64  vcc, 0, v[7:8]                          // 0000000006A0: 7DCA0E80


  s_orn2_b64    s[4:5], s[4:5], vcc                     // 0000000006A4: 8B846A04


  s_mov_b64     s[8:9], exec                            // 0000000006A8: BE88047E


  s_mov_b64     s[10:11], exec                          // 0000000006AC: BE8A047E


label_01AC:


  v_readfirstlane_b32  s1, v6                           // 0000000006B0: 7E020506


  v_cmp_eq_u32  vcc, s1, v6                             // 0000000006B4: 7D840C01


  s_and_saveexec_b64  s[12:13], vcc                     // 0000000006B8: BE8C246A


  v_readlane_b32  s1, v2, s1                            // 0000000006BC: 02020302


  s_cbranch_execz  label_01B4                           // 0000000006C0: BF880003


  v_mov_b32     v5, s1                                  // 0000000006C4: 7E0A0201


  s_andn2_b64   s[10:11], s[10:11], exec                // 0000000006C8: 8A8A7E0A


  s_cbranch_scc0  label_01B6                            // 0000000006CC: BF840002


label_01B4:


  s_and_b64     exec, s[12:13], s[10:11]                // 0000000006D0: 87FE0A0C


  s_branch      label_01AC                              // 0000000006D4: BF82FFF6


label_01B6:


  s_mov_b64     exec, s[8:9]                            // 0000000006D8: BEFE0408


  v_cndmask_b32  v5, v5, 0, s[4:5]                      // 0000000006DC: D2000005 00110105


  s_add_u32     s1, s2, -1                              // 0000000006E4: 8001C102


  v_add_i32     v2, vcc, v5, v2                         // 0000000006E8: 4A040505


  v_mov_b32     v5, s1                                  // 0000000006EC: 7E0A0201


  v_lshl_b64    v[5:6], 1, v5                           // 0000000006F0: D2C20005 00020A81


  v_and_b32     v5, exec_lo, v5                         // 0000000006F8: 360A0A7E


  v_and_b32     v6, exec_hi, v6                         // 0000000006FC: 360C0C7F


  v_cmp_ne_u64  vcc, 0, v[5:6]                          // 000000000700: 7DCA0A80


  v_readlane_b32  s1, v2, s1                            // 000000000704: 02020302


  v_mov_b32     v2, s1                                  // 000000000708: 7E040201


  v_cndmask_b32  v7, 0, v2, vcc                         // 00000000070C: 000E0480


label_01C4:


  s_cmp_eq_i32  s0, 1                                   // 000000000710: BF008100


  s_cbranch_scc1  label_0320                            // 000000000714: BF85015A


  v_lshrrev_b32  v5, 6, v1                              // 000000000718: 2C0A0286


  v_and_b32     v1, 63, v1                              // 00000000071C: 360202BF


  v_cmp_ne_i32  s[4:5], v1, 0                           // 000000000720: D10A0004 00010101


  s_mov_b64     s[8:9], exec                            // 000000000728: BE88047E


  s_andn2_b64   exec, s[8:9], s[4:5]                    // 00000000072C: 8AFE0408


  v_lshlrev_b32  v6, 2, v5                              // 000000000730: 340C0A82


  ds_write_b32  v6, v7                                  // 000000000734: D8340000 00000706


  s_mov_b64     exec, s[8:9]                            // 00000000073C: BEFE0408


  s_waitcnt     lgkmcnt(0)                              // 000000000740: BF8C007F


  s_barrier                                             // 000000000744: BF8A0000


  v_cmp_eq_i32  vcc, 0, v5                              // 000000000748: 7D040A80


  s_and_b64     exec, s[8:9], vcc                       // 00000000074C: 87FE6A08


  v_cmp_gt_u32  vcc, s0, v1                             // 000000000750: 7D880200


  s_cbranch_execz  label_0318                           // 000000000754: BF880142


  s_and_saveexec_b64  s[0:1], vcc                       // 000000000758: BE80246A


  v_lshlrev_b32  v1, 2, v1                              // 00000000075C: 34020282


  ds_read_b32   v1, v1                                  // 000000000760: D8D80000 01000001


  s_andn2_b64   exec, s[0:1], exec                      // 000000000768: 8AFE7E00


  v_mov_b32     v1, 0                                   // 00000000076C: 7E020280


  s_mov_b64     exec, s[0:1]                            // 000000000770: BEFE0400


  v_mov_b32     v2, 1                                   // 000000000774: 7E040281


  v_cmp_eq_i32  vcc, v4, v2                             // 000000000778: 7D040504


  s_cbranch_vccz  label_026C                            // 00000000077C: BF86008C


  v_xor_b32     v2, 1, v3                               // 000000000780: 3A040681


  v_lshl_b64    v[4:5], 1, v2                           // 000000000784: D2C20004 00020481


  v_and_b32     v4, exec_lo, v4                         // 00000000078C: 3608087E


  v_and_b32     v5, exec_hi, v5                         // 000000000790: 360A0A7F


  v_cmp_ne_u64  s[0:1], 0, v[4:5]                       // 000000000794: D1CA0000 00020880


  s_mov_b64     s[2:3], exec                            // 00000000079C: BE82047E


  s_mov_b64     s[10:11], exec                          // 0000000007A0: BE8A047E


label_01E9:


  v_readfirstlane_b32  s12, v2                          // 0000000007A4: 7E180502


  v_cmp_eq_u32  vcc, s12, v2                            // 0000000007A8: 7D84040C


  s_and_saveexec_b64  s[14:15], vcc                     // 0000000007AC: BE8E246A


  s_waitcnt     lgkmcnt(0)                              // 0000000007B0: BF8C007F


  v_readlane_b32  s12, v1, s12                          // 0000000007B4: 02181901


  s_cbranch_execz  label_01F2                           // 0000000007B8: BF880003


  v_mov_b32     v2, s12                                 // 0000000007BC: 7E04020C


  s_andn2_b64   s[10:11], s[10:11], exec                // 0000000007C0: 8A8A7E0A


  s_cbranch_scc0  label_01F4                            // 0000000007C4: BF840002


label_01F2:


  s_mov_b64     exec, s[10:11]                          // 0000000007C8: BEFE040A


  s_branch      label_01E9                              // 0000000007CC: BF82FFF5


label_01F4:


  s_mov_b64     exec, s[2:3]                            // 0000000007D0: BEFE0402


  v_cndmask_b32  v2, 0, v2, s[0:1]                      // 0000000007D4: D2000002 00020480


  v_add_i32     v1, vcc, v2, v1                         // 0000000007DC: 4A020302


  v_xor_b32     v2, 2, v3                               // 0000000007E0: 3A040682


  v_lshl_b64    v[4:5], 1, v2                           // 0000000007E4: D2C20004 00020481


  v_and_b32     v4, exec_lo, v4                         // 0000000007EC: 3608087E


  v_and_b32     v5, exec_hi, v5                         // 0000000007F0: 360A0A7F


  v_cmp_ne_u64  s[0:1], 0, v[4:5]                       // 0000000007F4: D1CA0000 00020880


  s_mov_b64     s[2:3], exec                            // 0000000007FC: BE82047E


  s_mov_b64     s[10:11], exec                          // 000000000800: BE8A047E


label_0201:


  v_readfirstlane_b32  s12, v2                          // 000000000804: 7E180502


  v_cmp_eq_u32  vcc, s12, v2                            // 000000000808: 7D84040C


  s_and_saveexec_b64  s[14:15], vcc                     // 00000000080C: BE8E246A


  v_readlane_b32  s12, v1, s12                          // 000000000810: 02181901


  s_cbranch_execz  label_0209                           // 000000000814: BF880003


  v_mov_b32     v2, s12                                 // 000000000818: 7E04020C


  s_andn2_b64   s[10:11], s[10:11], exec                // 00000000081C: 8A8A7E0A


  s_cbranch_scc0  label_020B                            // 000000000820: BF840002


label_0209:


  s_mov_b64     exec, s[10:11]                          // 000000000824: BEFE040A


  s_branch      label_0201                              // 000000000828: BF82FFF6


label_020B:


  s_mov_b64     exec, s[2:3]                            // 00000000082C: BEFE0402


  v_cndmask_b32  v2, 0, v2, s[0:1]                      // 000000000830: D2000002 00020480


  v_add_i32     v1, vcc, v2, v1                         // 000000000838: 4A020302


  v_xor_b32     v2, 4, v3                               // 00000000083C: 3A040684


  v_lshl_b64    v[4:5], 1, v2                           // 000000000840: D2C20004 00020481


  v_and_b32     v4, exec_lo, v4                         // 000000000848: 3608087E


  v_and_b32     v5, exec_hi, v5                         // 00000000084C: 360A0A7F


  v_cmp_ne_u64  s[0:1], 0, v[4:5]                       // 000000000850: D1CA0000 00020880


  s_mov_b64     s[2:3], exec                            // 000000000858: BE82047E


  s_mov_b64     s[10:11], exec                          // 00000000085C: BE8A047E


label_0218:


  v_readfirstlane_b32  s12, v2                          // 000000000860: 7E180502


  v_cmp_eq_u32  vcc, s12, v2                            // 000000000864: 7D84040C


  s_and_saveexec_b64  s[14:15], vcc                     // 000000000868: BE8E246A


  v_readlane_b32  s12, v1, s12                          // 00000000086C: 02181901


  s_cbranch_execz  label_0220                           // 000000000870: BF880003


  v_mov_b32     v2, s12                                 // 000000000874: 7E04020C


  s_andn2_b64   s[10:11], s[10:11], exec                // 000000000878: 8A8A7E0A


  s_cbranch_scc0  label_0222                            // 00000000087C: BF840002


label_0220:


  s_mov_b64     exec, s[10:11]                          // 000000000880: BEFE040A


  s_branch      label_0218                              // 000000000884: BF82FFF6


label_0222:


  s_mov_b64     exec, s[2:3]                            // 000000000888: BEFE0402


  v_cndmask_b32  v2, 0, v2, s[0:1]                      // 00000000088C: D2000002 00020480


  v_add_i32     v1, vcc, v2, v1                         // 000000000894: 4A020302


  v_xor_b32     v2, 32, v3                              // 000000000898: 3A0406A0


  v_xor_b32     v4, 16, v3                              // 00000000089C: 3A080690


  v_xor_b32     v3, 8, v3                               // 0000000008A0: 3A060688


  v_lshl_b64    v[5:6], 1, v3                           // 0000000008A4: D2C20005 00020681


  v_and_b32     v5, exec_lo, v5                         // 0000000008AC: 360A0A7E


  v_and_b32     v6, exec_hi, v6                         // 0000000008B0: 360C0C7F


  v_cmp_ne_u64  s[0:1], 0, v[5:6]                       // 0000000008B4: D1CA0000 00020A80


  s_mov_b64     s[2:3], exec                            // 0000000008BC: BE82047E


  s_mov_b64     s[10:11], exec                          // 0000000008C0: BE8A047E


label_0231:


  v_readfirstlane_b32  s12, v3                          // 0000000008C4: 7E180503


  v_cmp_eq_u32  vcc, s12, v3                            // 0000000008C8: 7D84060C


  s_and_saveexec_b64  s[14:15], vcc                     // 0000000008CC: BE8E246A


  v_readlane_b32  s12, v1, s12                          // 0000000008D0: 02181901


  s_cbranch_execz  label_0239                           // 0000000008D4: BF880003


  v_mov_b32     v3, s12                                 // 0000000008D8: 7E06020C


  s_andn2_b64   s[10:11], s[10:11], exec                // 0000000008DC: 8A8A7E0A


  s_cbranch_scc0  label_023B                            // 0000000008E0: BF840002


label_0239:


  s_mov_b64     exec, s[10:11]                          // 0000000008E4: BEFE040A


  s_branch      label_0231                              // 0000000008E8: BF82FFF6


label_023B:


  s_mov_b64     exec, s[2:3]                            // 0000000008EC: BEFE0402


  v_cndmask_b32  v3, 0, v3, s[0:1]                      // 0000000008F0: D2000003 00020680


  v_add_i32     v1, vcc, v3, v1                         // 0000000008F8: 4A020303


  v_lshl_b64    v[5:6], 1, v4                           // 0000000008FC: D2C20005 00020881


  v_and_b32     v5, exec_lo, v5                         // 000000000904: 360A0A7E


  v_and_b32     v6, exec_hi, v6                         // 000000000908: 360C0C7F


  v_cmp_ne_u64  s[0:1], 0, v[5:6]                       // 00000000090C: D1CA0000 00020A80


  s_mov_b64     s[2:3], exec                            // 000000000914: BE82047E


  s_mov_b64     s[10:11], exec                          // 000000000918: BE8A047E


label_0247:


  v_readfirstlane_b32  s12, v4                          // 00000000091C: 7E180504


  v_cmp_eq_u32  vcc, s12, v4                            // 000000000920: 7D84080C


  s_and_saveexec_b64  s[14:15], vcc                     // 000000000924: BE8E246A


  v_readlane_b32  s12, v1, s12                          // 000000000928: 02181901


  s_cbranch_execz  label_024F                           // 00000000092C: BF880003


  v_mov_b32     v3, s12                                 // 000000000930: 7E06020C


  s_andn2_b64   s[10:11], s[10:11], exec                // 000000000934: 8A8A7E0A


  s_cbranch_scc0  label_0251                            // 000000000938: BF840002


label_024F:


  s_mov_b64     exec, s[10:11]                          // 00000000093C: BEFE040A


  s_branch      label_0247                              // 000000000940: BF82FFF6


label_0251:


  s_mov_b64     exec, s[2:3]                            // 000000000944: BEFE0402


  v_cndmask_b32  v3, 0, v3, s[0:1]                      // 000000000948: D2000003 00020680


  v_add_i32     v1, vcc, v3, v1                         // 000000000950: 4A020303


  v_lshl_b64    v[3:4], 1, v2                           // 000000000954: D2C20003 00020481


  v_and_b32     v3, exec_lo, v3                         // 00000000095C: 3606067E


  v_and_b32     v4, exec_hi, v4                         // 000000000960: 3608087F


  v_cmp_ne_u64  s[0:1], 0, v[3:4]                       // 000000000964: D1CA0000 00020680


  s_mov_b64     s[2:3], exec                            // 00000000096C: BE82047E


  s_mov_b64     s[10:11], exec                          // 000000000970: BE8A047E


label_025D:


  v_readfirstlane_b32  s12, v2                          // 000000000974: 7E180502


  v_cmp_eq_u32  vcc, s12, v2                            // 000000000978: 7D84040C


  s_and_saveexec_b64  s[14:15], vcc                     // 00000000097C: BE8E246A


  v_readlane_b32  s12, v1, s12                          // 000000000980: 02181901


  s_cbranch_execz  label_0265                           // 000000000984: BF880003


  v_mov_b32     v2, s12                                 // 000000000988: 7E04020C


  s_andn2_b64   s[10:11], s[10:11], exec                // 00000000098C: 8A8A7E0A


  s_cbranch_scc0  label_0267                            // 000000000990: BF840002


label_0265:


  s_mov_b64     exec, s[10:11]                          // 000000000994: BEFE040A


  s_branch      label_025D                              // 000000000998: BF82FFF6


label_0267:


  s_mov_b64     exec, s[2:3]                            // 00000000099C: BEFE0402


  v_cndmask_b32  v2, 0, v2, s[0:1]                      // 0000000009A0: D2000002 00020480


  v_add_i32     v1, vcc, v2, v1                         // 0000000009A8: 4A020302


  s_branch      label_0313                              // 0000000009AC: BF8200A7


label_026C:


  v_cmp_eq_i32  s[0:1], v3, 0                           // 0000000009B0: D1040000 00010103


  v_add_i32     v2, vcc, 63, v3                         // 0000000009B8: 4A0406BF


  v_and_b32     v2, 63, v2                              // 0000000009BC: 360404BF


  v_lshl_b64    v[4:5], 1, v2                           // 0000000009C0: D2C20004 00020481


  v_and_b32     v4, exec_lo, v4                         // 0000000009C8: 3608087E


  v_and_b32     v5, exec_hi, v5                         // 0000000009CC: 360A0A7F


  v_cmp_ne_u64  vcc, 0, v[4:5]                          // 0000000009D0: 7DCA0880


  s_orn2_b64    s[0:1], s[0:1], vcc                     // 0000000009D4: 8B806A00


  s_mov_b64     s[10:11], exec                          // 0000000009D8: BE8A047E


  s_mov_b64     s[12:13], exec                          // 0000000009DC: BE8C047E


label_0278:


  v_readfirstlane_b32  s3, v2                           // 0000000009E0: 7E060502


  v_cmp_eq_u32  vcc, s3, v2                             // 0000000009E4: 7D840403


  s_and_saveexec_b64  s[14:15], vcc                     // 0000000009E8: BE8E246A


  s_waitcnt     lgkmcnt(0)                              // 0000000009EC: BF8C007F


  v_readlane_b32  s3, v1, s3                            // 0000000009F0: 02060701


  s_cbranch_execz  label_0281                           // 0000000009F4: BF880003


  v_mov_b32     v2, s3                                  // 0000000009F8: 7E040203


  s_andn2_b64   s[12:13], s[12:13], exec                // 0000000009FC: 8A8C7E0C


  s_cbranch_scc0  label_0283                            // 000000000A00: BF840002


label_0281:


  s_mov_b64     exec, s[12:13]                          // 000000000A04: BEFE040C


  s_branch      label_0278                              // 000000000A08: BF82FFF5


label_0283:


  s_mov_b64     exec, s[10:11]                          // 000000000A0C: BEFE040A


  v_cndmask_b32  v2, v2, 0, s[0:1]                      // 000000000A10: D2000002 00010102


  v_add_i32     v1, vcc, v2, v1                         // 000000000A18: 4A020302


  v_cmp_lt_u32  s[0:1], v3, 2                           // 000000000A1C: D1820000 00010503


  v_add_i32     v2, vcc, 62, v3                         // 000000000A24: 4A0406BE


  v_and_b32     v2, 63, v2                              // 000000000A28: 360404BF


  v_lshl_b64    v[4:5], 1, v2                           // 000000000A2C: D2C20004 00020481


  v_and_b32     v4, exec_lo, v4                         // 000000000A34: 3608087E


  v_and_b32     v5, exec_hi, v5                         // 000000000A38: 360A0A7F


  v_cmp_ne_u64  vcc, 0, v[4:5]                          // 000000000A3C: 7DCA0880


  s_orn2_b64    s[0:1], s[0:1], vcc                     // 000000000A40: 8B806A00


  s_mov_b64     s[10:11], exec                          // 000000000A44: BE8A047E


  s_mov_b64     s[12:13], exec                          // 000000000A48: BE8C047E


label_0293:


  v_readfirstlane_b32  s3, v2                           // 000000000A4C: 7E060502


  v_cmp_eq_u32  vcc, s3, v2                             // 000000000A50: 7D840403


  s_and_saveexec_b64  s[14:15], vcc                     // 000000000A54: BE8E246A


  v_readlane_b32  s3, v1, s3                            // 000000000A58: 02060701


  s_cbranch_execz  label_029B                           // 000000000A5C: BF880003


  v_mov_b32     v2, s3                                  // 000000000A60: 7E040203


  s_andn2_b64   s[12:13], s[12:13], exec                // 000000000A64: 8A8C7E0C


  s_cbranch_scc0  label_029D                            // 000000000A68: BF840002


label_029B:


  s_mov_b64     exec, s[12:13]                          // 000000000A6C: BEFE040C


  s_branch      label_0293                              // 000000000A70: BF82FFF6


label_029D:


  s_mov_b64     exec, s[10:11]                          // 000000000A74: BEFE040A


  v_cndmask_b32  v2, v2, 0, s[0:1]                      // 000000000A78: D2000002 00010102


  v_add_i32     v1, vcc, v2, v1                         // 000000000A80: 4A020302


  v_cmp_lt_u32  s[0:1], v3, 4                           // 000000000A84: D1820000 00010903


  v_add_i32     v2, vcc, 60, v3                         // 000000000A8C: 4A0406BC


  v_and_b32     v2, 63, v2                              // 000000000A90: 360404BF


  v_lshl_b64    v[4:5], 1, v2                           // 000000000A94: D2C20004 00020481


  v_and_b32     v4, exec_lo, v4                         // 000000000A9C: 3608087E


  v_and_b32     v5, exec_hi, v5                         // 000000000AA0: 360A0A7F


  v_cmp_ne_u64  vcc, 0, v[4:5]                          // 000000000AA4: 7DCA0880


  s_orn2_b64    s[0:1], s[0:1], vcc                     // 000000000AA8: 8B806A00


  s_mov_b64     s[10:11], exec                          // 000000000AAC: BE8A047E


  s_mov_b64     s[12:13], exec                          // 000000000AB0: BE8C047E


label_02AD:


  v_readfirstlane_b32  s3, v2                           // 000000000AB4: 7E060502


  v_cmp_eq_u32  vcc, s3, v2                             // 000000000AB8: 7D840403


  s_and_saveexec_b64  s[14:15], vcc                     // 000000000ABC: BE8E246A


  v_readlane_b32  s3, v1, s3                            // 000000000AC0: 02060701


  s_cbranch_execz  label_02B5                           // 000000000AC4: BF880003


  v_mov_b32     v2, s3                                  // 000000000AC8: 7E040203


  s_andn2_b64   s[12:13], s[12:13], exec                // 000000000ACC: 8A8C7E0C


  s_cbranch_scc0  label_02B7                            // 000000000AD0: BF840002


label_02B5:


  s_mov_b64     exec, s[12:13]                          // 000000000AD4: BEFE040C


  s_branch      label_02AD                              // 000000000AD8: BF82FFF6


label_02B7:


  s_mov_b64     exec, s[10:11]                          // 000000000ADC: BEFE040A


  v_cndmask_b32  v2, v2, 0, s[0:1]                      // 000000000AE0: D2000002 00010102


  v_add_i32     v4, vcc, 56, v3                         // 000000000AE8: 4A0806B8


  v_add_i32     v1, vcc, v2, v1                         // 000000000AEC: 4A020302


  v_cmp_lt_u32  s[0:1], v3, 8                           // 000000000AF0: D1820000 00011103


  v_and_b32     v2, 63, v4                              // 000000000AF8: 360408BF


  v_lshl_b64    v[4:5], 1, v2                           // 000000000AFC: D2C20004 00020481


  v_and_b32     v4, exec_lo, v4                         // 000000000B04: 3608087E


  v_and_b32     v5, exec_hi, v5                         // 000000000B08: 360A0A7F


  v_cmp_ne_u64  vcc, 0, v[4:5]                          // 000000000B0C: 7DCA0880


  s_orn2_b64    s[0:1], s[0:1], vcc                     // 000000000B10: 8B806A00


  s_mov_b64     s[10:11], exec                          // 000000000B14: BE8A047E


  s_mov_b64     s[12:13], exec                          // 000000000B18: BE8C047E


label_02C7:


  v_readfirstlane_b32  s3, v2                           // 000000000B1C: 7E060502


  v_cmp_eq_u32  vcc, s3, v2                             // 000000000B20: 7D840403


  s_and_saveexec_b64  s[14:15], vcc                     // 000000000B24: BE8E246A


  v_readlane_b32  s3, v1, s3                            // 000000000B28: 02060701


  s_cbranch_execz  label_02CF                           // 000000000B2C: BF880003


  v_mov_b32     v2, s3                                  // 000000000B30: 7E040203


  s_andn2_b64   s[12:13], s[12:13], exec                // 000000000B34: 8A8C7E0C


  s_cbranch_scc0  label_02D1                            // 000000000B38: BF840002


label_02CF:


  s_mov_b64     exec, s[12:13]                          // 000000000B3C: BEFE040C


  s_branch      label_02C7                              // 000000000B40: BF82FFF6


label_02D1:


  s_mov_b64     exec, s[10:11]                          // 000000000B44: BEFE040A


  v_cndmask_b32  v2, v2, 0, s[0:1]                      // 000000000B48: D2000002 00010102


  v_add_i32     v4, vcc, 48, v3                         // 000000000B50: 4A0806B0


  v_and_b32     v4, 63, v4                              // 000000000B54: 360808BF


  v_add_i32     v1, vcc, v2, v1                         // 000000000B58: 4A020302


  v_cmp_lt_u32  s[0:1], v3, 16                          // 000000000B5C: D1820000 00012103


  v_lshl_b64    v[5:6], 1, v4                           // 000000000B64: D2C20005 00020881


  v_and_b32     v5, exec_lo, v5                         // 000000000B6C: 360A0A7E


  v_and_b32     v6, exec_hi, v6                         // 000000000B70: 360C0C7F


  v_cmp_ne_u64  vcc, 0, v[5:6]                          // 000000000B74: 7DCA0A80


  s_orn2_b64    s[0:1], s[0:1], vcc                     // 000000000B78: 8B806A00


  s_mov_b64     s[10:11], exec                          // 000000000B7C: BE8A047E


  s_mov_b64     s[12:13], exec                          // 000000000B80: BE8C047E


label_02E1:


  v_readfirstlane_b32  s3, v4                           // 000000000B84: 7E060504


  v_cmp_eq_u32  vcc, s3, v4                             // 000000000B88: 7D840803


  s_and_saveexec_b64  s[14:15], vcc                     // 000000000B8C: BE8E246A


  v_readlane_b32  s3, v1, s3                            // 000000000B90: 02060701


  s_cbranch_execz  label_02E9                           // 000000000B94: BF880003


  v_mov_b32     v2, s3                                  // 000000000B98: 7E040203


  s_andn2_b64   s[12:13], s[12:13], exec                // 000000000B9C: 8A8C7E0C


  s_cbranch_scc0  label_02EB                            // 000000000BA0: BF840002


label_02E9:


  s_mov_b64     exec, s[12:13]                          // 000000000BA4: BEFE040C


  s_branch      label_02E1                              // 000000000BA8: BF82FFF6


label_02EB:


  s_mov_b64     exec, s[10:11]                          // 000000000BAC: BEFE040A


  v_cndmask_b32  v2, v2, 0, s[0:1]                      // 000000000BB0: D2000002 00010102


  v_add_i32     v4, vcc, 32, v3                         // 000000000BB8: 4A0806A0


  v_and_b32     v4, 63, v4                              // 000000000BBC: 360808BF


  v_add_i32     v1, vcc, v2, v1                         // 000000000BC0: 4A020302


  v_cmp_lt_u32  s[0:1], v3, 32                          // 000000000BC4: D1820000 00014103


  v_lshl_b64    v[2:3], 1, v4                           // 000000000BCC: D2C20002 00020881


  v_and_b32     v2, exec_lo, v2                         // 000000000BD4: 3604047E


  v_and_b32     v3, exec_hi, v3                         // 000000000BD8: 3606067F


  v_cmp_ne_u64  vcc, 0, v[2:3]                          // 000000000BDC: 7DCA0480


  s_orn2_b64    s[0:1], s[0:1], vcc                     // 000000000BE0: 8B806A00


  s_mov_b64     s[10:11], exec                          // 000000000BE4: BE8A047E


  s_mov_b64     s[12:13], exec                          // 000000000BE8: BE8C047E


label_02FB:


  v_readfirstlane_b32  s3, v4                           // 000000000BEC: 7E060504


  v_cmp_eq_u32  vcc, s3, v4                             // 000000000BF0: 7D840803


  s_and_saveexec_b64  s[14:15], vcc                     // 000000000BF4: BE8E246A


  v_readlane_b32  s3, v1, s3                            // 000000000BF8: 02060701


  s_cbranch_execz  label_0303                           // 000000000BFC: BF880003


  v_mov_b32     v2, s3                                  // 000000000C00: 7E040203


  s_andn2_b64   s[12:13], s[12:13], exec                // 000000000C04: 8A8C7E0C


  s_cbranch_scc0  label_0305                            // 000000000C08: BF840002


label_0303:


  s_mov_b64     exec, s[12:13]                          // 000000000C0C: BEFE040C


  s_branch      label_02FB                              // 000000000C10: BF82FFF6


label_0305:


  s_mov_b64     exec, s[10:11]                          // 000000000C14: BEFE040A


  v_cndmask_b32  v2, v2, 0, s[0:1]                      // 000000000C18: D2000002 00010102


  s_add_u32     s0, s2, -1                              // 000000000C20: 8000C102


  v_add_i32     v1, vcc, v2, v1                         // 000000000C24: 4A020302


  v_mov_b32     v2, s0                                  // 000000000C28: 7E040200


  v_lshl_b64    v[2:3], 1, v2                           // 000000000C2C: D2C20002 00020481


  v_and_b32     v2, exec_lo, v2                         // 000000000C34: 3604047E


  v_and_b32     v3, exec_hi, v3                         // 000000000C38: 3606067F


  v_cmp_ne_u64  vcc, 0, v[2:3]                          // 000000000C3C: 7DCA0480


  v_readlane_b32  s0, v1, s0                            // 000000000C40: 02000101


  v_mov_b32     v1, s0                                  // 000000000C44: 7E020200


  v_cndmask_b32  v1, 0, v1, vcc                         // 000000000C48: 00020280


label_0313:


  s_mov_b64     s[0:1], exec                            // 000000000C4C: BE80047E


  s_andn2_b64   exec, s[0:1], s[4:5]                    // 000000000C50: 8AFE0400


  v_mov_b32     v2, 0                                   // 000000000C54: 7E040280


  ds_write_b32  v2, v1                                  // 000000000C58: D8340000 00000102


label_0318:


  s_mov_b64     exec, s[8:9]                            // 000000000C60: BEFE0408


  s_waitcnt     lgkmcnt(0)                              // 000000000C64: BF8C007F


  s_barrier                                             // 000000000C68: BF8A0000


  v_mov_b32     v1, 0                                   // 000000000C6C: 7E020280


  ds_read_b32   v7, v1                                  // 000000000C70: D8D80000 07000001


  s_waitcnt     lgkmcnt(0)                              // 000000000C78: BF8C007F


  s_barrier                                             // 000000000C7C: BF8A0000


label_0320:


  v_cmp_eq_i32  vcc, 0, v0                              // 000000000C80: 7D040080


  s_and_saveexec_b64  s[0:1], vcc                       // 000000000C84: BE80246A


  s_cbranch_execz  label_0329                           // 000000000C88: BF880006


  s_load_dwordx2  s[2:3], s[6:7], 0x0c                  // 000000000C8C: C041070C


  s_waitcnt     lgkmcnt(0)                              // 000000000C90: BF8C007F


  v_mov_b32     v2, s2                                  // 000000000C94: 7E040202


  v_mov_b32     v3, s3                                  // 000000000C98: 7E060203


  flat_atomic_add  v0, v[2:3], v7                       // 000000000C9C: DCC80000 00000702


label_0329:


  s_endpgm                                              // 000000000CA4: BF810000


end


; ----------------- CS Data ------------------------




codeLenInByte        = 3752 bytes;


objectHeaderByteSize = 512 bytes;




; launchModeFlags    = 0x00000200


isAql                = TRUE;




userElementCount     = 0;


extUserElementCount  = 0;


NumVgprs             = 10;


NumSgprs             = 23;


FloatMode            = 192;


IeeeMode             = 0;


FlatPtr32            = 0;


ScratchSize          = 0 dwords/thread;


LDSByteSize          = 32 bytes/workgroup (compile time only);


ScratchWaveOffsetReg = s65535;


; texSamplerUsage         = 0x00000000


; constBufUsage           = 0x00000000




;COMPUTE_PGM_RSRC2       = 0x00009390


COMPUTE_PGM_RSRC2:USER_SGPR      = 8


COMPUTE_PGM_RSRC2:TGID_X_EN      = 1


COMPUTE_PGM_RSRC2:TGID_Y_EN      = 1


COMPUTE_PGM_RSRC2:TGID_Z_EN      = 1


COMPUTE_PGM_RSRC2:TIDIG_COMP_CNT = 2


COMPUTE_PGM_RSRC2:LDS_SIZE       = 1


; ---------------------------- HSA Code Descriptor ----------------------------


code_type                                      = HSA_EXT_CODE_KERNEL


workgroup_group_segment_byte_size              = 32


kernarg_segment_byte_size                      = 56


code.handle                                    = 256%


kernarg_segment_alignment                      = 3


group_segment_alignment                        = 3


wavefront_size                                 = 6


program_call_convention                        = 0


module.handle                                  = 0


symbol                                         = 3932


hsail_profile                                  = HSA_EXT_BRIG_PROFILE_FULL


hsail_machine_model                            = HSA_EXT_BRIG_MACHINE_LARGE


debug_information.handle                       = 0


agent_vendor                                   = AMD


agent_name                                     = gfx7


hsail_version_major                            = 0


hsail_version_minor                            = 20140528


; ---------------------------- AMD Code Descriptor ----------------------------


amd_code_version_major                         = 0


amd_code_version_minor                         = 0


struct_byte_size                               = 256


target_chip                                    = gfx7


kernel_code_entry_byte_offset                  = 256


max_scratch_backing_memory_byte_size           = 0


; compute_pgm_rsrc1:


  granulated_workitem_vgpr_count                       = 2


  granulated_wavefront_sgpr_count                      = 3


  priority                                             = 0


  float_mode_round_32                                  = ROUND_TO_NEAREST_EVEN


  float_mode_round_16_64                               = ROUND_TO_NEAREST_EVEN


  float_mode_denorm_32                                 = FLUSH_SOURCE_OUTPUT


  float_mode_denorm_16_64                              = FLUSH_NONE


; compute_pgm_rsrc2:


  user_sgpr_count                                      = 8


  enable_sgpr_workgroup_id_x                           = TRUE


  enable_sgpr_workgroup_id_y                           = TRUE


  enable_sgpr_workgroup_id_z                           = TRUE


  enable_vgpr_workitem_id                              = X, Y, Z


  granulated_lds_size                                  = 0


enable_sgpr_private_segment_buffer             = TRUE


enable_sgpr_dispatch_ptr                       = TRUE


enable_sgpr_kernarg_segment_ptr                = TRUE


private_element_size                           = DWORD (4 bytes)


is_ptr64                                       = TRUE


gds_segment_byte_size                          = 0


wavefront_sgpr_count                           = 23


workitem_vgpr_count                            = 10


kernarg_segment_byte_size                      = 56


workgroup_group_segment_byte_size              = 32


kernarg_segment_alignment                      = 3


group_segment_alignment                        = 3


code_alignment                                 = 3


code_type                                      = HSA_EXT_CODE_KERNEL


wavefront_size                                 = 6


optimization_level                             = 2


hsail_profile                                  = HSA_EXT_BRIG_PROFILE_FULL


hsail_machine_model                            = HSA_EXT_BRIG_MACHINE_LARGE


hsail_version_major                            = 0


hsail_version_minor                            = 20140528


Such long code leads to low performance reductions. What is the actual cause of all these instructions? Does it have to do with the supporting ISA of Bonaire? Does it support the so called "swizzle operations" for exchanging data between lanes in the same wavefront? Or is it just a driver bug/limitation?

0 Likes

Thank you. Just wanted to let you know that we are looking into it and will get back to you.

Regards,

Ravi

0 Likes

Here is an update. A internal bug report has been filed against this issue. Will keep you updated. Thanks for your patience.

0 Likes

Great!

I also tried with the latest driver (15.7). I have to report that though the performance has been improved it does not match the shared memory version. On a Bonaire GPU using partially shared memory and subgroup functions exhibits performance of about 67% of the shared memory only and using the workgroup function shows performance of about a third (34%) of the shared memory only version.

Workgroup and sub-workgroup OpenCL 2.0 function evaluation test case

Platform/Device selection

Total platforms: 2

1. NVIDIA CUDA

        1. GeForce GTX 660/NVIDIA Corporation

2. AMD Accelerated Parallel Processing

        1. Bonaire/Advanced Micro Devices, Inc.

        2. Intel(R) Pentium(R) 4 CPU 3.06GHz/GenuineIntel

Select platform index: 2

Select device index: 1

Device info

Platform:       AMD Accelerated Parallel Processing

Device:         Bonaire

Driver version: 1800.5 (VM)

OpenCL version: OpenCL 2.0 AMD-APP (1800.5)

Great! OpenCL 2.0 is supported 🙂

Building kernel with options "-cl-std=CL2.0 -cl-uniform-work-group-size -DK3 -DK2 -DWAVEFRONT_SIZE=64"

1. Shared memory only kernel

Executing...Done!

Output: 4294901760 / Time: 0.0706873 msecs (1.85425 billion elements/second)

PASSED!

2. Hybrid kernel via subgroup functions

Executing...Done!

Output: 4294901760 / Time: 0.105353 msecs (1.24413 billion elements/second)

Relative speed-up with respect to kernel 1: 0.67096

PASSED!

3. Workgroup function kernel

Executing...Done!

Output: 4294901760 / Time: 0.203889 msecs (0.64286 billion elements/second)

Relative speed-up with respect to kernel 1: 0.346695

PASSED!

0 Likes
pszilard
Adept I

Let me revive this thread. I would have great use of lane shuffle operations exposed, but to most purposes the OpenCL 2.0 workgroup reduction (and some voting ops) would be enough. However, i'd have to be able to trust the implementation do the right thing.

Except that even though the OpenCL 2.0 driver is (officially) out of beta, the workgroup funtions' implementation in the OpenCL 2.0 compiler/runtime seems to still be rather poor. With the OPs microbenchmark, for some workload sizes the "hybrid" reduction can be up to 20% faster on Fiji, but for other sizes it's slower. On Fiji the pure wg version seemed to always be 1.5-2x slower than the shmem-based reduction.

On Hawaii I get very poor performance, the hybrid version is 2-4x slower, the pure WG version 6-10x slower.

I had a hunch and tweaked the hybrid version to let the local memory-based reduction run down to 32 and 16 and do sub_group_reduce_add() across half and quarter vawefront only. Interestingly, this improved the performance on Hawaii; best case I got nearly 2x with sub_group_reduce across a qurter wave on Hawaii with large global work sizes (making Hawaii hierarchical reduction best case 10% slower than local mem-based). On Fiji the same tweaks reduced performance quite a bit.

What's strange is that explicit unrolling of the local mem reduction loop causes the performance to regress. Any ideas why?

All in all, my results show a pretty uncomfortable reality: the advantages of the OCL 2.0 wg functions are greatly diminished by the quality of the compiler / runtime.

0 Likes

According to my experiments on a 3rd generation GCN GPU (Tonga) I confirm that subgroup reductions provided slightly improved performance compared to shared mem reduction. However, workgroup reduction is always slower. Maybe workgroup data sharing is not effective without the use of shared memory. Furthermore, using a 2nd gen GCN GPU (Bonaire) it seems it is always faster to use shared memory reductions and by a wider margin.

OpenCL is about performance and getting slower performance by using new functions practically makes them meaningless.

0 Likes

Thanks for the feedback! I guess you used the default global work size hard-coded in the source, right? For a thorough tests it would make sense to generate performance as a function of work size (I did see a slight dependence). However, as long as the implementation (and/or docs) does not improve, the current feedback should be enough incentive for AMD to take action.

Have you tried the subgroup reduction across half or quarter wave? That improved performance for me on Hawaii.

0 Likes

Both workgroup and NDrange sizes are hardcoded in the source (256 and 256x64x8=131072, respectively). In order to keep it simple I didn't apply any scaling on the NDRange, just a fixed one. At least this performance behavior is not intolerable on GCN 3rd gen devices.

I guess that partial reductions on wave are not valid. In paragraph "9.17.3.4" of the opencl-2.0-extensions manual by Khronos it states:

The OpenCL C programming language implements the following built-in functions that operate on a sub-group level. These built-in functions must be encountered by all work-items in a subgroup executing the kernel.

These functions include subgroup reductions.

0 Likes

This issue has not escaped our attention, but there is a great deal of competition for that attention.  Also, we don't have quite the flexibility of implementation in the library as do OpenCL kernel developers.  Finally, cross-lane support is not very strong prior to "Volcanic Islands" ASICs.

We are planning to improve our implementation in an upcoming release.

0 Likes

Thank you for your response.

The experimental results I've got are consistent with your statement on VI ASICs.

Any improvement on the efficiency of workgroup/subgroup functions is welcome.

0 Likes

> Finally, cross-lane support is not very strong prior to "Volcanic Islands" ASICs.

Could you elaborate what does that mean, e.g. in terms of bytes/clock shifted across SIMD lanes? I can't find much information except some slides that were floating around on the Internet (can't find them right now) that claimed improved lane shuffle in Fiji, I think.

> We are planning to improve our implementation in an upcoming release.

Improving workgroup reduction to make it at least match the local-memory-based implementations is in my opinion important if not deal-breaker to adoption.

However, in my opinion, you should strongly consider exposing ASAP cross-lane operations through a CL extension. This is a primitive that can allow elegant implementation of many algorithms/parallel primitives. Given how great has been the reception of "warp shuffle" introduced by NV 2 gen ago, I expect that your dev community would applaud and adopt the feature quickly.

0 Likes