16 Replies Latest reply on Mar 31, 2016 2:26 PM by pszilard

    Poor workgroup reduction function performance (OpenCL 2.0)

    ekondis

      I had previously reported this but here I provide a test case for the examination of workgroup reduction function. Kernels perform a workgroup reduction in 3 ways:

      1) The classical one with shared memory (OpenCL 1.2)

      2) Shared memory plus sub-group reduction function on the final stage

      3) Workgroup reduction function (no shared memory at all)

       

      I tested it on a R7-260X and the latter two kernels prove to be significantly slower than reduction in shared memory. The last one especially is more than 5 times slower than using pure shared memory. This fact eliminates the value of these new functions in OpenCL 2.0. AFAIK, GCN GPUs feature swizzle operations which would could potentially make workgroup functions quite efficient. This is not the case however.

       

      In addition the CodeXL 1.6 static kernel analyser does not support OpenCL 2.0 kernels. Therefore, I cannot investigate the disassembled kernel code.

       

      Code on Github: https://github.com/ekondis/cl2-reduce-bench

      More details on blog: http://parallelplusplus.blogspot.gr/2014/12/workgroup-reduction-function-evaluation.html

        • Re: Poor workgroup reduction function performance (OpenCL 2.0)
          dipak

          Thanks for sharing the testcase. We'll check and get back to you.

           

          Regards,

          • Re: Poor workgroup reduction function performance (OpenCL 2.0)
            ekondis

            I usually don't trust any measurement taking less than 4-5 ms. On my system, those are normal fluctuations! I assume using profile mode takes care of that...

            ... I'm pretty sure CL profile mode has been invented to be used, I don't trust it. As most APIs have historically been easygoing on the profile data, I wouldn't take for granted the results are coherent...

            besides I cannot exclude my app had a small overhead with profile on.

             

            While the problem of reductions might be useful for some cases, I don't see value in a kernel doing only this, it's not relevant to me. In general, I've noticed using advanced functionality makes simple kernels slower (I have been told they take more time to "set up") while making complex kernels usually faster. This kernel is super simple, its only memory access is in writing out the result. There's no other workload and this severely hampers HW ability to do things optimally.

            The results seem to be quite stable. No significant fluctuations between executions have been observed. Even if profiling contributes extra latency it should be mostly identical on all kernel executions. And it's a comparison between the three executions after all.

             

            The kernel is deliberately simple as I want to expose the performance of the specific workgroup function and not to include other factors that might affect performance. The three kernels differ only in the reduction part since it is the focus of the comparison.

             

            Just to mention that I've seen in a few known GPU benchmarks kernel executions that take just a couple of milliseconds (e.g. Rodinia).

            • Re: Poor workgroup reduction function performance (OpenCL 2.0)
              ekondis

              Unfortunately, I still haven't got any reply on the issue.

               

              However, I got a reminder note about saving temporary files with the "-save-temps" compilation option, so I was able to get the disassembly code of all 3 kernels.

               

              The results are quite revealing of the computational cost each kernel. I'll focus on the last kernel which makes use of the workgroup reduction function. The kernel's code is quite simple:

               

              __kernel void reductionWkgrp(__global uint *result) {
                const uint id = get_global_id(0);
                const uint lid = get_local_id(0);
                uint res = id;
                // workgroup reduction (introduced in OpenCL 2.0)
                res = work_group_reduce_add(res);
                // atomic reduce in global memory
                if(lid==0)
                  atomic_add(result, res);
              }
              

               

              The generated disassembly code on a Bonaire GPU is enormous:

              ShaderType = IL_SHADER_COMPUTE
              TargetChip = t
              ; ------------- SC_SRCSHADER Dump ------------------
              SC_SHADERSTATE: u32NumIntVSConst = 0
              SC_SHADERSTATE: u32NumIntPSConst = 0
              SC_SHADERSTATE: u32NumIntGSConst = 0
              SC_SHADERSTATE: u32NumBoolVSConst = 0
              SC_SHADERSTATE: u32NumBoolPSConst = 0
              SC_SHADERSTATE: u32NumBoolGSConst = 0
              SC_SHADERSTATE: u32NumFloatVSConst = 0
              SC_SHADERSTATE: u32NumFloatPSConst = 0
              SC_SHADERSTATE: u32NumFloatGSConst = 0
              u32SCOptions[0] = 0x00680000 SCOption_IGNORE_SAMPLE_L_BUG SCOption_FLOAT_DO_NOT_DIST SCOption_FLOAT_DO_NOT_REASSOC
              u32SCOptions[1] = 0x00000000
              u32SCOptions[2] = 0x08000000 SCOption_R1000_READLANE_SMRD_WORKAROUND_BUG343479
              u32SCOptions[3] = 0x00001284 SCOption_R1000_BARRIER_WORKAROUND_BUG405404 SCOption_R1000_IFORK_BACKJUMP_WORKAROUND_BUG368004 SCOption_R1000R1100_VCCZ_CLOBBER_WORKAROUND_BUG457939 SCOption_R1100_FLAT_SCRATCH_OOB_WORKAROUND_BUG379895
              u32SCOptions[4] = 0x00000000
              ; -------- Disassembly --------------------
              shader main
                asic(CI)
                type(CS)
              
              
                s_load_dwordx4  s[12:15], s[4:5], 0x01                // 000000000000: C0860501
                s_mov_b32     m0, 0x00010000                          // 000000000004: BEFC03FF 00010000
                s_waitcnt     lgkmcnt(0)                              // 00000000000C: BF8C007F
                s_bfe_u32     s11, s12, 0x00100000                    // 000000000010: 938BFF0C 00100000
                s_bfe_u32     s12, s12, 0x00100010                    // 000000000018: 938CFF0C 00100010
                s_bfe_u32     s13, s13, 0x00100000                    // 000000000020: 938DFF0D 00100000
                v_mul_lo_u32  v2, v2, s12                             // 000000000028: D2D20002 00001902
                v_add_i32     v1, vcc, v2, v1                         // 000000000030: 4A020302
                v_mul_lo_u32  v1, v1, s11                             // 000000000034: D2D20001 00001701
                v_add_i32     v1, vcc, v1, v0                         // 00000000003C: 4A020101
                s_cmp_eq_u32  0, s11                                  // 000000000040: BF060B80
                s_cselect_b32  s0, -1, 0                              // 000000000044: 850080C1
                s_cmp_eq_u32  0, s11                                  // 000000000048: BF060B80
                s_cbranch_scc1  label_0030                            // 00000000004C: BF85001C
                s_cmp_gt_u32  s11, s14                                // 000000000050: BF080E0B
                s_cbranch_scc1  label_0030                            // 000000000054: BF85001A
                s_flbit_i32_b32  s1, s11                              // 000000000058: BE81150B
                s_flbit_i32_b32  s2, s14                              // 00000000005C: BE82150E
                s_sub_i32     s20, s1, s2                             // 000000000060: 81940201
                s_lshl_b32    s1, s11, s20                            // 000000000064: 8F01140B
                s_sub_u32     s17, s14, s1                            // 000000000068: 8091010E
                s_cselect_b32  s21, s14, s17                          // 00000000006C: 8515110E
                s_cselect_b32  s18, 0, 1                              // 000000000070: 85128180
                s_add_u32     s3, s0, s18                             // 000000000074: 80031200
                s_cmp_eq_u32  0, s20                                  // 000000000078: BF061480
                s_cbranch_scc0  label_0022                            // 00000000007C: BF840002
                s_mov_b32     s0, s3                                  // 000000000080: BE800303
                s_branch      label_0030                              // 000000000084: BF82000E
              label_0022:
                s_mov_b32     s2, s3                                  // 000000000088: BE820303
              label_0023:
                s_add_u32     s2, s2, s2                              // 00000000008C: 80020202
                s_lshr_b32    s1, s1, 1                               // 000000000090: 90018101
                s_sub_u32     s18, s21, s1                            // 000000000094: 80920115
                s_cselect_b32  s17, s21, s18                          // 000000000098: 85111215
                s_cselect_b32  s19, 0, 1                              // 00000000009C: 85138180
                s_add_u32     s16, s2, s19                            // 0000000000A0: 80101302
                s_addk_i32    s20, 0xffff                             // 0000000000A4: B794FFFF
                s_cmp_le_u32  s20, 0                                  // 0000000000A8: BF0B8014
                s_cbranch_scc1  label_002F                            // 0000000000AC: BF850003
                s_mov_b32     s2, s16                                 // 0000000000B0: BE820310
                s_mov_b32     s21, s17                                // 0000000000B4: BE950311
                s_branch      label_0023                              // 0000000000B8: BF82FFF4
              label_002F:
                s_mov_b32     s0, s16                                 // 0000000000BC: BE800310
              label_0030:
                s_mul_i32     s1, s0, s11                             // 0000000000C0: 93010B00
                s_sub_i32     s1, s14, s1                             // 0000000000C4: 8181010E
                s_cmp_eq_u32  0, s12                                  // 0000000000C8: BF060C80
                s_cselect_b32  s2, -1, 0                              // 0000000000CC: 850280C1
                s_cmp_eq_u32  0, s12                                  // 0000000000D0: BF060C80
                s_cbranch_scc1  label_0052                            // 0000000000D4: BF85001C
                s_cmp_gt_u32  s12, s15                                // 0000000000D8: BF080F0C
                s_cbranch_scc1  label_0052                            // 0000000000DC: BF85001A
                s_flbit_i32_b32  s3, s12                              // 0000000000E0: BE83150C
                s_flbit_i32_b32  s14, s15                             // 0000000000E4: BE8E150F
                s_sub_i32     s21, s3, s14                            // 0000000000E8: 81950E03
                s_lshl_b32    s3, s12, s21                            // 0000000000EC: 8F03150C
                s_sub_u32     s18, s15, s3                            // 0000000000F0: 8092030F
                s_cselect_b32  s22, s15, s18                          // 0000000000F4: 8516120F
                s_cselect_b32  s19, 0, 1                              // 0000000000F8: 85138180
                s_add_u32     s16, s2, s19                            // 0000000000FC: 80101302
                s_cmp_eq_u32  0, s21                                  // 000000000100: BF061580
                s_cbranch_scc0  label_0044                            // 000000000104: BF840002
                s_mov_b32     s2, s16                                 // 000000000108: BE820310
                s_branch      label_0052                              // 00000000010C: BF82000E
              label_0044:
                s_mov_b32     s14, s16                                // 000000000110: BE8E0310
              label_0045:
                s_add_u32     s14, s14, s14                           // 000000000114: 800E0E0E
                s_lshr_b32    s3, s3, 1                               // 000000000118: 90038103
                s_sub_u32     s19, s22, s3                            // 00000000011C: 80930316
                s_cselect_b32  s18, s22, s19                          // 000000000120: 85121316
                s_cselect_b32  s20, 0, 1                              // 000000000124: 85148180
                s_add_u32     s17, s14, s20                           // 000000000128: 8011140E
                s_addk_i32    s21, 0xffff                             // 00000000012C: B795FFFF
                s_cmp_le_u32  s21, 0                                  // 000000000130: BF0B8015
                s_cbranch_scc1  label_0051                            // 000000000134: BF850003
                s_mov_b32     s14, s17                                // 000000000138: BE8E0311
                s_mov_b32     s22, s18                                // 00000000013C: BE960312
                s_branch      label_0045                              // 000000000140: BF82FFF4
              label_0051:
                s_mov_b32     s2, s17                                 // 000000000144: BE820311
              label_0052:
                s_mul_i32     s3, s2, s12                             // 000000000148: 93030C02
                s_sub_i32     s3, s15, s3                             // 00000000014C: 8183030F
                s_cmp_eq_u32  0, s13                                  // 000000000150: BF060D80
                s_cselect_b32  s17, -1, 0                             // 000000000154: 851180C1
                s_load_dword  s4, s[4:5], 0x05                        // 000000000158: C0020505
                s_cmp_eq_u32  0, s13                                  // 00000000015C: BF060D80
                s_cbranch_scc1  label_0075                            // 000000000160: BF85001C
                s_waitcnt     lgkmcnt(0)                              // 000000000164: BF8C007F
                s_cmp_gt_u32  s13, s4                                 // 000000000168: BF08040D
                s_cbranch_scc1  label_0075                            // 00000000016C: BF850019
                s_flbit_i32_b32  s5, s13                              // 000000000170: BE85150D
                s_flbit_i32_b32  s15, s4                              // 000000000174: BE8F1504
                s_sub_i32     s5, s5, s15                             // 000000000178: 81850F05
                s_lshl_b32    s21, s13, s5                            // 00000000017C: 8F15050D
                s_sub_u32     s18, s4, s21                            // 000000000180: 80921504
                s_cselect_b32  s22, s4, s18                           // 000000000184: 85161204
                s_cselect_b32  s19, 0, 1                              // 000000000188: 85138180
                s_add_u32     s16, s17, s19                           // 00000000018C: 80101311
                s_cmp_eq_u32  0, s5                                   // 000000000190: BF060580
                s_cbranch_scc0  label_0068                            // 000000000194: BF840002
                s_mov_b32     s17, s16                                // 000000000198: BE910310
                s_branch      label_0075                              // 00000000019C: BF82000D
              label_0068:
                s_mov_b32     s15, s16                                // 0000000001A0: BE8F0310
              label_0069:
                s_add_u32     s15, s15, s15                           // 0000000001A4: 800F0F0F
                s_lshr_b32    s21, s21, 1                             // 0000000001A8: 90158115
                s_sub_u32     s19, s22, s21                           // 0000000001AC: 80931516
                s_cselect_b32  s18, s22, s19                          // 0000000001B0: 85121316
                s_cselect_b32  s20, 0, 1                              // 0000000001B4: 85148180
                s_add_u32     s17, s15, s20                           // 0000000001B8: 8011140F
                s_addk_i32    s5, 0xffff                              // 0000000001BC: B785FFFF
                s_cmp_le_u32  s5, 0                                   // 0000000001C0: BF0B8005
                s_cbranch_scc1  label_0075                            // 0000000001C4: BF850003
                s_mov_b32     s15, s17                                // 0000000001C8: BE8F0311
                s_mov_b32     s22, s18                                // 0000000001CC: BE960312
                s_branch      label_0069                              // 0000000001D0: BF82FFF4
              label_0075:
                s_mul_i32     s14, s17, s13                           // 0000000001D4: 930E0D11
                s_waitcnt     lgkmcnt(0)                              // 0000000001D8: BF8C007F
                s_sub_i32     s4, s4, s14                             // 0000000001DC: 81840E04
                s_cmp_lt_u32  s8, s0                                  // 0000000001E0: BF0A0008
                s_cselect_b32  s0, s11, s1                            // 0000000001E4: 8500010B
                s_cmp_lt_u32  s9, s2                                  // 0000000001E8: BF0A0209
                s_cselect_b32  s1, s12, s3                            // 0000000001EC: 8501030C
                s_cmp_lt_u32  s10, s17                                // 0000000001F0: BF0A110A
                s_cselect_b32  s2, s13, s4                            // 0000000001F4: 8502040D
                s_mul_i32     s3, s11, s8                             // 0000000001F8: 9303080B
                s_load_dwordx2  s[4:5], s[6:7], 0x00                  // 0000000001FC: C0420700
                v_cmp_ne_u32  s[8:9], 1, 0                            // 000000000200: D18A0008 00010081
                s_mul_i32     s0, s1, s0                              // 000000000208: 93000001
                s_mul_i32     s0, s0, s2                              // 00000000020C: 93000200
                s_waitcnt     lgkmcnt(0)                              // 000000000210: BF8C007F
                s_add_u32     s1, s3, s4                              // 000000000214: 80010403
                s_add_u32     s0, s0, 63                              // 000000000218: 8000BF00
                s_bcnt1_i32_b64  s2, s[8:9]                           // 00000000021C: BE821008
                v_add_i32     v2, vcc, s1, v0                         // 000000000220: 4A040001
                s_lshr_b32    s0, s0, 6                               // 000000000224: 90008600
                v_mbcnt_lo_u32_b32  v3, -1, 0                         // 000000000228: D2460003 000100C1
                v_mbcnt_hi_u32_b32  v3, -1, v3 vop3                   // 000000000230: D2480003 000206C1
                v_bcnt_u32_b32  v4, s2, 0                             // 000000000238: D2440004 00010002
                v_mov_b32     v5, 1                                   // 000000000240: 7E0A0281
                v_cmp_eq_i32  vcc, v4, v5                             // 000000000244: 7D040B04
                s_cbranch_vccz  label_011E                            // 000000000248: BF86008B
                v_xor_b32     v5, 1, v3                               // 00000000024C: 3A0A0681
                v_lshl_b64    v[6:7], 1, v5                           // 000000000250: D2C20006 00020A81
                v_and_b32     v6, exec_lo, v6                         // 000000000258: 360C0C7E
                v_and_b32     v7, exec_hi, v7                         // 00000000025C: 360E0E7F
                v_cmp_ne_u64  s[4:5], 0, v[6:7]                       // 000000000260: D1CA0004 00020C80
                s_mov_b64     s[8:9], exec                            // 000000000268: BE88047E
                s_mov_b64     s[10:11], exec                          // 00000000026C: BE8A047E
              label_009C:
                v_readfirstlane_b32  s1, v5                           // 000000000270: 7E020505
                v_cmp_eq_u32  vcc, s1, v5                             // 000000000274: 7D840A01
                s_and_saveexec_b64  s[12:13], vcc                     // 000000000278: BE8C246A
                v_readlane_b32  s1, v2, s1                            // 00000000027C: 02020302
                s_cbranch_execz  label_00A4                           // 000000000280: BF880003
                v_mov_b32     v5, s1                                  // 000000000284: 7E0A0201
                s_andn2_b64   s[10:11], s[10:11], exec                // 000000000288: 8A8A7E0A
                s_cbranch_scc0  label_00A6                            // 00000000028C: BF840002
              label_00A4:
                s_and_b64     exec, s[12:13], s[10:11]                // 000000000290: 87FE0A0C
                s_branch      label_009C                              // 000000000294: BF82FFF6
              label_00A6:
                s_mov_b64     exec, s[8:9]                            // 000000000298: BEFE0408
                v_cndmask_b32  v5, 0, v5, s[4:5]                      // 00000000029C: D2000005 00120A80
                v_add_i32     v2, vcc, v5, v2                         // 0000000002A4: 4A040505
                v_xor_b32     v5, 2, v3                               // 0000000002A8: 3A0A0682
                v_lshl_b64    v[6:7], 1, v5                           // 0000000002AC: D2C20006 00020A81
                v_and_b32     v6, exec_lo, v6                         // 0000000002B4: 360C0C7E
                v_and_b32     v7, exec_hi, v7                         // 0000000002B8: 360E0E7F
                v_cmp_ne_u64  s[4:5], 0, v[6:7]                       // 0000000002BC: D1CA0004 00020C80
                s_mov_b64     s[8:9], exec                            // 0000000002C4: BE88047E
                s_mov_b64     s[10:11], exec                          // 0000000002C8: BE8A047E
              label_00B3:
                v_readfirstlane_b32  s1, v5                           // 0000000002CC: 7E020505
                v_cmp_eq_u32  vcc, s1, v5                             // 0000000002D0: 7D840A01
                s_and_saveexec_b64  s[12:13], vcc                     // 0000000002D4: BE8C246A
                v_readlane_b32  s1, v2, s1                            // 0000000002D8: 02020302
                s_cbranch_execz  label_00BB                           // 0000000002DC: BF880003
                v_mov_b32     v5, s1                                  // 0000000002E0: 7E0A0201
                s_andn2_b64   s[10:11], s[10:11], exec                // 0000000002E4: 8A8A7E0A
                s_cbranch_scc0  label_00BD                            // 0000000002E8: BF840002
              label_00BB:
                s_and_b64     exec, s[12:13], s[10:11]                // 0000000002EC: 87FE0A0C
                s_branch      label_00B3                              // 0000000002F0: BF82FFF6
              label_00BD:
                s_mov_b64     exec, s[8:9]                            // 0000000002F4: BEFE0408
                v_cndmask_b32  v5, 0, v5, s[4:5]                      // 0000000002F8: D2000005 00120A80
                v_add_i32     v2, vcc, v5, v2                         // 000000000300: 4A040505
                v_xor_b32     v5, 4, v3                               // 000000000304: 3A0A0684
                v_lshl_b64    v[6:7], 1, v5                           // 000000000308: D2C20006 00020A81
                v_and_b32     v6, exec_lo, v6                         // 000000000310: 360C0C7E
                v_and_b32     v7, exec_hi, v7                         // 000000000314: 360E0E7F
                v_cmp_ne_u64  s[4:5], 0, v[6:7]                       // 000000000318: D1CA0004 00020C80
                s_mov_b64     s[8:9], exec                            // 000000000320: BE88047E
                s_mov_b64     s[10:11], exec                          // 000000000324: BE8A047E
              label_00CA:
                v_readfirstlane_b32  s1, v5                           // 000000000328: 7E020505
                v_cmp_eq_u32  vcc, s1, v5                             // 00000000032C: 7D840A01
                s_and_saveexec_b64  s[12:13], vcc                     // 000000000330: BE8C246A
                v_readlane_b32  s1, v2, s1                            // 000000000334: 02020302
                s_cbranch_execz  label_00D2                           // 000000000338: BF880003
                v_mov_b32     v5, s1                                  // 00000000033C: 7E0A0201
                s_andn2_b64   s[10:11], s[10:11], exec                // 000000000340: 8A8A7E0A
                s_cbranch_scc0  label_00D4                            // 000000000344: BF840002
              label_00D2:
                s_and_b64     exec, s[12:13], s[10:11]                // 000000000348: 87FE0A0C
                s_branch      label_00CA                              // 00000000034C: BF82FFF6
              label_00D4:
                s_mov_b64     exec, s[8:9]                            // 000000000350: BEFE0408
                v_cndmask_b32  v5, 0, v5, s[4:5]                      // 000000000354: D2000005 00120A80
                v_add_i32     v2, vcc, v5, v2                         // 00000000035C: 4A040505
                v_xor_b32     v5, 32, v3                              // 000000000360: 3A0A06A0
                v_xor_b32     v6, 16, v3                              // 000000000364: 3A0C0690
                v_xor_b32     v7, 8, v3                               // 000000000368: 3A0E0688
                v_lshl_b64    v[8:9], 1, v7                           // 00000000036C: D2C20008 00020E81
                v_and_b32     v8, exec_lo, v8                         // 000000000374: 3610107E
                v_and_b32     v9, exec_hi, v9                         // 000000000378: 3612127F
                v_cmp_ne_u64  s[4:5], 0, v[8:9]                       // 00000000037C: D1CA0004 00021080
                s_mov_b64     s[8:9], exec                            // 000000000384: BE88047E
                s_mov_b64     s[10:11], exec                          // 000000000388: BE8A047E
              label_00E3:
                v_readfirstlane_b32  s1, v7                           // 00000000038C: 7E020507
                v_cmp_eq_u32  vcc, s1, v7                             // 000000000390: 7D840E01
                s_and_saveexec_b64  s[12:13], vcc                     // 000000000394: BE8C246A
                v_readlane_b32  s1, v2, s1                            // 000000000398: 02020302
                s_cbranch_execz  label_00EB                           // 00000000039C: BF880003
                v_mov_b32     v7, s1                                  // 0000000003A0: 7E0E0201
                s_andn2_b64   s[10:11], s[10:11], exec                // 0000000003A4: 8A8A7E0A
                s_cbranch_scc0  label_00ED                            // 0000000003A8: BF840002
              label_00EB:
                s_and_b64     exec, s[12:13], s[10:11]                // 0000000003AC: 87FE0A0C
                s_branch      label_00E3                              // 0000000003B0: BF82FFF6
              label_00ED:
                s_mov_b64     exec, s[8:9]                            // 0000000003B4: BEFE0408
                v_cndmask_b32  v7, 0, v7, s[4:5]                      // 0000000003B8: D2000007 00120E80
                v_add_i32     v2, vcc, v7, v2                         // 0000000003C0: 4A040507
                v_lshl_b64    v[7:8], 1, v6                           // 0000000003C4: D2C20007 00020C81
                v_and_b32     v7, exec_lo, v7                         // 0000000003CC: 360E0E7E
                v_and_b32     v8, exec_hi, v8                         // 0000000003D0: 3610107F
                v_cmp_ne_u64  s[4:5], 0, v[7:8]                       // 0000000003D4: D1CA0004 00020E80
                s_mov_b64     s[8:9], exec                            // 0000000003DC: BE88047E
                s_mov_b64     s[10:11], exec                          // 0000000003E0: BE8A047E
              label_00F9:
                v_readfirstlane_b32  s1, v6                           // 0000000003E4: 7E020506
                v_cmp_eq_u32  vcc, s1, v6                             // 0000000003E8: 7D840C01
                s_and_saveexec_b64  s[12:13], vcc                     // 0000000003EC: BE8C246A
                v_readlane_b32  s1, v2, s1                            // 0000000003F0: 02020302
                s_cbranch_execz  label_0101                           // 0000000003F4: BF880003
                v_mov_b32     v6, s1                                  // 0000000003F8: 7E0C0201
                s_andn2_b64   s[10:11], s[10:11], exec                // 0000000003FC: 8A8A7E0A
                s_cbranch_scc0  label_0103                            // 000000000400: BF840002
              label_0101:
                s_and_b64     exec, s[12:13], s[10:11]                // 000000000404: 87FE0A0C
                s_branch      label_00F9                              // 000000000408: BF82FFF6
              label_0103:
                s_mov_b64     exec, s[8:9]                            // 00000000040C: BEFE0408
                v_cndmask_b32  v6, 0, v6, s[4:5]                      // 000000000410: D2000006 00120C80
                v_add_i32     v2, vcc, v6, v2                         // 000000000418: 4A040506
                v_lshl_b64    v[6:7], 1, v5                           // 00000000041C: D2C20006 00020A81
                v_and_b32     v6, exec_lo, v6                         // 000000000424: 360C0C7E
                v_and_b32     v7, exec_hi, v7                         // 000000000428: 360E0E7F
                v_cmp_ne_u64  s[4:5], 0, v[6:7]                       // 00000000042C: D1CA0004 00020C80
                s_mov_b64     s[8:9], exec                            // 000000000434: BE88047E
                s_mov_b64     s[10:11], exec                          // 000000000438: BE8A047E
              label_010F:
                v_readfirstlane_b32  s1, v5                           // 00000000043C: 7E020505
                v_cmp_eq_u32  vcc, s1, v5                             // 000000000440: 7D840A01
                s_and_saveexec_b64  s[12:13], vcc                     // 000000000444: BE8C246A
                v_readlane_b32  s1, v2, s1                            // 000000000448: 02020302
                s_cbranch_execz  label_0117                           // 00000000044C: BF880003
                v_mov_b32     v5, s1                                  // 000000000450: 7E0A0201
                s_andn2_b64   s[10:11], s[10:11], exec                // 000000000454: 8A8A7E0A
                s_cbranch_scc0  label_0119                            // 000000000458: BF840002
              label_0117:
                s_and_b64     exec, s[12:13], s[10:11]                // 00000000045C: 87FE0A0C
                s_branch      label_010F                              // 000000000460: BF82FFF6
              label_0119:
                s_mov_b64     exec, s[8:9]                            // 000000000464: BEFE0408
                v_cndmask_b32  v5, 0, v5, s[4:5]                      // 000000000468: D2000005 00120A80
                v_add_i32     v7, vcc, v5, v2                         // 000000000470: 4A0E0505
                s_branch      label_01C4                              // 000000000474: BF8200A6
              label_011E:
                v_cmp_eq_i32  s[4:5], v3, 0                           // 000000000478: D1040004 00010103
                v_add_i32     v5, vcc, 63, v3                         // 000000000480: 4A0A06BF
                v_and_b32     v5, 63, v5                              // 000000000484: 360A0ABF
                v_lshl_b64    v[6:7], 1, v5                           // 000000000488: D2C20006 00020A81
                v_and_b32     v6, exec_lo, v6                         // 000000000490: 360C0C7E
                v_and_b32     v7, exec_hi, v7                         // 000000000494: 360E0E7F
                v_cmp_ne_u64  vcc, 0, v[6:7]                          // 000000000498: 7DCA0C80
                s_orn2_b64    s[4:5], s[4:5], vcc                     // 00000000049C: 8B846A04
                s_mov_b64     s[8:9], exec                            // 0000000004A0: BE88047E
                s_mov_b64     s[10:11], exec                          // 0000000004A4: BE8A047E
              label_012A:
                v_readfirstlane_b32  s1, v5                           // 0000000004A8: 7E020505
                v_cmp_eq_u32  vcc, s1, v5                             // 0000000004AC: 7D840A01
                s_and_saveexec_b64  s[12:13], vcc                     // 0000000004B0: BE8C246A
                v_readlane_b32  s1, v2, s1                            // 0000000004B4: 02020302
                s_cbranch_execz  label_0132                           // 0000000004B8: BF880003
                v_mov_b32     v5, s1                                  // 0000000004BC: 7E0A0201
                s_andn2_b64   s[10:11], s[10:11], exec                // 0000000004C0: 8A8A7E0A
                s_cbranch_scc0  label_0134                            // 0000000004C4: BF840002
              label_0132:
                s_and_b64     exec, s[12:13], s[10:11]                // 0000000004C8: 87FE0A0C
                s_branch      label_012A                              // 0000000004CC: BF82FFF6
              label_0134:
                s_mov_b64     exec, s[8:9]                            // 0000000004D0: BEFE0408
                v_cndmask_b32  v5, v5, 0, s[4:5]                      // 0000000004D4: D2000005 00110105
                v_add_i32     v2, vcc, v5, v2                         // 0000000004DC: 4A040505
                v_cmp_lt_u32  s[4:5], v3, 2                           // 0000000004E0: D1820004 00010503
                v_add_i32     v5, vcc, 62, v3                         // 0000000004E8: 4A0A06BE
                v_and_b32     v5, 63, v5                              // 0000000004EC: 360A0ABF
                v_lshl_b64    v[6:7], 1, v5                           // 0000000004F0: D2C20006 00020A81
                v_and_b32     v6, exec_lo, v6                         // 0000000004F8: 360C0C7E
                v_and_b32     v7, exec_hi, v7                         // 0000000004FC: 360E0E7F
                v_cmp_ne_u64  vcc, 0, v[6:7]                          // 000000000500: 7DCA0C80
                s_orn2_b64    s[4:5], s[4:5], vcc                     // 000000000504: 8B846A04
                s_mov_b64     s[8:9], exec                            // 000000000508: BE88047E
                s_mov_b64     s[10:11], exec                          // 00000000050C: BE8A047E
              label_0144:
                v_readfirstlane_b32  s1, v5                           // 000000000510: 7E020505
                v_cmp_eq_u32  vcc, s1, v5                             // 000000000514: 7D840A01
                s_and_saveexec_b64  s[12:13], vcc                     // 000000000518: BE8C246A
                v_readlane_b32  s1, v2, s1                            // 00000000051C: 02020302
                s_cbranch_execz  label_014C                           // 000000000520: BF880003
                v_mov_b32     v5, s1                                  // 000000000524: 7E0A0201
                s_andn2_b64   s[10:11], s[10:11], exec                // 000000000528: 8A8A7E0A
                s_cbranch_scc0  label_014E                            // 00000000052C: BF840002
              label_014C:
                s_and_b64     exec, s[12:13], s[10:11]                // 000000000530: 87FE0A0C
                s_branch      label_0144                              // 000000000534: BF82FFF6
              label_014E:
                s_mov_b64     exec, s[8:9]                            // 000000000538: BEFE0408
                v_cndmask_b32  v5, v5, 0, s[4:5]                      // 00000000053C: D2000005 00110105
                v_add_i32     v2, vcc, v5, v2                         // 000000000544: 4A040505
                v_cmp_lt_u32  s[4:5], v3, 4                           // 000000000548: D1820004 00010903
                v_add_i32     v5, vcc, 60, v3                         // 000000000550: 4A0A06BC
                v_and_b32     v5, 63, v5                              // 000000000554: 360A0ABF
                v_lshl_b64    v[6:7], 1, v5                           // 000000000558: D2C20006 00020A81
                v_and_b32     v6, exec_lo, v6                         // 000000000560: 360C0C7E
                v_and_b32     v7, exec_hi, v7                         // 000000000564: 360E0E7F
                v_cmp_ne_u64  vcc, 0, v[6:7]                          // 000000000568: 7DCA0C80
                s_orn2_b64    s[4:5], s[4:5], vcc                     // 00000000056C: 8B846A04
                s_mov_b64     s[8:9], exec                            // 000000000570: BE88047E
                s_mov_b64     s[10:11], exec                          // 000000000574: BE8A047E
              label_015E:
                v_readfirstlane_b32  s1, v5                           // 000000000578: 7E020505
                v_cmp_eq_u32  vcc, s1, v5                             // 00000000057C: 7D840A01
                s_and_saveexec_b64  s[12:13], vcc                     // 000000000580: BE8C246A
                v_readlane_b32  s1, v2, s1                            // 000000000584: 02020302
                s_cbranch_execz  label_0166                           // 000000000588: BF880003
                v_mov_b32     v5, s1                                  // 00000000058C: 7E0A0201
                s_andn2_b64   s[10:11], s[10:11], exec                // 000000000590: 8A8A7E0A
                s_cbranch_scc0  label_0168                            // 000000000594: BF840002
              label_0166:
                s_and_b64     exec, s[12:13], s[10:11]                // 000000000598: 87FE0A0C
                s_branch      label_015E                              // 00000000059C: BF82FFF6
              label_0168:
                s_mov_b64     exec, s[8:9]                            // 0000000005A0: BEFE0408
                v_cndmask_b32  v5, v5, 0, s[4:5]                      // 0000000005A4: D2000005 00110105
                v_add_i32     v6, vcc, 56, v3                         // 0000000005AC: 4A0C06B8
                v_add_i32     v2, vcc, v5, v2                         // 0000000005B0: 4A040505
                v_cmp_lt_u32  s[4:5], v3, 8                           // 0000000005B4: D1820004 00011103
                v_and_b32     v5, 63, v6                              // 0000000005BC: 360A0CBF
                v_lshl_b64    v[6:7], 1, v5                           // 0000000005C0: D2C20006 00020A81
                v_and_b32     v6, exec_lo, v6                         // 0000000005C8: 360C0C7E
                v_and_b32     v7, exec_hi, v7                         // 0000000005CC: 360E0E7F
                v_cmp_ne_u64  vcc, 0, v[6:7]                          // 0000000005D0: 7DCA0C80
                s_orn2_b64    s[4:5], s[4:5], vcc                     // 0000000005D4: 8B846A04
                s_mov_b64     s[8:9], exec                            // 0000000005D8: BE88047E
                s_mov_b64     s[10:11], exec                          // 0000000005DC: BE8A047E
              label_0178:
                v_readfirstlane_b32  s1, v5                           // 0000000005E0: 7E020505
                v_cmp_eq_u32  vcc, s1, v5                             // 0000000005E4: 7D840A01
                s_and_saveexec_b64  s[12:13], vcc                     // 0000000005E8: BE8C246A
                v_readlane_b32  s1, v2, s1                            // 0000000005EC: 02020302
                s_cbranch_execz  label_0180                           // 0000000005F0: BF880003
                v_mov_b32     v5, s1                                  // 0000000005F4: 7E0A0201
                s_andn2_b64   s[10:11], s[10:11], exec                // 0000000005F8: 8A8A7E0A
                s_cbranch_scc0  label_0182                            // 0000000005FC: BF840002
              label_0180:
                s_and_b64     exec, s[12:13], s[10:11]                // 000000000600: 87FE0A0C
                s_branch      label_0178                              // 000000000604: BF82FFF6
              label_0182:
                s_mov_b64     exec, s[8:9]                            // 000000000608: BEFE0408
                v_cndmask_b32  v5, v5, 0, s[4:5]                      // 00000000060C: D2000005 00110105
                v_add_i32     v6, vcc, 48, v3                         // 000000000614: 4A0C06B0
                v_and_b32     v6, 63, v6                              // 000000000618: 360C0CBF
                v_add_i32     v2, vcc, v5, v2                         // 00000000061C: 4A040505
                v_cmp_lt_u32  s[4:5], v3, 16                          // 000000000620: D1820004 00012103
                v_lshl_b64    v[7:8], 1, v6                           // 000000000628: D2C20007 00020C81
                v_and_b32     v7, exec_lo, v7                         // 000000000630: 360E0E7E
                v_and_b32     v8, exec_hi, v8                         // 000000000634: 3610107F
                v_cmp_ne_u64  vcc, 0, v[7:8]                          // 000000000638: 7DCA0E80
                s_orn2_b64    s[4:5], s[4:5], vcc                     // 00000000063C: 8B846A04
                s_mov_b64     s[8:9], exec                            // 000000000640: BE88047E
                s_mov_b64     s[10:11], exec                          // 000000000644: BE8A047E
              label_0192:
                v_readfirstlane_b32  s1, v6                           // 000000000648: 7E020506
                v_cmp_eq_u32  vcc, s1, v6                             // 00000000064C: 7D840C01
                s_and_saveexec_b64  s[12:13], vcc                     // 000000000650: BE8C246A
                v_readlane_b32  s1, v2, s1                            // 000000000654: 02020302
                s_cbranch_execz  label_019A                           // 000000000658: BF880003
                v_mov_b32     v5, s1                                  // 00000000065C: 7E0A0201
                s_andn2_b64   s[10:11], s[10:11], exec                // 000000000660: 8A8A7E0A
                s_cbranch_scc0  label_019C                            // 000000000664: BF840002
              label_019A:
                s_and_b64     exec, s[12:13], s[10:11]                // 000000000668: 87FE0A0C
                s_branch      label_0192                              // 00000000066C: BF82FFF6
              label_019C:
                s_mov_b64     exec, s[8:9]                            // 000000000670: BEFE0408
                v_cndmask_b32  v5, v5, 0, s[4:5]                      // 000000000674: D2000005 00110105
                v_add_i32     v6, vcc, 32, v3                         // 00000000067C: 4A0C06A0
                v_and_b32     v6, 63, v6                              // 000000000680: 360C0CBF
                v_add_i32     v2, vcc, v5, v2                         // 000000000684: 4A040505
                v_cmp_lt_u32  s[4:5], v3, 32                          // 000000000688: D1820004 00014103
                v_lshl_b64    v[7:8], 1, v6                           // 000000000690: D2C20007 00020C81
                v_and_b32     v7, exec_lo, v7                         // 000000000698: 360E0E7E
                v_and_b32     v8, exec_hi, v8                         // 00000000069C: 3610107F
                v_cmp_ne_u64  vcc, 0, v[7:8]                          // 0000000006A0: 7DCA0E80
                s_orn2_b64    s[4:5], s[4:5], vcc                     // 0000000006A4: 8B846A04
                s_mov_b64     s[8:9], exec                            // 0000000006A8: BE88047E
                s_mov_b64     s[10:11], exec                          // 0000000006AC: BE8A047E
              label_01AC:
                v_readfirstlane_b32  s1, v6                           // 0000000006B0: 7E020506
                v_cmp_eq_u32  vcc, s1, v6                             // 0000000006B4: 7D840C01
                s_and_saveexec_b64  s[12:13], vcc                     // 0000000006B8: BE8C246A
                v_readlane_b32  s1, v2, s1                            // 0000000006BC: 02020302
                s_cbranch_execz  label_01B4                           // 0000000006C0: BF880003
                v_mov_b32     v5, s1                                  // 0000000006C4: 7E0A0201
                s_andn2_b64   s[10:11], s[10:11], exec                // 0000000006C8: 8A8A7E0A
                s_cbranch_scc0  label_01B6                            // 0000000006CC: BF840002
              label_01B4:
                s_and_b64     exec, s[12:13], s[10:11]                // 0000000006D0: 87FE0A0C
                s_branch      label_01AC                              // 0000000006D4: BF82FFF6
              label_01B6:
                s_mov_b64     exec, s[8:9]                            // 0000000006D8: BEFE0408
                v_cndmask_b32  v5, v5, 0, s[4:5]                      // 0000000006DC: D2000005 00110105
                s_add_u32     s1, s2, -1                              // 0000000006E4: 8001C102
                v_add_i32     v2, vcc, v5, v2                         // 0000000006E8: 4A040505
                v_mov_b32     v5, s1                                  // 0000000006EC: 7E0A0201
                v_lshl_b64    v[5:6], 1, v5                           // 0000000006F0: D2C20005 00020A81
                v_and_b32     v5, exec_lo, v5                         // 0000000006F8: 360A0A7E
                v_and_b32     v6, exec_hi, v6                         // 0000000006FC: 360C0C7F
                v_cmp_ne_u64  vcc, 0, v[5:6]                          // 000000000700: 7DCA0A80
                v_readlane_b32  s1, v2, s1                            // 000000000704: 02020302
                v_mov_b32     v2, s1                                  // 000000000708: 7E040201
                v_cndmask_b32  v7, 0, v2, vcc                         // 00000000070C: 000E0480
              label_01C4:
                s_cmp_eq_i32  s0, 1                                   // 000000000710: BF008100
                s_cbranch_scc1  label_0320                            // 000000000714: BF85015A
                v_lshrrev_b32  v5, 6, v1                              // 000000000718: 2C0A0286
                v_and_b32     v1, 63, v1                              // 00000000071C: 360202BF
                v_cmp_ne_i32  s[4:5], v1, 0                           // 000000000720: D10A0004 00010101
                s_mov_b64     s[8:9], exec                            // 000000000728: BE88047E
                s_andn2_b64   exec, s[8:9], s[4:5]                    // 00000000072C: 8AFE0408
                v_lshlrev_b32  v6, 2, v5                              // 000000000730: 340C0A82
                ds_write_b32  v6, v7                                  // 000000000734: D8340000 00000706
                s_mov_b64     exec, s[8:9]                            // 00000000073C: BEFE0408
                s_waitcnt     lgkmcnt(0)                              // 000000000740: BF8C007F
                s_barrier                                             // 000000000744: BF8A0000
                v_cmp_eq_i32  vcc, 0, v5                              // 000000000748: 7D040A80
                s_and_b64     exec, s[8:9], vcc                       // 00000000074C: 87FE6A08
                v_cmp_gt_u32  vcc, s0, v1                             // 000000000750: 7D880200
                s_cbranch_execz  label_0318                           // 000000000754: BF880142
                s_and_saveexec_b64  s[0:1], vcc                       // 000000000758: BE80246A
                v_lshlrev_b32  v1, 2, v1                              // 00000000075C: 34020282
                ds_read_b32   v1, v1                                  // 000000000760: D8D80000 01000001
                s_andn2_b64   exec, s[0:1], exec                      // 000000000768: 8AFE7E00
                v_mov_b32     v1, 0                                   // 00000000076C: 7E020280
                s_mov_b64     exec, s[0:1]                            // 000000000770: BEFE0400
                v_mov_b32     v2, 1                                   // 000000000774: 7E040281
                v_cmp_eq_i32  vcc, v4, v2                             // 000000000778: 7D040504
                s_cbranch_vccz  label_026C                            // 00000000077C: BF86008C
                v_xor_b32     v2, 1, v3                               // 000000000780: 3A040681
                v_lshl_b64    v[4:5], 1, v2                           // 000000000784: D2C20004 00020481
                v_and_b32     v4, exec_lo, v4                         // 00000000078C: 3608087E
                v_and_b32     v5, exec_hi, v5                         // 000000000790: 360A0A7F
                v_cmp_ne_u64  s[0:1], 0, v[4:5]                       // 000000000794: D1CA0000 00020880
                s_mov_b64     s[2:3], exec                            // 00000000079C: BE82047E
                s_mov_b64     s[10:11], exec                          // 0000000007A0: BE8A047E
              label_01E9:
                v_readfirstlane_b32  s12, v2                          // 0000000007A4: 7E180502
                v_cmp_eq_u32  vcc, s12, v2                            // 0000000007A8: 7D84040C
                s_and_saveexec_b64  s[14:15], vcc                     // 0000000007AC: BE8E246A
                s_waitcnt     lgkmcnt(0)                              // 0000000007B0: BF8C007F
                v_readlane_b32  s12, v1, s12                          // 0000000007B4: 02181901
                s_cbranch_execz  label_01F2                           // 0000000007B8: BF880003
                v_mov_b32     v2, s12                                 // 0000000007BC: 7E04020C
                s_andn2_b64   s[10:11], s[10:11], exec                // 0000000007C0: 8A8A7E0A
                s_cbranch_scc0  label_01F4                            // 0000000007C4: BF840002
              label_01F2:
                s_mov_b64     exec, s[10:11]                          // 0000000007C8: BEFE040A
                s_branch      label_01E9                              // 0000000007CC: BF82FFF5
              label_01F4:
                s_mov_b64     exec, s[2:3]                            // 0000000007D0: BEFE0402
                v_cndmask_b32  v2, 0, v2, s[0:1]                      // 0000000007D4: D2000002 00020480
                v_add_i32     v1, vcc, v2, v1                         // 0000000007DC: 4A020302
                v_xor_b32     v2, 2, v3                               // 0000000007E0: 3A040682
                v_lshl_b64    v[4:5], 1, v2                           // 0000000007E4: D2C20004 00020481
                v_and_b32     v4, exec_lo, v4                         // 0000000007EC: 3608087E
                v_and_b32     v5, exec_hi, v5                         // 0000000007F0: 360A0A7F
                v_cmp_ne_u64  s[0:1], 0, v[4:5]                       // 0000000007F4: D1CA0000 00020880
                s_mov_b64     s[2:3], exec                            // 0000000007FC: BE82047E
                s_mov_b64     s[10:11], exec                          // 000000000800: BE8A047E
              label_0201:
                v_readfirstlane_b32  s12, v2                          // 000000000804: 7E180502
                v_cmp_eq_u32  vcc, s12, v2                            // 000000000808: 7D84040C
                s_and_saveexec_b64  s[14:15], vcc                     // 00000000080C: BE8E246A
                v_readlane_b32  s12, v1, s12                          // 000000000810: 02181901
                s_cbranch_execz  label_0209                           // 000000000814: BF880003
                v_mov_b32     v2, s12                                 // 000000000818: 7E04020C
                s_andn2_b64   s[10:11], s[10:11], exec                // 00000000081C: 8A8A7E0A
                s_cbranch_scc0  label_020B                            // 000000000820: BF840002
              label_0209:
                s_mov_b64     exec, s[10:11]                          // 000000000824: BEFE040A
                s_branch      label_0201                              // 000000000828: BF82FFF6
              label_020B:
                s_mov_b64     exec, s[2:3]                            // 00000000082C: BEFE0402
                v_cndmask_b32  v2, 0, v2, s[0:1]                      // 000000000830: D2000002 00020480
                v_add_i32     v1, vcc, v2, v1                         // 000000000838: 4A020302
                v_xor_b32     v2, 4, v3                               // 00000000083C: 3A040684
                v_lshl_b64    v[4:5], 1, v2                           // 000000000840: D2C20004 00020481
                v_and_b32     v4, exec_lo, v4                         // 000000000848: 3608087E
                v_and_b32     v5, exec_hi, v5                         // 00000000084C: 360A0A7F
                v_cmp_ne_u64  s[0:1], 0, v[4:5]                       // 000000000850: D1CA0000 00020880
                s_mov_b64     s[2:3], exec                            // 000000000858: BE82047E
                s_mov_b64     s[10:11], exec                          // 00000000085C: BE8A047E
              label_0218:
                v_readfirstlane_b32  s12, v2                          // 000000000860: 7E180502
                v_cmp_eq_u32  vcc, s12, v2                            // 000000000864: 7D84040C
                s_and_saveexec_b64  s[14:15], vcc                     // 000000000868: BE8E246A
                v_readlane_b32  s12, v1, s12                          // 00000000086C: 02181901
                s_cbranch_execz  label_0220                           // 000000000870: BF880003
                v_mov_b32     v2, s12                                 // 000000000874: 7E04020C
                s_andn2_b64   s[10:11], s[10:11], exec                // 000000000878: 8A8A7E0A
                s_cbranch_scc0  label_0222                            // 00000000087C: BF840002
              label_0220:
                s_mov_b64     exec, s[10:11]                          // 000000000880: BEFE040A
                s_branch      label_0218                              // 000000000884: BF82FFF6
              label_0222:
                s_mov_b64     exec, s[2:3]                            // 000000000888: BEFE0402
                v_cndmask_b32  v2, 0, v2, s[0:1]                      // 00000000088C: D2000002 00020480
                v_add_i32     v1, vcc, v2, v1                         // 000000000894: 4A020302
                v_xor_b32     v2, 32, v3                              // 000000000898: 3A0406A0
                v_xor_b32     v4, 16, v3                              // 00000000089C: 3A080690
                v_xor_b32     v3, 8, v3                               // 0000000008A0: 3A060688
                v_lshl_b64    v[5:6], 1, v3                           // 0000000008A4: D2C20005 00020681
                v_and_b32     v5, exec_lo, v5                         // 0000000008AC: 360A0A7E
                v_and_b32     v6, exec_hi, v6                         // 0000000008B0: 360C0C7F
                v_cmp_ne_u64  s[0:1], 0, v[5:6]                       // 0000000008B4: D1CA0000 00020A80
                s_mov_b64     s[2:3], exec                            // 0000000008BC: BE82047E
                s_mov_b64     s[10:11], exec                          // 0000000008C0: BE8A047E
              label_0231:
                v_readfirstlane_b32  s12, v3                          // 0000000008C4: 7E180503
                v_cmp_eq_u32  vcc, s12, v3                            // 0000000008C8: 7D84060C
                s_and_saveexec_b64  s[14:15], vcc                     // 0000000008CC: BE8E246A
                v_readlane_b32  s12, v1, s12                          // 0000000008D0: 02181901
                s_cbranch_execz  label_0239                           // 0000000008D4: BF880003
                v_mov_b32     v3, s12                                 // 0000000008D8: 7E06020C
                s_andn2_b64   s[10:11], s[10:11], exec                // 0000000008DC: 8A8A7E0A
                s_cbranch_scc0  label_023B                            // 0000000008E0: BF840002
              label_0239:
                s_mov_b64     exec, s[10:11]                          // 0000000008E4: BEFE040A
                s_branch      label_0231                              // 0000000008E8: BF82FFF6
              label_023B:
                s_mov_b64     exec, s[2:3]                            // 0000000008EC: BEFE0402
                v_cndmask_b32  v3, 0, v3, s[0:1]                      // 0000000008F0: D2000003 00020680
                v_add_i32     v1, vcc, v3, v1                         // 0000000008F8: 4A020303
                v_lshl_b64    v[5:6], 1, v4                           // 0000000008FC: D2C20005 00020881
                v_and_b32     v5, exec_lo, v5                         // 000000000904: 360A0A7E
                v_and_b32     v6, exec_hi, v6                         // 000000000908: 360C0C7F
                v_cmp_ne_u64  s[0:1], 0, v[5:6]                       // 00000000090C: D1CA0000 00020A80
                s_mov_b64     s[2:3], exec                            // 000000000914: BE82047E
                s_mov_b64     s[10:11], exec                          // 000000000918: BE8A047E
              label_0247:
                v_readfirstlane_b32  s12, v4                          // 00000000091C: 7E180504
                v_cmp_eq_u32  vcc, s12, v4                            // 000000000920: 7D84080C
                s_and_saveexec_b64  s[14:15], vcc                     // 000000000924: BE8E246A
                v_readlane_b32  s12, v1, s12                          // 000000000928: 02181901
                s_cbranch_execz  label_024F                           // 00000000092C: BF880003
                v_mov_b32     v3, s12                                 // 000000000930: 7E06020C
                s_andn2_b64   s[10:11], s[10:11], exec                // 000000000934: 8A8A7E0A
                s_cbranch_scc0  label_0251                            // 000000000938: BF840002
              label_024F:
                s_mov_b64     exec, s[10:11]                          // 00000000093C: BEFE040A
                s_branch      label_0247                              // 000000000940: BF82FFF6
              label_0251:
                s_mov_b64     exec, s[2:3]                            // 000000000944: BEFE0402
                v_cndmask_b32  v3, 0, v3, s[0:1]                      // 000000000948: D2000003 00020680
                v_add_i32     v1, vcc, v3, v1                         // 000000000950: 4A020303
                v_lshl_b64    v[3:4], 1, v2                           // 000000000954: D2C20003 00020481
                v_and_b32     v3, exec_lo, v3                         // 00000000095C: 3606067E
                v_and_b32     v4, exec_hi, v4                         // 000000000960: 3608087F
                v_cmp_ne_u64  s[0:1], 0, v[3:4]                       // 000000000964: D1CA0000 00020680
                s_mov_b64     s[2:3], exec                            // 00000000096C: BE82047E
                s_mov_b64     s[10:11], exec                          // 000000000970: BE8A047E
              label_025D:
                v_readfirstlane_b32  s12, v2                          // 000000000974: 7E180502
                v_cmp_eq_u32  vcc, s12, v2                            // 000000000978: 7D84040C
                s_and_saveexec_b64  s[14:15], vcc                     // 00000000097C: BE8E246A
                v_readlane_b32  s12, v1, s12                          // 000000000980: 02181901
                s_cbranch_execz  label_0265                           // 000000000984: BF880003
                v_mov_b32     v2, s12                                 // 000000000988: 7E04020C
                s_andn2_b64   s[10:11], s[10:11], exec                // 00000000098C: 8A8A7E0A
                s_cbranch_scc0  label_0267                            // 000000000990: BF840002
              label_0265:
                s_mov_b64     exec, s[10:11]                          // 000000000994: BEFE040A
                s_branch      label_025D                              // 000000000998: BF82FFF6
              label_0267:
                s_mov_b64     exec, s[2:3]                            // 00000000099C: BEFE0402
                v_cndmask_b32  v2, 0, v2, s[0:1]                      // 0000000009A0: D2000002 00020480
                v_add_i32     v1, vcc, v2, v1                         // 0000000009A8: 4A020302
                s_branch      label_0313                              // 0000000009AC: BF8200A7
              label_026C:
                v_cmp_eq_i32  s[0:1], v3, 0                           // 0000000009B0: D1040000 00010103
                v_add_i32     v2, vcc, 63, v3                         // 0000000009B8: 4A0406BF
                v_and_b32     v2, 63, v2                              // 0000000009BC: 360404BF
                v_lshl_b64    v[4:5], 1, v2                           // 0000000009C0: D2C20004 00020481
                v_and_b32     v4, exec_lo, v4                         // 0000000009C8: 3608087E
                v_and_b32     v5, exec_hi, v5                         // 0000000009CC: 360A0A7F
                v_cmp_ne_u64  vcc, 0, v[4:5]                          // 0000000009D0: 7DCA0880
                s_orn2_b64    s[0:1], s[0:1], vcc                     // 0000000009D4: 8B806A00
                s_mov_b64     s[10:11], exec                          // 0000000009D8: BE8A047E
                s_mov_b64     s[12:13], exec                          // 0000000009DC: BE8C047E
              label_0278:
                v_readfirstlane_b32  s3, v2                           // 0000000009E0: 7E060502
                v_cmp_eq_u32  vcc, s3, v2                             // 0000000009E4: 7D840403
                s_and_saveexec_b64  s[14:15], vcc                     // 0000000009E8: BE8E246A
                s_waitcnt     lgkmcnt(0)                              // 0000000009EC: BF8C007F
                v_readlane_b32  s3, v1, s3                            // 0000000009F0: 02060701
                s_cbranch_execz  label_0281                           // 0000000009F4: BF880003
                v_mov_b32     v2, s3                                  // 0000000009F8: 7E040203
                s_andn2_b64   s[12:13], s[12:13], exec                // 0000000009FC: 8A8C7E0C
                s_cbranch_scc0  label_0283                            // 000000000A00: BF840002
              label_0281:
                s_mov_b64     exec, s[12:13]                          // 000000000A04: BEFE040C
                s_branch      label_0278                              // 000000000A08: BF82FFF5
              label_0283:
                s_mov_b64     exec, s[10:11]                          // 000000000A0C: BEFE040A
                v_cndmask_b32  v2, v2, 0, s[0:1]                      // 000000000A10: D2000002 00010102
                v_add_i32     v1, vcc, v2, v1                         // 000000000A18: 4A020302
                v_cmp_lt_u32  s[0:1], v3, 2                           // 000000000A1C: D1820000 00010503
                v_add_i32     v2, vcc, 62, v3                         // 000000000A24: 4A0406BE
                v_and_b32     v2, 63, v2                              // 000000000A28: 360404BF
                v_lshl_b64    v[4:5], 1, v2                           // 000000000A2C: D2C20004 00020481
                v_and_b32     v4, exec_lo, v4                         // 000000000A34: 3608087E
                v_and_b32     v5, exec_hi, v5                         // 000000000A38: 360A0A7F
                v_cmp_ne_u64  vcc, 0, v[4:5]                          // 000000000A3C: 7DCA0880
                s_orn2_b64    s[0:1], s[0:1], vcc                     // 000000000A40: 8B806A00
                s_mov_b64     s[10:11], exec                          // 000000000A44: BE8A047E
                s_mov_b64     s[12:13], exec                          // 000000000A48: BE8C047E
              label_0293:
                v_readfirstlane_b32  s3, v2                           // 000000000A4C: 7E060502
                v_cmp_eq_u32  vcc, s3, v2                             // 000000000A50: 7D840403
                s_and_saveexec_b64  s[14:15], vcc                     // 000000000A54: BE8E246A
                v_readlane_b32  s3, v1, s3                            // 000000000A58: 02060701
                s_cbranch_execz  label_029B                           // 000000000A5C: BF880003
                v_mov_b32     v2, s3                                  // 000000000A60: 7E040203
                s_andn2_b64   s[12:13], s[12:13], exec                // 000000000A64: 8A8C7E0C
                s_cbranch_scc0  label_029D                            // 000000000A68: BF840002
              label_029B:
                s_mov_b64     exec, s[12:13]                          // 000000000A6C: BEFE040C
                s_branch      label_0293                              // 000000000A70: BF82FFF6
              label_029D:
                s_mov_b64     exec, s[10:11]                          // 000000000A74: BEFE040A
                v_cndmask_b32  v2, v2, 0, s[0:1]                      // 000000000A78: D2000002 00010102
                v_add_i32     v1, vcc, v2, v1                         // 000000000A80: 4A020302
                v_cmp_lt_u32  s[0:1], v3, 4                           // 000000000A84: D1820000 00010903
                v_add_i32     v2, vcc, 60, v3                         // 000000000A8C: 4A0406BC
                v_and_b32     v2, 63, v2                              // 000000000A90: 360404BF
                v_lshl_b64    v[4:5], 1, v2                           // 000000000A94: D2C20004 00020481
                v_and_b32     v4, exec_lo, v4                         // 000000000A9C: 3608087E
                v_and_b32     v5, exec_hi, v5                         // 000000000AA0: 360A0A7F
                v_cmp_ne_u64  vcc, 0, v[4:5]                          // 000000000AA4: 7DCA0880
                s_orn2_b64    s[0:1], s[0:1], vcc                     // 000000000AA8: 8B806A00
                s_mov_b64     s[10:11], exec                          // 000000000AAC: BE8A047E
                s_mov_b64     s[12:13], exec                          // 000000000AB0: BE8C047E
              label_02AD:
                v_readfirstlane_b32  s3, v2                           // 000000000AB4: 7E060502
                v_cmp_eq_u32  vcc, s3, v2                             // 000000000AB8: 7D840403
                s_and_saveexec_b64  s[14:15], vcc                     // 000000000ABC: BE8E246A
                v_readlane_b32  s3, v1, s3                            // 000000000AC0: 02060701
                s_cbranch_execz  label_02B5                           // 000000000AC4: BF880003
                v_mov_b32     v2, s3                                  // 000000000AC8: 7E040203
                s_andn2_b64   s[12:13], s[12:13], exec                // 000000000ACC: 8A8C7E0C
                s_cbranch_scc0  label_02B7                            // 000000000AD0: BF840002
              label_02B5:
                s_mov_b64     exec, s[12:13]                          // 000000000AD4: BEFE040C
                s_branch      label_02AD                              // 000000000AD8: BF82FFF6
              label_02B7:
                s_mov_b64     exec, s[10:11]                          // 000000000ADC: BEFE040A
                v_cndmask_b32  v2, v2, 0, s[0:1]                      // 000000000AE0: D2000002 00010102
                v_add_i32     v4, vcc, 56, v3                         // 000000000AE8: 4A0806B8
                v_add_i32     v1, vcc, v2, v1                         // 000000000AEC: 4A020302
                v_cmp_lt_u32  s[0:1], v3, 8                           // 000000000AF0: D1820000 00011103
                v_and_b32     v2, 63, v4                              // 000000000AF8: 360408BF
                v_lshl_b64    v[4:5], 1, v2                           // 000000000AFC: D2C20004 00020481
                v_and_b32     v4, exec_lo, v4                         // 000000000B04: 3608087E
                v_and_b32     v5, exec_hi, v5                         // 000000000B08: 360A0A7F
                v_cmp_ne_u64  vcc, 0, v[4:5]                          // 000000000B0C: 7DCA0880
                s_orn2_b64    s[0:1], s[0:1], vcc                     // 000000000B10: 8B806A00
                s_mov_b64     s[10:11], exec                          // 000000000B14: BE8A047E
                s_mov_b64     s[12:13], exec                          // 000000000B18: BE8C047E
              label_02C7:
                v_readfirstlane_b32  s3, v2                           // 000000000B1C: 7E060502
                v_cmp_eq_u32  vcc, s3, v2                             // 000000000B20: 7D840403
                s_and_saveexec_b64  s[14:15], vcc                     // 000000000B24: BE8E246A
                v_readlane_b32  s3, v1, s3                            // 000000000B28: 02060701
                s_cbranch_execz  label_02CF                           // 000000000B2C: BF880003
                v_mov_b32     v2, s3                                  // 000000000B30: 7E040203
                s_andn2_b64   s[12:13], s[12:13], exec                // 000000000B34: 8A8C7E0C
                s_cbranch_scc0  label_02D1                            // 000000000B38: BF840002
              label_02CF:
                s_mov_b64     exec, s[12:13]                          // 000000000B3C: BEFE040C
                s_branch      label_02C7                              // 000000000B40: BF82FFF6
              label_02D1:
                s_mov_b64     exec, s[10:11]                          // 000000000B44: BEFE040A
                v_cndmask_b32  v2, v2, 0, s[0:1]                      // 000000000B48: D2000002 00010102
                v_add_i32     v4, vcc, 48, v3                         // 000000000B50: 4A0806B0
                v_and_b32     v4, 63, v4                              // 000000000B54: 360808BF
                v_add_i32     v1, vcc, v2, v1                         // 000000000B58: 4A020302
                v_cmp_lt_u32  s[0:1], v3, 16                          // 000000000B5C: D1820000 00012103
                v_lshl_b64    v[5:6], 1, v4                           // 000000000B64: D2C20005 00020881
                v_and_b32     v5, exec_lo, v5                         // 000000000B6C: 360A0A7E
                v_and_b32     v6, exec_hi, v6                         // 000000000B70: 360C0C7F
                v_cmp_ne_u64  vcc, 0, v[5:6]                          // 000000000B74: 7DCA0A80
                s_orn2_b64    s[0:1], s[0:1], vcc                     // 000000000B78: 8B806A00
                s_mov_b64     s[10:11], exec                          // 000000000B7C: BE8A047E
                s_mov_b64     s[12:13], exec                          // 000000000B80: BE8C047E
              label_02E1:
                v_readfirstlane_b32  s3, v4                           // 000000000B84: 7E060504
                v_cmp_eq_u32  vcc, s3, v4                             // 000000000B88: 7D840803
                s_and_saveexec_b64  s[14:15], vcc                     // 000000000B8C: BE8E246A
                v_readlane_b32  s3, v1, s3                            // 000000000B90: 02060701
                s_cbranch_execz  label_02E9                           // 000000000B94: BF880003
                v_mov_b32     v2, s3                                  // 000000000B98: 7E040203
                s_andn2_b64   s[12:13], s[12:13], exec                // 000000000B9C: 8A8C7E0C
                s_cbranch_scc0  label_02EB                            // 000000000BA0: BF840002
              label_02E9:
                s_mov_b64     exec, s[12:13]                          // 000000000BA4: BEFE040C
                s_branch      label_02E1                              // 000000000BA8: BF82FFF6
              label_02EB:
                s_mov_b64     exec, s[10:11]                          // 000000000BAC: BEFE040A
                v_cndmask_b32  v2, v2, 0, s[0:1]                      // 000000000BB0: D2000002 00010102
                v_add_i32     v4, vcc, 32, v3                         // 000000000BB8: 4A0806A0
                v_and_b32     v4, 63, v4                              // 000000000BBC: 360808BF
                v_add_i32     v1, vcc, v2, v1                         // 000000000BC0: 4A020302
                v_cmp_lt_u32  s[0:1], v3, 32                          // 000000000BC4: D1820000 00014103
                v_lshl_b64    v[2:3], 1, v4                           // 000000000BCC: D2C20002 00020881
                v_and_b32     v2, exec_lo, v2                         // 000000000BD4: 3604047E
                v_and_b32     v3, exec_hi, v3                         // 000000000BD8: 3606067F
                v_cmp_ne_u64  vcc, 0, v[2:3]                          // 000000000BDC: 7DCA0480
                s_orn2_b64    s[0:1], s[0:1], vcc                     // 000000000BE0: 8B806A00
                s_mov_b64     s[10:11], exec                          // 000000000BE4: BE8A047E
                s_mov_b64     s[12:13], exec                          // 000000000BE8: BE8C047E
              label_02FB:
                v_readfirstlane_b32  s3, v4                           // 000000000BEC: 7E060504
                v_cmp_eq_u32  vcc, s3, v4                             // 000000000BF0: 7D840803
                s_and_saveexec_b64  s[14:15], vcc                     // 000000000BF4: BE8E246A
                v_readlane_b32  s3, v1, s3                            // 000000000BF8: 02060701
                s_cbranch_execz  label_0303                           // 000000000BFC: BF880003
                v_mov_b32     v2, s3                                  // 000000000C00: 7E040203
                s_andn2_b64   s[12:13], s[12:13], exec                // 000000000C04: 8A8C7E0C
                s_cbranch_scc0  label_0305                            // 000000000C08: BF840002
              label_0303:
                s_mov_b64     exec, s[12:13]                          // 000000000C0C: BEFE040C
                s_branch      label_02FB                              // 000000000C10: BF82FFF6
              label_0305:
                s_mov_b64     exec, s[10:11]                          // 000000000C14: BEFE040A
                v_cndmask_b32  v2, v2, 0, s[0:1]                      // 000000000C18: D2000002 00010102
                s_add_u32     s0, s2, -1                              // 000000000C20: 8000C102
                v_add_i32     v1, vcc, v2, v1                         // 000000000C24: 4A020302
                v_mov_b32     v2, s0                                  // 000000000C28: 7E040200
                v_lshl_b64    v[2:3], 1, v2                           // 000000000C2C: D2C20002 00020481
                v_and_b32     v2, exec_lo, v2                         // 000000000C34: 3604047E
                v_and_b32     v3, exec_hi, v3                         // 000000000C38: 3606067F
                v_cmp_ne_u64  vcc, 0, v[2:3]                          // 000000000C3C: 7DCA0480
                v_readlane_b32  s0, v1, s0                            // 000000000C40: 02000101
                v_mov_b32     v1, s0                                  // 000000000C44: 7E020200
                v_cndmask_b32  v1, 0, v1, vcc                         // 000000000C48: 00020280
              label_0313:
                s_mov_b64     s[0:1], exec                            // 000000000C4C: BE80047E
                s_andn2_b64   exec, s[0:1], s[4:5]                    // 000000000C50: 8AFE0400
                v_mov_b32     v2, 0                                   // 000000000C54: 7E040280
                ds_write_b32  v2, v1                                  // 000000000C58: D8340000 00000102
              label_0318:
                s_mov_b64     exec, s[8:9]                            // 000000000C60: BEFE0408
                s_waitcnt     lgkmcnt(0)                              // 000000000C64: BF8C007F
                s_barrier                                             // 000000000C68: BF8A0000
                v_mov_b32     v1, 0                                   // 000000000C6C: 7E020280
                ds_read_b32   v7, v1                                  // 000000000C70: D8D80000 07000001
                s_waitcnt     lgkmcnt(0)                              // 000000000C78: BF8C007F
                s_barrier                                             // 000000000C7C: BF8A0000
              label_0320:
                v_cmp_eq_i32  vcc, 0, v0                              // 000000000C80: 7D040080
                s_and_saveexec_b64  s[0:1], vcc                       // 000000000C84: BE80246A
                s_cbranch_execz  label_0329                           // 000000000C88: BF880006
                s_load_dwordx2  s[2:3], s[6:7], 0x0c                  // 000000000C8C: C041070C
                s_waitcnt     lgkmcnt(0)                              // 000000000C90: BF8C007F
                v_mov_b32     v2, s2                                  // 000000000C94: 7E040202
                v_mov_b32     v3, s3                                  // 000000000C98: 7E060203
                flat_atomic_add  v0, v[2:3], v7                       // 000000000C9C: DCC80000 00000702
              label_0329:
                s_endpgm                                              // 000000000CA4: BF810000
              end
              ; ----------------- CS Data ------------------------
              
              
              codeLenInByte        = 3752 bytes;
              objectHeaderByteSize = 512 bytes;
              
              
              ; launchModeFlags    = 0x00000200
              isAql                = TRUE;
              
              
              userElementCount     = 0;
              extUserElementCount  = 0;
              NumVgprs             = 10;
              NumSgprs             = 23;
              FloatMode            = 192;
              IeeeMode             = 0;
              FlatPtr32            = 0;
              ScratchSize          = 0 dwords/thread;
              LDSByteSize          = 32 bytes/workgroup (compile time only);
              ScratchWaveOffsetReg = s65535;
              ; texSamplerUsage         = 0x00000000
              ; constBufUsage           = 0x00000000
              
              
              ;COMPUTE_PGM_RSRC2       = 0x00009390
              COMPUTE_PGM_RSRC2:USER_SGPR      = 8
              COMPUTE_PGM_RSRC2:TGID_X_EN      = 1
              COMPUTE_PGM_RSRC2:TGID_Y_EN      = 1
              COMPUTE_PGM_RSRC2:TGID_Z_EN      = 1
              COMPUTE_PGM_RSRC2:TIDIG_COMP_CNT = 2
              COMPUTE_PGM_RSRC2:LDS_SIZE       = 1
              ; ---------------------------- HSA Code Descriptor ----------------------------
              code_type                                      = HSA_EXT_CODE_KERNEL
              workgroup_group_segment_byte_size              = 32
              kernarg_segment_byte_size                      = 56
              code.handle                                    = 256%
              kernarg_segment_alignment                      = 3
              group_segment_alignment                        = 3
              wavefront_size                                 = 6
              program_call_convention                        = 0
              module.handle                                  = 0
              symbol                                         = 3932
              hsail_profile                                  = HSA_EXT_BRIG_PROFILE_FULL
              hsail_machine_model                            = HSA_EXT_BRIG_MACHINE_LARGE
              debug_information.handle                       = 0
              agent_vendor                                   = AMD
              agent_name                                     = gfx7
              hsail_version_major                            = 0
              hsail_version_minor                            = 20140528
              ; ---------------------------- AMD Code Descriptor ----------------------------
              amd_code_version_major                         = 0
              amd_code_version_minor                         = 0
              struct_byte_size                               = 256
              target_chip                                    = gfx7
              kernel_code_entry_byte_offset                  = 256
              max_scratch_backing_memory_byte_size           = 0
              ; compute_pgm_rsrc1:
                granulated_workitem_vgpr_count                       = 2
                granulated_wavefront_sgpr_count                      = 3
                priority                                             = 0
                float_mode_round_32                                  = ROUND_TO_NEAREST_EVEN
                float_mode_round_16_64                               = ROUND_TO_NEAREST_EVEN
                float_mode_denorm_32                                 = FLUSH_SOURCE_OUTPUT
                float_mode_denorm_16_64                              = FLUSH_NONE
              ; compute_pgm_rsrc2:
                user_sgpr_count                                      = 8
                enable_sgpr_workgroup_id_x                           = TRUE
                enable_sgpr_workgroup_id_y                           = TRUE
                enable_sgpr_workgroup_id_z                           = TRUE
                enable_vgpr_workitem_id                              = X, Y, Z
                granulated_lds_size                                  = 0
              enable_sgpr_private_segment_buffer             = TRUE
              enable_sgpr_dispatch_ptr                       = TRUE
              enable_sgpr_kernarg_segment_ptr                = TRUE
              private_element_size                           = DWORD (4 bytes)
              is_ptr64                                       = TRUE
              gds_segment_byte_size                          = 0
              wavefront_sgpr_count                           = 23
              workitem_vgpr_count                            = 10
              kernarg_segment_byte_size                      = 56
              workgroup_group_segment_byte_size              = 32
              kernarg_segment_alignment                      = 3
              group_segment_alignment                        = 3
              code_alignment                                 = 3
              code_type                                      = HSA_EXT_CODE_KERNEL
              wavefront_size                                 = 6
              optimization_level                             = 2
              hsail_profile                                  = HSA_EXT_BRIG_PROFILE_FULL
              hsail_machine_model                            = HSA_EXT_BRIG_MACHINE_LARGE
              hsail_version_major                            = 0
              hsail_version_minor                            = 20140528
              

               

              Such long code leads to low performance reductions. What is the actual cause of all these instructions? Does it have to do with the supporting ISA of Bonaire? Does it support the so called "swizzle operations" for exchanging data between lanes in the same wavefront? Or is it just a driver bug/limitation?

                • Re: Poor workgroup reduction function performance (OpenCL 2.0)
                  ravkum

                  Thank you. Just wanted to let you know that we are looking into it and will get back to you.

                   

                  Regards,

                  Ravi

                  • Re: Poor workgroup reduction function performance (OpenCL 2.0)
                    dipak

                    Here is an update. A internal bug report has been filed against this issue. Will keep you updated. Thanks for your patience.

                      • Re: Poor workgroup reduction function performance (OpenCL 2.0)
                        ekondis

                        Great!

                        I also tried with the latest driver (15.7). I have to report that though the performance has been improved it does not match the shared memory version. On a Bonaire GPU using partially shared memory and subgroup functions exhibits performance of about 67% of the shared memory only and using the workgroup function shows performance of about a third (34%) of the shared memory only version.

                         

                        Workgroup and sub-workgroup OpenCL 2.0 function evaluation test case
                        Platform/Device selection
                        Total platforms: 2
                        1. NVIDIA CUDA
                                1. GeForce GTX 660/NVIDIA Corporation
                        2. AMD Accelerated Parallel Processing
                                1. Bonaire/Advanced Micro Devices, Inc.
                                2. Intel(R) Pentium(R) 4 CPU 3.06GHz/GenuineIntel
                        Select platform index: 2
                        Select device index: 1
                        
                        
                        Device info
                        Platform:       AMD Accelerated Parallel Processing
                        Device:         Bonaire
                        Driver version: 1800.5 (VM)
                        OpenCL version: OpenCL 2.0 AMD-APP (1800.5)
                        Great! OpenCL 2.0 is supported :)
                        Building kernel with options "-cl-std=CL2.0 -cl-uniform-work-group-size -DK3 -DK2 -DWAVEFRONT_SIZE=64"
                        
                        
                        1. Shared memory only kernel
                        Executing...Done!
                        Output: 4294901760 / Time: 0.0706873 msecs (1.85425 billion elements/second)
                        PASSED!
                        
                        
                        2. Hybrid kernel via subgroup functions
                        Executing...Done!
                        Output: 4294901760 / Time: 0.105353 msecs (1.24413 billion elements/second)
                        Relative speed-up with respect to kernel 1: 0.67096
                        PASSED!
                        
                        
                        3. Workgroup function kernel
                        Executing...Done!
                        Output: 4294901760 / Time: 0.203889 msecs (0.64286 billion elements/second)
                        Relative speed-up with respect to kernel 1: 0.346695
                        PASSED!
                        
                        
                        
                        
                        
                        
                        
                        
                        
                    • Re: Poor workgroup reduction function performance (OpenCL 2.0)
                      pszilard

                      Let me revive this thread. I would have great use of lane shuffle operations exposed, but to most purposes the OpenCL 2.0 workgroup reduction (and some voting ops) would be enough. However, i'd have to be able to trust the implementation do the right thing.

                      Except that even though the OpenCL 2.0 driver is (officially) out of beta, the workgroup funtions' implementation in the OpenCL 2.0 compiler/runtime seems to still be rather poor. With the OPs microbenchmark, for some workload sizes the "hybrid" reduction can be up to 20% faster on Fiji, but for other sizes it's slower. On Fiji the pure wg version seemed to always be 1.5-2x slower than the shmem-based reduction.

                       

                       

                      On Hawaii I get very poor performance, the hybrid version is 2-4x slower, the pure WG version 6-10x slower.

                       

                      I had a hunch and tweaked the hybrid version to let the local memory-based reduction run down to 32 and 16 and do sub_group_reduce_add() across half and quarter vawefront only. Interestingly, this improved the performance on Hawaii; best case I got nearly 2x with sub_group_reduce across a qurter wave on Hawaii with large global work sizes (making Hawaii hierarchical reduction best case 10% slower than local mem-based). On Fiji the same tweaks reduced performance quite a bit.

                       

                      What's strange is that explicit unrolling of the local mem reduction loop causes the performance to regress. Any ideas why?

                       

                      All in all, my results show a pretty uncomfortable reality: the advantages of the OCL 2.0 wg functions are greatly diminished by the quality of the compiler / runtime.

                        • Re: Poor workgroup reduction function performance (OpenCL 2.0)
                          ekondis

                          According to my experiments on a 3rd generation GCN GPU (Tonga) I confirm that subgroup reductions provided slightly improved performance compared to shared mem reduction. However, workgroup reduction is always slower. Maybe workgroup data sharing is not effective without the use of shared memory. Furthermore, using a 2nd gen GCN GPU (Bonaire) it seems it is always faster to use shared memory reductions and by a wider margin.

                           

                          OpenCL is about performance and getting slower performance by using new functions practically makes them meaningless.

                            • Re: Poor workgroup reduction function performance (OpenCL 2.0)
                              pszilard

                              Thanks for the feedback! I guess you used the default global work size hard-coded in the source, right? For a thorough tests it would make sense to generate performance as a function of work size (I did see a slight dependence). However, as long as the implementation (and/or docs) does not improve, the current feedback should be enough incentive for AMD to take action.

                               

                              Have you tried the subgroup reduction across half or quarter wave? That improved performance for me on Hawaii.

                                • Re: Poor workgroup reduction function performance (OpenCL 2.0)
                                  ekondis

                                  Both workgroup and NDrange sizes are hardcoded in the source (256 and 256x64x8=131072, respectively). In order to keep it simple I didn't apply any scaling on the NDRange, just a fixed one. At least this performance behavior is not intolerable on GCN 3rd gen devices.

                                   

                                  I guess that partial reductions on wave are not valid. In paragraph "9.17.3.4" of the opencl-2.0-extensions manual by Khronos it states:

                                  The OpenCL C programming language implements the following built-in functions that operate on a sub-group level. These built-in functions must be encountered by all work-items in a subgroup executing the kernel.

                                  These functions include subgroup reductions.

                              • Re: Poor workgroup reduction function performance (OpenCL 2.0)
                                golgo_13

                                This issue has not escaped our attention, but there is a great deal of competition for that attention.  Also, we don't have quite the flexibility of implementation in the library as do OpenCL kernel developers.  Finally, cross-lane support is not very strong prior to "Volcanic Islands" ASICs.

                                 

                                We are planning to improve our implementation in an upcoming release.

                                  • Re: Poor workgroup reduction function performance (OpenCL 2.0)
                                    ekondis

                                    Thank you for your response.

                                    The experimental results I've got are consistent with your statement on VI ASICs.

                                    Any improvement on the efficiency of workgroup/subgroup functions is welcome.

                                    • Re: Poor workgroup reduction function performance (OpenCL 2.0)
                                      pszilard

                                      > Finally, cross-lane support is not very strong prior to "Volcanic Islands" ASICs.

                                      Could you elaborate what does that mean, e.g. in terms of bytes/clock shifted across SIMD lanes? I can't find much information except some slides that were floating around on the Internet (can't find them right now) that claimed improved lane shuffle in Fiji, I think.

                                       

                                      > We are planning to improve our implementation in an upcoming release.

                                      Improving workgroup reduction to make it at least match the local-memory-based implementations is in my opinion important if not deal-breaker to adoption.

                                      However, in my opinion, you should strongly consider exposing ASAP cross-lane operations through a CL extension. This is a primitive that can allow elegant implementation of many algorithms/parallel primitives. Given how great has been the reception of "warp shuffle" introduced by NV 2 gen ago, I expect that your dev community would applaud and adopt the feature quickly.