8 Replies Latest reply on Feb 7, 2013 2:41 AM by Neverhood

    Does inefficiently generated IL code affect overall performance?

    Neverhood

      Hello, everyone!

       

      Assume we have this simple kernel:

      __kernel void test( __global uint2* data )

      {

          uint gid = get_global_id( 0 );

         

          uint2 a = data[ gid ];

          uint2 b = a >> 7;

          a = a << (32 - 7);

         

          a.xy = a.xy | b.yx;   

          data[ gid ] = a;

      }

       

      AMD APP Kernel Analyzer shows IL code for that kernel:

      ;ARGEND:__OpenCL_test_kernel

      func 1027 ; test                        ; @__OpenCL_test_kernel

      ; BB#0:                                 ; %entry

          mov r65, r1021.xyz0

          mov r65.x___, r65.x000

          ishl r65.x___, r65.x, l11

          iadd r65.x___, r1.x, r65.x

          uav_raw_load_id(10)_cached_aligned r1011.xy__, r65.x

          mov r66.xy__, r1011.xyxy

          mov r67.xy__, l13

          ushr r68.xy__, r66.xyxy, r67.xyxy

          mov r67.x___, r68.x000

          mov r68.x___, r68.y000

          mov r68.xy__, r68.x

          iadd r67.xy__, r68.x000, r67.0x00

          mov r68.xy__, l14

          ishl r66.xy__, r66.xyxy, r68.xyxy

          ior r66.xy__, r66.xyxy, r67.xyxy

          uav_raw_store_id(10) mem0.xy__, r65.x, r66.xyxy

          ret_dyn

       

      I don't know, why this code is so nonoptimal, because for simple code like this one:

          a = a << (32 - 7);

          a.xy = a.xy | b.yx; 

       

      we get this:

          mov r67.x___, r68.x000

          mov r68.x___, r68.y000

          mov r68.xy__, r68.x

          iadd r67.xy__, r68.x000, r67.0x00

          mov r68.xy__, l14

          ishl r66.xy__, r66.xyxy, r68.xyxy

          ior r66.xy__, r66.xyxy, r67.xyxy

       

      instead of this:

          ishl r66.xy__, r66.xyxy, l14.xyxy

          ior r66.xy__, r66.xyxy, r68.yxyx

       

      So basically, I have two questions:

      1) Is this normal behavior for OpenCL compiler? Or should I know something in additional to avoid such instructions spelling?

      2) Whether the generated IL code has the same performance as my, manually written IL code in this example?

       

      All this is important to me, because HD 5750 seems to have better IL-2-ASM compiler, than HD 7850: the latter has a lower performance comparing to fair 64-bit rotating, while the former has a better performance.

       

      Best regards, Dmitry.

        • Re: Does inefficiently generated IL code affect overall performance?
          himanshu.gautam

          Edit -- sorry that was a dumb thing to ask.

          IL is generic. I mistook it to be the ISA. Sorry.

          • Re: Does inefficiently generated IL code affect overall performance?
            himanshu.gautam

            I can make an educated guess here.

             

            Since IL is common to all hardware families -- not all optimizations can be applied at the IL level.

            IL must be generic enough to be translated to different ISAs.

            Possibly, the IL to ISA translation does the final round of optimization.

            • Re: Does inefficiently generated IL code affect overall performance?
              coordz

              From what I know, the shader compiler will do an excellent job of getting rid of redundant mov instructions as this will part of its basic dependency analysis. I also believe SC prefers "loose" IL rather than "tight" IL as it opens up more optimization opportunities and gives it a better insight into what the original code intention was.

               

              To be concrete to your questions:

               

              1) This is normal behaviour for the OCL compiler.

              2) I suspect the performance will be almost identical between hand written code and this example.

                • Re: Does inefficiently generated IL code affect overall performance?
                  Neverhood

                  coordz wrote:

                   

                  shader compiler will do an excellent job of getting rid of redundant mov instructions

                  ...

                  1) This is normal behaviour for the OCL compiler.

                  2) I suspect the performance will be almost identical between hand written code and this example.

                  And what about iadd instruction? If it also will be removed, I'm ok with this code, but if not...

                   

                  1) If it is true, than it is very strange, for my opinion.

                  2) My experiments shows, that fair 64-bit rotating is faster, than my code. But if I rewrite code like this:

                  __kernel void test( __global uint2* data )

                  {

                      uint gid = get_global_id( 0 );

                    

                      uint2 a = data[ gid ];

                     

                      uint bx = a.x >> 7;

                      uint by = a.y >> y;

                     

                      a.x = a.x << (32 - 7);

                      a.y = a.y << (32 - 7);

                     

                      a.x = a.x | b.y;

                      a.y = a.y | b.x;

                   

                      data[ gid ] = a;

                  }

                   

                  it will almost be as fast as fair 64-bit rotating.

                   

                  As you can see, the logic of code is the same. But that code is more complicated to write on C, than one in my first message.

                  I hoped OpenCL will do all optimization for me, instead I have to do it my own and not always in obvious way.

                    • Re: Does inefficiently generated IL code affect overall performance?
                      german

                      Neverhood wrote:

                       

                      coordz wrote:

                       

                      shader compiler will do an excellent job of getting rid of redundant mov instructions

                      ...

                      1) This is normal behaviour for the OCL compiler.

                      2) I suspect the performance will be almost identical between hand written code and this example.

                      And what about iadd instruction? If it also will be removed, I'm ok with this code, but if not...

                       

                      iadd (the shuffle instruction from LLVM-IR below) should be removed by the shader compiler (finalizer).

                      LLVM-IR

                        %tmp4 = lshr <2 x i32> %tmp2, <i32 7, i32 7>

                        %tmp6 = shl <2 x i32> %tmp2, <i32 25, i32 25>

                      %tmp16 = shufflevector <2 x i32> %tmp4, <2 x i32> undef, <2 x i32> <i32 1, i32 0>

                        %tmp17 = or <2 x i32> %tmp6, %tmp16

                       

                      The shader compiler (SC) has more knowledge about HW capabilities. So it was decided that for better performance the final optimization will be done at the SC level, when the actual ISA is generated.

                       

                      Tahiti ISA
                        tbuffer_load_format_xy  v[1:2], v0, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT] // 00000030: EBD91000 80010100
                        s_waitcnt     vmcnt(0)                                    // 00000038: BF8C1F70
                        v_lshrrev_b32  v3, 7, v1                                  // 0000003C: 2C060287
                        v_lshrrev_b32  v4, 7, v2                                  // 00000040: 2C080487
                        v_lshlrev_b32  v1, 25, v1                                 // 00000044: 34020299
                        v_lshlrev_b32  v2, 25, v2                                 // 00000048: 34040499
                        v_or_b32      v1, v4, v1                                  // 0000004C: 38020304
                        v_or_b32      v2, v3, v2                                  // 00000050: 38040503
                        tbuffer_store_format_xy  v[1:2], v0, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT] // 00000054: EBDD1000 80010100

                      Cypress ISA
                      01 TEX: ADDR(48) CNT(1)
                            7  VFETCH R0.xy__, R0.z, fc174  FORMAT(32_32_FLOAT) MEGA(8)
                               FETCH_TYPE(NO_INDEX_OFFSET)
                      02 ALU: ADDR(41) CNT(7)
                            8  x: LSHR        ____,  R0.y,  7     
                               y: LSHR        ____,  R0.x,  7     
                               z: LSHL        ____,  R0.y,  25     
                               w: LSHL        ____,  R0.x,  25     
                            9  x: OR_INT      R0.x,  PV8.x,  PV8.w     
                               y: OR_INT      R0.y,  PV8.y,  PV8.z     
                      03 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R1].xy__, R0, ARRAY_SIZE(4)  MARK  VPM