
Neverhood
Adept II

Does inefficiently generated IL code affect overall performance?

Hello, everyone!

Assume we have this simple kernel:

__kernel void test( __global uint2* data )

{

    uint gid = get_global_id( 0 );

   

    uint2 a = data[ gid ];

    uint2 b = a >> 7;

    a = a << (32 - 7);

   

    a.xy = a.xy | b.yx;   

    data[ gid ] = a;

}

AMD APP Kernel Analyzer shows IL code for that kernel:

;ARGEND:__OpenCL_test_kernel

func 1027 ; test                        ; @__OpenCL_test_kernel

; BB#0:                                 ; %entry

    mov r65, r1021.xyz0

    mov r65.x___, r65.x000

    ishl r65.x___, r65.x, l11

    iadd r65.x___, r1.x, r65.x

    uav_raw_load_id(10)_cached_aligned r1011.xy__, r65.x

    mov r66.xy__, r1011.xyxy

    mov r67.xy__, l13

    ushr r68.xy__, r66.xyxy, r67.xyxy

    mov r67.x___, r68.x000

    mov r68.x___, r68.y000

    mov r68.xy__, r68.x

    iadd r67.xy__, r68.x000, r67.0x00

    mov r68.xy__, l14

    ishl r66.xy__, r66.xyxy, r68.xyxy

    ior r66.xy__, r66.xyxy, r67.xyxy

    uav_raw_store_id(10) mem0.xy__, r65.x, r66.xyxy

    ret_dyn

I don't understand why this code is so suboptimal, because for simple code like this:

    a = a << (32 - 7);

    a.xy = a.xy | b.yx; 

we get this:

    mov r67.x___, r68.x000

    mov r68.x___, r68.y000

    mov r68.xy__, r68.x

    iadd r67.xy__, r68.x000, r67.0x00

    mov r68.xy__, l14

    ishl r66.xy__, r66.xyxy, r68.xyxy

    ior r66.xy__, r66.xyxy, r67.xyxy

instead of this:

    ishl r66.xy__, r66.xyxy, l14.xyxy

    ior r66.xy__, r66.xyxy, r68.yxyx

So basically, I have two questions:

1) Is this normal behavior for the OpenCL compiler? Or is there something more I should know to avoid this style of instruction generation?

2) Does the generated IL code have the same performance as my manually written IL code in this example?

All this is important to me because the HD 5750 seems to have a better IL-to-ISA compiler than the HD 7850: on the latter, this code performs worse than a fair 64-bit rotate, while on the former it performs better.

Best regards, Dmitry.

8 Replies
himanshu_gautam
Grandmaster

Edit -- sorry that was a dumb thing to ask.

IL is generic. I mistook it for the ISA. Sorry.

himanshu_gautam
Grandmaster

I can make an educated guess here.

Since IL is common to all hardware families, not all optimizations can be applied at the IL level.

IL must be generic enough to be translated to different ISAs.

Possibly, the IL to ISA translation does the final round of optimization.


I agree that the generated IL code is not optimized in most cases.

But if I were the compiler, I would just map OpenCL code to IL code without optimization.

The IL generated just for permuting the x and y vector components is excessive. The IL specification allows this to be expressed much more compactly, and it would not require any optimization from the OpenCL compiler: just a precise mapping of one operation to another.

And the question from the topic title is still open...


I have asked a more knowledgeable person to take a look at this.
He may be able to answer you. Please wait a bit. Thanks for your patience.

coordz
Adept II

From what I know, the shader compiler will do an excellent job of getting rid of redundant mov instructions, as this will be part of its basic dependency analysis. I also believe the SC prefers "loose" IL rather than "tight" IL, as it opens up more optimization opportunities and gives it better insight into what the original code intended.

To be concrete to your questions:

1) This is normal behaviour for the OCL compiler.

2) I suspect the performance will be almost identical between hand written code and this example.


coordz wrote:

shader compiler will do an excellent job of getting rid of redundant mov instructions

...

1) This is normal behaviour for the OCL compiler.

2) I suspect the performance will be almost identical between hand written code and this example.

And what about the iadd instruction? If it is also removed, I'm OK with this code, but if not...

1) If that is true, then it seems very strange, in my opinion.

2) My experiments show that a fair 64-bit rotate is faster than my code. But if I rewrite the code like this:

__kernel void test( __global uint2* data )

{

    uint gid = get_global_id( 0 );

  

    uint2 a = data[ gid ];

   

    uint bx = a.x >> 7;

    uint by = a.y >> 7;

   

    a.x = a.x << (32 - 7);

    a.y = a.y << (32 - 7);

   

    a.x = a.x | by;

    a.y = a.y | bx;

    data[ gid ] = a;

}

it will be almost as fast as the fair 64-bit rotate.

As you can see, the logic of the code is the same. But this code is more complicated to write in C than the one in my first message.

I hoped OpenCL would do all the optimization for me; instead I have to do it on my own, and not always in an obvious way.


Neverhood wrote:

coordz wrote:

shader compiler will do an excellent job of getting rid of redundant mov instructions

...

1) This is normal behaviour for the OCL compiler.

2) I suspect the performance will be almost identical between hand written code and this example.

And what about the iadd instruction? If it is also removed, I'm OK with this code, but if not...

iadd (the shuffle instruction from LLVM-IR below) should be removed by the shader compiler (finalizer).

LLVM-IR

  %tmp4 = lshr <2 x i32> %tmp2, <i32 7, i32 7>

  %tmp6 = shl <2 x i32> %tmp2, <i32 25, i32 25>

  %tmp16 = shufflevector <2 x i32> %tmp4, <2 x i32> undef, <2 x i32> <i32 1, i32 0>

  %tmp17 = or <2 x i32> %tmp6, %tmp16

The shader compiler (SC) has more knowledge about HW capabilities. So it was decided that for better performance the final optimization will be done at the SC level, when the actual ISA is generated.

Tahiti ISA
  tbuffer_load_format_xy  v[1:2], v0, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT] // 00000030: EBD91000 80010100
  s_waitcnt     vmcnt(0)                                    // 00000038: BF8C1F70
  v_lshrrev_b32  v3, 7, v1                                  // 0000003C: 2C060287
  v_lshrrev_b32  v4, 7, v2                                  // 00000040: 2C080487
  v_lshlrev_b32  v1, 25, v1                                 // 00000044: 34020299
  v_lshlrev_b32  v2, 25, v2                                 // 00000048: 34040499
  v_or_b32      v1, v4, v1                                  // 0000004C: 38020304
  v_or_b32      v2, v3, v2                                  // 00000050: 38040503
  tbuffer_store_format_xy  v[1:2], v0, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT] // 00000054: EBDD1000 80010100

Cypress ISA
01 TEX: ADDR(48) CNT(1)
      7  VFETCH R0.xy__, R0.z, fc174  FORMAT(32_32_FLOAT) MEGA(8)
         FETCH_TYPE(NO_INDEX_OFFSET)
02 ALU: ADDR(41) CNT(7)
      8  x: LSHR        ____,  R0.y,  7     
         y: LSHR        ____,  R0.x,  7     
         z: LSHL        ____,  R0.y,  25     
         w: LSHL        ____,  R0.x,  25     
      9  x: OR_INT      R0.x,  PV8.x,  PV8.w     
         y: OR_INT      R0.y,  PV8.y,  PV8.z     
03 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R1].xy__, R0, ARRAY_SIZE(4)  MARK  VPM 


Thanks, German!

Now I can see there are no additional instructions in the ISA.

I wish we had more control over exactly which instructions get executed on the GPU, like an assembler for x86.
