
Does inefficiently generated IL code affect overall performance?

Question asked by Neverhood on Feb 4, 2013
Latest reply on Feb 7, 2013 by Neverhood

Hello, everyone!

 

Assume we have this simple kernel:

__kernel void test( __global uint2* data )
{
    uint gid = get_global_id( 0 );

    uint2 a = data[ gid ];
    uint2 b = a >> 7;
    a = a << (32 - 7);

    a.xy = a.xy | b.yx;
    data[ gid ] = a;
}
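For reference, the kernel body can be mirrored on the host in plain C. The sketch below assumes that data[gid].x holds the low 32 bits and data[gid].y the high 32 bits of a 64-bit value; under that assumption the kernel computes a 64-bit rotate left by 25 (equivalently, a rotate right by 39). The names uint2_t, test_body, rotl64 and check_rotl25 are hypothetical helpers introduced only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side stand-in for OpenCL's uint2 (x = low word, y = high word). */
typedef struct { uint32_t x, y; } uint2_t;

/* Mirrors the kernel body: shift both halves, then OR with the swapped halves. */
uint2_t test_body(uint32_t x, uint32_t y)
{
    uint2_t b = { x >> 7, y >> 7 };            /* b = a >> 7        */
    uint2_t a = { x << 25, y << 25 };          /* a = a << (32 - 7) */
    uint2_t r = { a.x | b.y, a.y | b.x };      /* a.xy | b.yx       */
    return r;
}

/* Plain 64-bit rotate left; valid for 0 < n < 64. */
uint64_t rotl64(uint64_t v, unsigned n)
{
    return (v << n) | (v >> (64 - n));
}

/* Aborts via assert if the mirror and the 64-bit rotate disagree for this input. */
void check_rotl25(uint32_t lo, uint32_t hi)
{
    uint2_t out = test_body(lo, hi);
    uint64_t expected = rotl64(((uint64_t)hi << 32) | lo, 25);
    assert(out.x == (uint32_t)expected);
    assert(out.y == (uint32_t)(expected >> 32));
}
```

Calling check_rotl25 with a few arbitrary word pairs is a quick way to confirm the equivalence before comparing timings against a ulong-based rotate.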

 

AMD APP Kernel Analyzer shows the following IL code for that kernel:

;ARGEND:__OpenCL_test_kernel
func 1027 ; test                        ; @__OpenCL_test_kernel
; BB#0:                                 ; %entry
    mov r65, r1021.xyz0
    mov r65.x___, r65.x000
    ishl r65.x___, r65.x, l11
    iadd r65.x___, r1.x, r65.x
    uav_raw_load_id(10)_cached_aligned r1011.xy__, r65.x
    mov r66.xy__, r1011.xyxy
    mov r67.xy__, l13
    ushr r68.xy__, r66.xyxy, r67.xyxy
    mov r67.x___, r68.x000
    mov r68.x___, r68.y000
    mov r68.xy__, r68.x
    iadd r67.xy__, r68.x000, r67.0x00
    mov r68.xy__, l14
    ishl r66.xy__, r66.xyxy, r68.xyxy
    ior r66.xy__, r66.xyxy, r67.xyxy
    uav_raw_store_id(10) mem0.xy__, r65.x, r66.xyxy
    ret_dyn

 

I don't understand why the generated code is so far from optimal. For simple code like this:

    a = a << (32 - 7);
    a.xy = a.xy | b.yx;

 

we get this:

    mov r67.x___, r68.x000
    mov r68.x___, r68.y000
    mov r68.xy__, r68.x
    iadd r67.xy__, r68.x000, r67.0x00
    mov r68.xy__, l14
    ishl r66.xy__, r66.xyxy, r68.xyxy
    ior r66.xy__, r66.xyxy, r67.xyxy

 

instead of this:

    ishl r66.xy__, r66.xyxy, l14.xyxy
    ior r66.xy__, r66.xyxy, r68.yxyx
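To make the comparison concrete, here is a small host-side C model of the swap step in both versions. It is only a sketch of one reading of the IL, assuming that a write mask such as x___ updates only the listed components, that a 0 in a source swizzle supplies the constant zero, and that r68.x as a source broadcasts the x component. The type reg2_t and both function names are hypothetical, and only the x and y components are modelled. Under those assumptions the three mov instructions plus the iadd leave (b.y, b.x) in r67, which is what the single .yxyx swizzle expresses.

```c
#include <stdint.h>

/* Hypothetical model of the .x and .y components of one IL register. */
typedef struct { uint32_t x, y; } reg2_t;

/* Generated sequence: r68 holds b = a >> 7 on entry; returns r67. */
reg2_t swap_via_mov_iadd(reg2_t r68)
{
    reg2_t r67 = { 0u, 0u };
    uint32_t sum_x, sum_y;

    r67.x = r68.x;        /* mov  r67.x___, r68.x000          */
    r68.x = r68.y;        /* mov  r68.x___, r68.y000          */
    r68.y = r68.x;        /* mov  r68.xy__, r68.x (broadcast) */

    sum_x = r68.x + 0u;   /* iadd r67.xy__, r68.x000, r67.0x00 */
    sum_y = 0u + r67.x;   /* both sources are read before the write */
    r67.x = sum_x;
    r67.y = sum_y;
    return r67;           /* (b.y, b.x) */
}

/* Hand-written form: the swap is folded into the source swizzle r68.yxyx. */
reg2_t swap_via_swizzle(reg2_t r68)
{
    reg2_t swapped;
    swapped.x = r68.y;
    swapped.y = r68.x;
    return swapped;
}
```

Whether the extra instructions survive into the final ISA is a separate matter that this model cannot show; it only checks that the two IL forms produce the same values.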

 

So basically, I have two questions:

1) Is this normal behavior for the OpenCL compiler, or is there something more I should know to avoid such instruction sequences?

2) Does the generated IL code perform the same as my manually written IL code in this example?

 

All of this matters to me because the HD 5750 seems to have a better IL-to-ASM compiler than the HD 7850: on the HD 7850 this code performs worse than a straightforward 64-bit rotate, while on the HD 5750 it performs better.

 

Best regards, Dmitry.
