Archives Discussions

Neverhood · ‎02-04-2013

Hello, everyone!

Assume we have this simple kernel:

__kernel void test( __global uint2* data )

{

uint gid = get_global_id( 0 );

uint2 a = data[ gid ];

uint2 b = a >> 7;

a = a << (32 - 7);

a.xy = a.xy | b.yx;

data[ gid ] = a;

}

AMD APP Kernel Analyzer shows IL code for that kernel:

;ARGEND:__OpenCL_test_kernel

func 1027 ; test ; @__OpenCL_test_kernel

; BB#0: ; %entry

mov r65, r1021.xyz0

mov r65.x___, r65.x000

ishl r65.x___, r65.x, l11

iadd r65.x___, r1.x, r65.x

uav_raw_load_id(10)_cached_aligned r1011.xy__, r65.x

mov r66.xy__, r1011.xyxy

mov r67.xy__, l13

ushr r68.xy__, r66.xyxy, r67.xyxy

mov r67.x___, r68.x000

mov r68.x___, r68.y000

mov r68.xy__, r68.x

iadd r67.xy__, r68.x000, r67.0x00

mov r68.xy__, l14

ishl r66.xy__, r66.xyxy, r68.xyxy

ior r66.xy__, r66.xyxy, r67.xyxy

uav_raw_store_id(10) mem0.xy__, r65.x, r66.xyxy

ret_dyn

I don't know, why this code is so nonoptimal, because for simple code like this one:

a = a << (32 - 7);

a.xy = a.xy | b.yx;

we get this:

mov r67.x___, r68.x000

mov r68.x___, r68.y000

mov r68.xy__, r68.x

iadd r67.xy__, r68.x000, r67.0x00

mov r68.xy__, l14

ishl r66.xy__, r66.xyxy, r68.xyxy

ior r66.xy__, r66.xyxy, r67.xyxy

instead of this:

ishl r66.xy__, r66.xyxy, l14.xyxy

ior r66.xy__, r66.xyxy, r68.yxyx

So basically, I have two questions:

1) Is this normal behavior for OpenCL compiler? Or should I know something in additional to avoid such instructions spelling?

2) Whether the generated IL code has the same performance as my, manually written IL code in this example?

All this is important to me, because HD 5750 seems to have better IL-2-ASM compiler, than HD 7850: the latter has a lower performance comparing to fair 64-bit rotating, while the former has a better performance.

Best regards, Dmitry.

german · ‎02-06-2013

Neverhood wrote:
coordz wrote:
shader compiler will do an excellent job of getting rid of redundant mov instructions
...
1) This is normal behaviour for the OCL compiler.
2) I suspect the performance will be almost identical between hand written code and this example.
And what about iadd instruction? If it also will be removed, I'm ok with this code, but if not...

iadd (the shuffle instruction from LLVM-IR below) should be removed by the shader compiler (finalizer).

LLVM-IR

%tmp4 = lshr <2 x i32> %tmp2, <i32 7, i32 7>

%tmp6 = shl <2 x i32> %tmp2, <i32 25, i32 25>

%tmp16 = shufflevector <2 x i32> %tmp4, <2 x i32> undef, <2 x i32> <i32 1, i32 0>

%tmp17 = or <2 x i32> %tmp6, %tmp16

The shader compiler (SC) has more knowledge about HW capabilities. So it was decided that for better performance the final optimization will be done at the SC level, when the actual ISA is generated.

Tahiti ISA
tbuffer_load_format_xy v[1:2], v0, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT] // 00000030: EBD91000 80010100
s_waitcnt     vmcnt(0)                                    // 00000038: BF8C1F70
v_lshrrev_b32 v3, 7, v1                                  // 0000003C: 2C060287
v_lshrrev_b32 v4, 7, v2                                  // 00000040: 2C080487
v_lshlrev_b32 v1, 25, v1                                 // 00000044: 34020299
v_lshlrev_b32 v2, 25, v2                                 // 00000048: 34040499
v_or_b32      v1, v4, v1                                  // 0000004C: 38020304
v_or_b32      v2, v3, v2                                  // 00000050: 38040503
tbuffer_store_format_xy v[1:2], v0, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT] // 00000054: EBDD1000 80010100

Cypress ISA
01 TEX: ADDR(48) CNT(1)
      7 VFETCH R0.xy__, R0.z, fc174 FORMAT(32_32_FLOAT) MEGA(8)
         FETCH_TYPE(NO_INDEX_OFFSET)
02 ALU: ADDR(41) CNT(7)
      8 x: LSHR        ____, R0.y, 7
         y: LSHR        ____, R0.x, 7
         z: LSHL        ____, R0.y, 25
         w: LSHL        ____, R0.x, 25
      9 x: OR_INT      R0.x, PV8.x, PV8.w
         y: OR_INT      R0.y, PV8.y, PV8.z
03 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R1].xy__, R0, ARRAY_SIZE(4) MARK VPM

View solution in original post

himanshu_gautam · ‎02-04-2013

Edit -- sorry that was a dumb thing to ask.

IL is generic. I mistook it to be the ISA. Sorry.

himanshu_gautam · ‎02-05-2013

I can make an educated guess here.

Since IL is common to all hardware families -- not all optimizations can be applied at the IL level.

IL must be generic enough to be translated to different ISAs.

Possibly, the IL to ISA translation does the final round of optimization.

Neverhood · ‎02-05-2013

I agree, that generated IL code is not optimized in most cases.

But if I were compiler, I would just map OpenCL code to IL code without optimization.

Generated IL code for just vector component x and y permutation is excessive. IL specification allows to do that much better. And it would not require any optimization from OpenCL compiler. Just precise mapping one operation to another.

And question from topic title is still open...

himanshu_gautam · ‎02-05-2013

I have asked a more knowledge person to take a look at this.
He may be able to answer you. Request you to wait. Thanks for your patience.

coordz · ‎02-06-2013

From what I know, the shader compiler will do an excellent job of getting rid of redundant mov instructions as this will part of its basic dependency analysis. I also believe SC prefers "loose" IL rather than "tight" IL as it opens up more optimization opportunities and gives it a better insight into what the original code intention was.

To be concrete to your questions:

1) This is normal behaviour for the OCL compiler.

2) I suspect the performance will be almost identical between hand written code and this example.

Neverhood · ‎02-06-2013

coordz wrote:
shader compiler will do an excellent job of getting rid of redundant mov instructions
...
1) This is normal behaviour for the OCL compiler.
2) I suspect the performance will be almost identical between hand written code and this example.

And what about iadd instruction? If it also will be removed, I'm ok with this code, but if not...

1) If it is true, than it is very strange, for my opinion.

2) My experiments shows, that fair 64-bit rotating is faster, than my code. But if I rewrite code like this:

__kernel void test( __global uint2* data )

{

uint gid = get_global_id( 0 );

uint2 a = data[ gid ];

uint bx = a.x >> 7;

uint by = a.y >> y;

a.x = a.x << (32 - 7);

a.y = a.y << (32 - 7);

a.x = a.x | b.y;

a.y = a.y | b.x;

data[ gid ] = a;

}

it will almost be as fast as fair 64-bit rotating.

As you can see, the logic of code is the same. But that code is more complicated to write on C, than one in my first message.

I hoped OpenCL will do all optimization for me, instead I have to do it my own and not always in obvious way.

german · ‎02-06-2013

Neverhood wrote:
coordz wrote:
shader compiler will do an excellent job of getting rid of redundant mov instructions
...
1) This is normal behaviour for the OCL compiler.
2) I suspect the performance will be almost identical between hand written code and this example.
And what about iadd instruction? If it also will be removed, I'm ok with this code, but if not...

iadd (the shuffle instruction from LLVM-IR below) should be removed by the shader compiler (finalizer).

LLVM-IR

%tmp4 = lshr <2 x i32> %tmp2, <i32 7, i32 7>

%tmp6 = shl <2 x i32> %tmp2, <i32 25, i32 25>

%tmp16 = shufflevector <2 x i32> %tmp4, <2 x i32> undef, <2 x i32> <i32 1, i32 0>

%tmp17 = or <2 x i32> %tmp6, %tmp16

The shader compiler (SC) has more knowledge about HW capabilities. So it was decided that for better performance the final optimization will be done at the SC level, when the actual ISA is generated.

Tahiti ISA
tbuffer_load_format_xy v[1:2], v0, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT] // 00000030: EBD91000 80010100
s_waitcnt     vmcnt(0)                                    // 00000038: BF8C1F70
v_lshrrev_b32 v3, 7, v1                                  // 0000003C: 2C060287
v_lshrrev_b32 v4, 7, v2                                  // 00000040: 2C080487
v_lshlrev_b32 v1, 25, v1                                 // 00000044: 34020299
v_lshlrev_b32 v2, 25, v2                                 // 00000048: 34040499
v_or_b32      v1, v4, v1                                  // 0000004C: 38020304
v_or_b32      v2, v3, v2                                  // 00000050: 38040503
tbuffer_store_format_xy v[1:2], v0, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT] // 00000054: EBDD1000 80010100

Cypress ISA
01 TEX: ADDR(48) CNT(1)
      7 VFETCH R0.xy__, R0.z, fc174 FORMAT(32_32_FLOAT) MEGA(8)
         FETCH_TYPE(NO_INDEX_OFFSET)
02 ALU: ADDR(41) CNT(7)
      8 x: LSHR        ____, R0.y, 7
         y: LSHR        ____, R0.x, 7
         z: LSHL        ____, R0.y, 25
         w: LSHL        ____, R0.x, 25
      9 x: OR_INT      R0.x, PV8.x, PV8.w
         y: OR_INT      R0.y, PV8.y, PV8.z
03 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R1].xy__, R0, ARRAY_SIZE(4) MARK VPM

Neverhood · ‎02-07-2013

Thanks, German!

Now I can see there is no additional instructions in ISA.

I wish we would ever have more control over what instructions exactly would be executed on GPU. Like assembler for x86.

Archives Discussions

Does inefficiently generated IL code affect overall performance?