Hello, everyone!
Assume we have this simple kernel:
__kernel void test( __global uint2* data )
{
uint gid = get_global_id( 0 );
uint2 a = data[ gid ];
uint2 b = a >> 7;
a = a << (32 - 7);
a.xy = a.xy | b.yx;
data[ gid ] = a;
}
AMD APP Kernel Analyzer shows IL code for that kernel:
;ARGEND:__OpenCL_test_kernel
func 1027 ; test ; @__OpenCL_test_kernel
; BB#0: ; %entry
mov r65, r1021.xyz0
mov r65.x___, r65.x000
ishl r65.x___, r65.x, l11
iadd r65.x___, r1.x, r65.x
uav_raw_load_id(10)_cached_aligned r1011.xy__, r65.x
mov r66.xy__, r1011.xyxy
mov r67.xy__, l13
ushr r68.xy__, r66.xyxy, r67.xyxy
mov r67.x___, r68.x000
mov r68.x___, r68.y000
mov r68.xy__, r68.x
iadd r67.xy__, r68.x000, r67.0x00
mov r68.xy__, l14
ishl r66.xy__, r66.xyxy, r68.xyxy
ior r66.xy__, r66.xyxy, r67.xyxy
uav_raw_store_id(10) mem0.xy__, r65.x, r66.xyxy
ret_dyn
I don't know, why this code is so nonoptimal, because for simple code like this one:
a = a << (32 - 7);
a.xy = a.xy | b.yx;
we get this:
mov r67.x___, r68.x000
mov r68.x___, r68.y000
mov r68.xy__, r68.x
iadd r67.xy__, r68.x000, r67.0x00
mov r68.xy__, l14
ishl r66.xy__, r66.xyxy, r68.xyxy
ior r66.xy__, r66.xyxy, r67.xyxy
instead of this:
ishl r66.xy__, r66.xyxy, l14.xyxy
ior r66.xy__, r66.xyxy, r68.yxyx
So basically, I have two questions:
1) Is this normal behavior for OpenCL compiler? Or should I know something in additional to avoid such instructions spelling?
2) Whether the generated IL code has the same performance as my, manually written IL code in this example?
All this is important to me, because HD 5750 seems to have better IL-2-ASM compiler, than HD 7850: the latter has a lower performance comparing to fair 64-bit rotating, while the former has a better performance.
Best regards, Dmitry.
Solved! Go to Solution.
Neverhood wrote:
coordz wrote:
shader compiler will do an excellent job of getting rid of redundant mov instructions
...
1) This is normal behaviour for the OCL compiler.
2) I suspect the performance will be almost identical between hand written code and this example.
And what about iadd instruction? If it also will be removed, I'm ok with this code, but if not...
iadd (the shuffle instruction from LLVM-IR below) should be removed by the shader compiler (finalizer).
LLVM-IR
%tmp4 = lshr <2 x i32> %tmp2, <i32 7, i32 7>
%tmp6 = shl <2 x i32> %tmp2, <i32 25, i32 25>
%tmp16 = shufflevector <2 x i32> %tmp4, <2 x i32> undef, <2 x i32> <i32 1, i32 0>
%tmp17 = or <2 x i32> %tmp6, %tmp16
The shader compiler (SC) has more knowledge about HW capabilities. So it was decided that for better performance the final optimization will be done at the SC level, when the actual ISA is generated.
Tahiti ISA
tbuffer_load_format_xy v[1:2], v0, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT] // 00000030: EBD91000 80010100
s_waitcnt vmcnt(0) // 00000038: BF8C1F70
v_lshrrev_b32 v3, 7, v1 // 0000003C: 2C060287
v_lshrrev_b32 v4, 7, v2 // 00000040: 2C080487
v_lshlrev_b32 v1, 25, v1 // 00000044: 34020299
v_lshlrev_b32 v2, 25, v2 // 00000048: 34040499
v_or_b32 v1, v4, v1 // 0000004C: 38020304
v_or_b32 v2, v3, v2 // 00000050: 38040503
tbuffer_store_format_xy v[1:2], v0, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT] // 00000054: EBDD1000 80010100
Cypress ISA
01 TEX: ADDR(48) CNT(1)
7 VFETCH R0.xy__, R0.z, fc174 FORMAT(32_32_FLOAT) MEGA(8)
FETCH_TYPE(NO_INDEX_OFFSET)
02 ALU: ADDR(41) CNT(7)
8 x: LSHR ____, R0.y, 7
y: LSHR ____, R0.x, 7
z: LSHL ____, R0.y, 25
w: LSHL ____, R0.x, 25
9 x: OR_INT R0.x, PV8.x, PV8.w
y: OR_INT R0.y, PV8.y, PV8.z
03 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R1].xy__, R0, ARRAY_SIZE(4) MARK VPM
Edit -- sorry that was a dumb thing to ask.
IL is generic. I mistook it to be the ISA. Sorry.
I can make an educated guess here.
Since IL is common to all hardware families -- not all optimizations can be applied at the IL level.
IL must be generic enough to be translated to different ISAs.
Possibly, the IL to ISA translation does the final round of optimization.
I agree, that generated IL code is not optimized in most cases.
But if I were compiler, I would just map OpenCL code to IL code without optimization.
Generated IL code for just vector component x and y permutation is excessive. IL specification allows to do that much better. And it would not require any optimization from OpenCL compiler. Just precise mapping one operation to another.
And question from topic title is still open...
I have asked a more knowledge person to take a look at this.
He may be able to answer you. Request you to wait. Thanks for your patience.
From what I know, the shader compiler will do an excellent job of getting rid of redundant mov instructions as this will part of its basic dependency analysis. I also believe SC prefers "loose" IL rather than "tight" IL as it opens up more optimization opportunities and gives it a better insight into what the original code intention was.
To be concrete to your questions:
1) This is normal behaviour for the OCL compiler.
2) I suspect the performance will be almost identical between hand written code and this example.
coordz wrote:
shader compiler will do an excellent job of getting rid of redundant mov instructions
...
1) This is normal behaviour for the OCL compiler.
2) I suspect the performance will be almost identical between hand written code and this example.
And what about iadd instruction? If it also will be removed, I'm ok with this code, but if not...
1) If it is true, than it is very strange, for my opinion.
2) My experiments shows, that fair 64-bit rotating is faster, than my code. But if I rewrite code like this:
__kernel void test( __global uint2* data )
{
uint gid = get_global_id( 0 );
uint2 a = data[ gid ];
uint bx = a.x >> 7;
uint by = a.y >> y;
a.x = a.x << (32 - 7);
a.y = a.y << (32 - 7);
a.x = a.x | b.y;
a.y = a.y | b.x;
data[ gid ] = a;
}
it will almost be as fast as fair 64-bit rotating.
As you can see, the logic of code is the same. But that code is more complicated to write on C, than one in my first message.
I hoped OpenCL will do all optimization for me, instead I have to do it my own and not always in obvious way.
Neverhood wrote:
coordz wrote:
shader compiler will do an excellent job of getting rid of redundant mov instructions
...
1) This is normal behaviour for the OCL compiler.
2) I suspect the performance will be almost identical between hand written code and this example.
And what about iadd instruction? If it also will be removed, I'm ok with this code, but if not...
iadd (the shuffle instruction from LLVM-IR below) should be removed by the shader compiler (finalizer).
LLVM-IR
%tmp4 = lshr <2 x i32> %tmp2, <i32 7, i32 7>
%tmp6 = shl <2 x i32> %tmp2, <i32 25, i32 25>
%tmp16 = shufflevector <2 x i32> %tmp4, <2 x i32> undef, <2 x i32> <i32 1, i32 0>
%tmp17 = or <2 x i32> %tmp6, %tmp16
The shader compiler (SC) has more knowledge about HW capabilities. So it was decided that for better performance the final optimization will be done at the SC level, when the actual ISA is generated.
Tahiti ISA
tbuffer_load_format_xy v[1:2], v0, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT] // 00000030: EBD91000 80010100
s_waitcnt vmcnt(0) // 00000038: BF8C1F70
v_lshrrev_b32 v3, 7, v1 // 0000003C: 2C060287
v_lshrrev_b32 v4, 7, v2 // 00000040: 2C080487
v_lshlrev_b32 v1, 25, v1 // 00000044: 34020299
v_lshlrev_b32 v2, 25, v2 // 00000048: 34040499
v_or_b32 v1, v4, v1 // 0000004C: 38020304
v_or_b32 v2, v3, v2 // 00000050: 38040503
tbuffer_store_format_xy v[1:2], v0, s[4:7], 0 offen format:[BUF_DATA_FORMAT_32_32,BUF_NUM_FORMAT_FLOAT] // 00000054: EBDD1000 80010100
Cypress ISA
01 TEX: ADDR(48) CNT(1)
7 VFETCH R0.xy__, R0.z, fc174 FORMAT(32_32_FLOAT) MEGA(8)
FETCH_TYPE(NO_INDEX_OFFSET)
02 ALU: ADDR(41) CNT(7)
8 x: LSHR ____, R0.y, 7
y: LSHR ____, R0.x, 7
z: LSHL ____, R0.y, 25
w: LSHL ____, R0.x, 25
9 x: OR_INT R0.x, PV8.x, PV8.w
y: OR_INT R0.y, PV8.y, PV8.z
03 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R1].xy__, R0, ARRAY_SIZE(4) MARK VPM
Thanks, German!
Now I can see there is no additional instructions in ISA.
I wish we would ever have more control over what instructions exactly would be executed on GPU. Like assembler for x86.