cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

hazeman
Adept II

Failed optimization ?

restrict keyword ( or -fno-alias ) has no impact on ISA

Restrict keyword is supposed to enable cached reads from global memory.

But lets look at the attached kernel. Whether we use rescrict keyword or not there is no difference in produced ISA code. At the moment the only difference is that IL code has "_cached" added to uavs but it has no real impact on kernel execution.

Why kernel with restrict keyword does have to wait for write completion ( WAIT_ACK: Outstanding acks<=0 ). Is it failed optimization ? Can we expect this fixed someday or is it required by hardware ( = failed hardware design ) ?

PS. Compiled with driver  11.3, 11.4-preview. The problem is visible in every kernel I've tested.

PS2. -fno-alias gives the same results.

__kernel void test( __global float4 const* restrict in , __global float4* out ) { unsigned int gid = get_global_id(0); out[2*gid] = in[gid]; out[gid] = in[2*gid]; } ;-------- Disassembly -------------------- 00 ALU: ADDR(32) CNT(17) KCACHE0(CB0:0-15) KCACHE1(CB1:0-15) 0 x: MULLO_INT ____, R1.x, KC0[1].x y: MULLO_INT ____, R1.x, KC0[1].x z: MULLO_INT ____, R1.x, KC0[1].x w: MULLO_INT R0.w, R1.x, KC0[1].x 1 z: ADD_INT R0.z, R0.x, PV0.w 2 y: ADD_INT R0.y, PV1.z, KC0[6].x 3 x: LSHL R0.x, PV2.y, (0x00000005, 7.006492322e-45f).x z: LSHL R0.z, PV2.y, (0x00000004, 5.605193857e-45f).y 4 x: ADD_INT R0.x, KC1[0].x, PV3.x y: ADD_INT R0.y, KC1[0].x, PV3.z z: ADD_INT R0.z, KC1[1].x, PV3.z w: ADD_INT R0.w, KC1[1].x, PV3.x 5 x: LSHR R0.x, PV4.w, (0x00000002, 2.802596929e-45f).x y: LSHR R0.y, PV4.x, (0x00000004, 5.605193857e-45f).y w: LSHR R0.w, PV4.y, (0x00000004, 5.605193857e-45f).y 01 TEX: ADDR(64) CNT(1) 6 VFETCH R1, R0.w, fc175 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET) 02 MEM_RAT_CACHELESS_STORE_DWORD__NI_ACK: RAT(11)[R0], R1, ARRAY_SIZE(4) MARK VPM 03 WAIT_ACK: Outstanding_acks <= 0 04 TEX: ADDR(66) CNT(1) 7 VFETCH R1, R0.y, fc175 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET) 05 ALU: ADDR(49) CNT(2) 8 x: LSHR R0.x, R0.z, (0x00000002, 2.802596929e-45f).x 06 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(11)[R0], R1, ARRAY_SIZE(4) MARK VPM 07 END END_OF_PROGRAM

Tags (1)
0 Likes
10 Replies
nou
Exemplar

Failed optimization ?

it should be const restrict. not only restrict. and i -no-alias works.

0 Likes
hazeman
Adept II

Failed optimization ?

Originally posted by: nou it should be const restrict. not only restrict. and i -no-alias works.


Next time please don't respond if you didn't read the thread. It's called trolling.

If you would read attached source you would notice the const keyword is present ! I know that opencl -> IL compiler uses this flag . It adds "_cached" to all required uavs. But IL->ISA compiler failes to use it properly and in effect we have the same code with no optimization !!!

0 Likes
nou
Exemplar

Failed optimization ?

ok sorry i overloked it. but i tryed patch Luxrays to use -fno-alias and it result into 5% performace improvment.

0 Likes
hazeman
Adept II

Failed optimization ?

Originally posted by: nou ok sorry i overloked it. but i tryed patch Luxrays to use -fno-alias and it result into 5% performace improvment.

 

OpenCL->IL compiler does some optimizations with restricted keyword. Like in the kernel

__kernel void test( __global float4 const* restrict in , __global float4* out 
{
    unsigned int gid = get_global_id(0);

    out[gid] = in[gid];
    out[2*gid]   = in[gid];
}

it would read in[gid] only once and reuse this value. But it isn't "cached reads".

0 Likes
MicahVillmow
Staff
Staff

Failed optimization ?

hazeman,
You are getting caching on your read.
The difference between a cached and uncached read is the fetch constant 175 versus fc173.
This is a cached read:
01 TEX: ADDR(64) CNT(1)
6 VFETCH R1, R0.w, fc175 FORMAT(32_32_32_32_FLOAT)
FETCH_TYPE(NO_INDEX_OFFSET)

This is an uncached read:
01 TEX: ADDR(64) CNT(1)
6 VFETCH R1, R0.w, fc173 FORMAT(32_32_32_32_FLOAT)
FETCH_TYPE(NO_INDEX_OFFSET)


Also, this is what I get with internal compiler(i.e. next Catalyst release):
; -------- Disassembly --------------------
00 ALU: ADDR(32) CNT(15) KCACHE0(CB0:0-15) KCACHE1(CB1:0-15)
0 x: MULLO_INT ____, R1.x, KC0[1].x
y: MULLO_INT R0.y, R1.x, KC0[1].x
z: MULLO_INT ____, R1.x, KC0[1].x
w: MULLO_INT ____, R1.x, KC0[1].x
1 x: ADD_INT R1.x, R0.x, PV0.y
2 w: ADD_INT R0.w, PV1.x, KC0[6].x
3 x: LSHL R0.x, PV2.w, (0x00000005, 7.006492322e-45f).x
z: LSHL R0.z, PV2.w, (0x00000004, 5.605193857e-45f).y
4 x: ADD_INT R0.x, KC1[0].x, PV3.z
y: ADD_INT R0.y, KC1[1].x, PV3.z
z: ADD_INT R0.z, KC1[1].x, PV3.x
5 x: LSHR R0.x, PV4.y, (0x00000002, 2.802596929e-45f).x
w: LSHR R0.w, PV4.x, (0x00000004, 5.605193857e-45f).y
01 TEX: ADDR(64) CNT(1)
6 VFETCH R1, R0.w, fc175 FORMAT(32_32_32_32_FLOAT)
FETCH_TYPE(NO_INDEX_OFFSET)
02 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(11)[R0], R1, ARRAY_SIZE(4) MARK VPM
03 ALU: ADDR(47) CNT(2)
7 x: LSHR R0.x, R0.z, (0x00000002, 2.802596929e-45f).x
04 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(11)[R0], R1, ARRAY_SIZE(4) MARK VPM
05 END
END_OF_PROGRAM


The reason is that there is a bug in 2.4 which wasn't fix in time for release, so expect it with a driver update.
0 Likes
hazeman
Adept II

Failed optimization ?

Thanks Micah for answer. But i think you posted disassembly of second kernel ( with only 1 read ).

If you could please post result of compilation of first kernel ( out[2*gid] = in[gid]; out[gid] = in[2*gid]; ). I hope new driver doesn't generate WAIT_ACK there ( it's really killing performance even if the reads use cache ).

0 Likes
MicahVillmow
Staff
Staff

Failed optimization ?

hazeman,
The ack still exists, this is outside of caching issue. The problem is a software limitation that isn't expected to be fixed anytime soon. A workaround is to not interleave reads and writes. This code does not have any ack's.
__kernel void test( __global float4 const* restrict in , __global float4* out)
{
unsigned int gid = get_global_id(0);
float4 a = in[gid];
float4 b = in[2*gid];
out[gid] = a;
out[2*gid] = b;
}
0 Likes
hazeman
Adept II

Failed optimization ?

You know "fixing" example which was designed only to show the problem is quite easy ...

But in real world usually you can't move writes to the end of kernel. For example, one of my kernels multiplies 1 matrix with >200 vectors. WAIT_ACK simply kills the performance ( >10x slower than version using TU ). And this kind of bugs really make working with AMD cards frustrating.

With the state of OpenCL implementation I really wonder why AMD/ATI thinks of dropping CAL support. The only efficient kernels are written on top of CAL. AMD OpenCL is simply useless now for any high performance code.

 

0 Likes
MicahVillmow
Staff
Staff

Failed optimization ?

hazeman,
I understand the frustration. The problem is that AMD IL has a maximum of 12 UAV's and OpenCL allows pretty much unlimited pointers, and the mapping for that causes information to be lost in the translation between OpenCL to IL. The reason why you see the ACK's is because both pointers go to the same UAV and the shader compiler does not know that they point to different memory objects. This is why the 'ack's are introduced because the compiler has to keep memory coherent. There are workarounds for this problem for specific cases, but not a general solution, as it is a fundamental issue with the AMDIL language.
0 Likes