Restrict keyword is supposed to enable cached reads from global memory.
But lets look at the attached kernel. Whether we use rescrict keyword or not there is no difference in produced ISA code. At the moment the only difference is that IL code has "_cached" added to uavs but it has no real impact on kernel execution.
Why kernel with restrict keyword does have to wait for write completion ( WAIT_ACK: Outstanding acks<=0 ). Is it failed optimization ? Can we expect this fixed someday or is it required by hardware ( = failed hardware design ) ?
PS. Compiled with driver 11.3, 11.4-preview. The problem is visible in every kernel I've tested.
PS2. -fno-alias gives the same results.
__kernel void test( __global float4 const* restrict in , __global float4* out ) { unsigned int gid = get_global_id(0); out[2*gid] = in[gid]; out[gid] = in[2*gid]; } ;-------- Disassembly -------------------- 00 ALU: ADDR(32) CNT(17) KCACHE0(CB0:0-15) KCACHE1(CB1:0-15) 0 x: MULLO_INT ____, R1.x, KC0[1].x y: MULLO_INT ____, R1.x, KC0[1].x z: MULLO_INT ____, R1.x, KC0[1].x w: MULLO_INT R0.w, R1.x, KC0[1].x 1 z: ADD_INT R0.z, R0.x, PV0.w 2 y: ADD_INT R0.y, PV1.z, KC0[6].x 3 x: LSHL R0.x, PV2.y, (0x00000005, 7.006492322e-45f).x z: LSHL R0.z, PV2.y, (0x00000004, 5.605193857e-45f).y 4 x: ADD_INT R0.x, KC1[0].x, PV3.x y: ADD_INT R0.y, KC1[0].x, PV3.z z: ADD_INT R0.z, KC1[1].x, PV3.z w: ADD_INT R0.w, KC1[1].x, PV3.x 5 x: LSHR R0.x, PV4.w, (0x00000002, 2.802596929e-45f).x y: LSHR R0.y, PV4.x, (0x00000004, 5.605193857e-45f).y w: LSHR R0.w, PV4.y, (0x00000004, 5.605193857e-45f).y 01 TEX: ADDR(64) CNT(1) 6 VFETCH R1, R0.w, fc175 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET) 02 MEM_RAT_CACHELESS_STORE_DWORD__NI_ACK: RAT(11)[R0], R1, ARRAY_SIZE(4) MARK VPM 03 WAIT_ACK: Outstanding_acks <= 0 04 TEX: ADDR(66) CNT(1) 7 VFETCH R1, R0.y, fc175 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET) 05 ALU: ADDR(49) CNT(2) 8 x: LSHR R0.x, R0.z, (0x00000002, 2.802596929e-45f).x 06 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(11)[R0], R1, ARRAY_SIZE(4) MARK VPM 07 END END_OF_PROGRAM
it should be const restrict. not only restrict. and i -no-alias works.
Originally posted by: nou it should be const restrict. not only restrict. and i -no-alias works.
Next time please don't respond if you didn't read the thread. It's called trolling.
If you would read attached source you would notice the const keyword is present ! I know that opencl -> IL compiler uses this flag . It adds "_cached" to all required uavs. But IL->ISA compiler failes to use it properly and in effect we have the same code with no optimization !!!
ok sorry i overloked it. but i tryed patch Luxrays to use -fno-alias and it result into 5% performace improvment.
Originally posted by: nou ok sorry i overloked it. but i tryed patch Luxrays to use -fno-alias and it result into 5% performace improvment.
OpenCL->IL compiler does some optimizations with restricted keyword. Like in the kernel
__kernel void test( __global float4 const* restrict in , __global float4* out
{
unsigned int gid = get_global_id(0);
out[gid] = in[gid];
out[2*gid] = in[gid];
}
it would read in[gid] only once and reuse this value. But it isn't "cached reads".
Thanks Micah for answer. But i think you posted disassembly of second kernel ( with only 1 read ).
If you could please post result of compilation of first kernel ( out[2*gid] = in[gid]; out[gid] = in[2*gid]; ). I hope new driver doesn't generate WAIT_ACK there ( it's really killing performance even if the reads use cache ).
You know "fixing" example which was designed only to show the problem is quite easy ...
But in real world usually you can't move writes to the end of kernel. For example, one of my kernels multiplies 1 matrix with >200 vectors. WAIT_ACK simply kills the performance ( >10x slower than version using TU ). And this kind of bugs really make working with AMD cards frustrating.
With the state of OpenCL implementation I really wonder why AMD/ATI thinks of dropping CAL support. The only efficient kernels are written on top of CAL. AMD OpenCL is simply useless now for any high performance code.