10 Replies Latest reply on May 1, 2011 7:14 PM by omion

    Failed optimization ?

    hazeman
      restrict keyword ( or -fno-alias ) has no impact on ISA

      Restrict keyword is supposed to enable cached reads from global memory.

      But lets look at the attached kernel. Whether we use rescrict keyword or not there is no difference in produced ISA code. At the moment the only difference is that IL code has "_cached" added to uavs but it has no real impact on kernel execution.

      Why kernel with restrict keyword does have to wait for write completion ( WAIT_ACK: Outstanding acks<=0 ). Is it failed optimization ? Can we expect this fixed someday or is it required by hardware ( = failed hardware design ) ?

      PS. Compiled with driver  11.3, 11.4-preview. The problem is visible in every kernel I've tested.

      PS2. -fno-alias gives the same results.

      __kernel void test( __global float4 const* restrict in , __global float4* out ) { unsigned int gid = get_global_id(0); out[2*gid] = in[gid]; out[gid] = in[2*gid]; } ;-------- Disassembly -------------------- 00 ALU: ADDR(32) CNT(17) KCACHE0(CB0:0-15) KCACHE1(CB1:0-15) 0 x: MULLO_INT ____, R1.x, KC0[1].x y: MULLO_INT ____, R1.x, KC0[1].x z: MULLO_INT ____, R1.x, KC0[1].x w: MULLO_INT R0.w, R1.x, KC0[1].x 1 z: ADD_INT R0.z, R0.x, PV0.w 2 y: ADD_INT R0.y, PV1.z, KC0[6].x 3 x: LSHL R0.x, PV2.y, (0x00000005, 7.006492322e-45f).x z: LSHL R0.z, PV2.y, (0x00000004, 5.605193857e-45f).y 4 x: ADD_INT R0.x, KC1[0].x, PV3.x y: ADD_INT R0.y, KC1[0].x, PV3.z z: ADD_INT R0.z, KC1[1].x, PV3.z w: ADD_INT R0.w, KC1[1].x, PV3.x 5 x: LSHR R0.x, PV4.w, (0x00000002, 2.802596929e-45f).x y: LSHR R0.y, PV4.x, (0x00000004, 5.605193857e-45f).y w: LSHR R0.w, PV4.y, (0x00000004, 5.605193857e-45f).y 01 TEX: ADDR(64) CNT(1) 6 VFETCH R1, R0.w, fc175 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET) 02 MEM_RAT_CACHELESS_STORE_DWORD__NI_ACK: RAT(11)[R0], R1, ARRAY_SIZE(4) MARK VPM 03 WAIT_ACK: Outstanding_acks <= 0 04 TEX: ADDR(66) CNT(1) 7 VFETCH R1, R0.y, fc175 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET) 05 ALU: ADDR(49) CNT(2) 8 x: LSHR R0.x, R0.z, (0x00000002, 2.802596929e-45f).x 06 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(11)[R0], R1, ARRAY_SIZE(4) MARK VPM 07 END END_OF_PROGRAM

        • Failed optimization ?
          nou

          it should be const restrict. not only restrict. and i -no-alias works.

            • Failed optimization ?
              hazeman

               

              Originally posted by: nou it should be const restrict. not only restrict. and i -no-alias works.


              Next time please don't respond if you didn't read the thread. It's called trolling.

              If you would read attached source you would notice the const keyword is present ! I know that opencl -> IL compiler uses this flag . It adds "_cached" to all required uavs. But IL->ISA compiler failes to use it properly and in effect we have the same code with no optimization !!!

                • Failed optimization ?
                  nou

                  ok sorry i overloked it. but i tryed patch Luxrays to use -fno-alias and it result into 5% performace improvment.

                    • Failed optimization ?
                      hazeman

                       

                      Originally posted by: nou ok sorry i overloked it. but i tryed patch Luxrays to use -fno-alias and it result into 5% performace improvment.

                       

                      OpenCL->IL compiler does some optimizations with restricted keyword. Like in the kernel

                      __kernel void test( __global float4 const* restrict in , __global float4* out 
                      {
                          unsigned int gid = get_global_id(0);

                          out[gid] = in[gid];
                          out[2*gid]   = in[gid];
                      }

                      it would read in[gid] only once and reuse this value. But it isn't "cached reads".

                • Failed optimization ?
                  MicahVillmow
                  hazeman,
                  You are getting caching on your read.
                  The difference between a cached and uncached read is the fetch constant 175 versus fc173.
                  This is a cached read:
                  01 TEX: ADDR(64) CNT(1)
                  6 VFETCH R1, R0.w, fc175 FORMAT(32_32_32_32_FLOAT)
                  FETCH_TYPE(NO_INDEX_OFFSET)

                  This is an uncached read:
                  01 TEX: ADDR(64) CNT(1)
                  6 VFETCH R1, R0.w, fc173 FORMAT(32_32_32_32_FLOAT)
                  FETCH_TYPE(NO_INDEX_OFFSET)


                  Also, this is what I get with internal compiler(i.e. next Catalyst release):
                  ; -------- Disassembly --------------------
                  00 ALU: ADDR(32) CNT(15) KCACHE0(CB0:0-15) KCACHE1(CB1:0-15)
                  0 x: MULLO_INT ____, R1.x, KC0[1].x
                  y: MULLO_INT R0.y, R1.x, KC0[1].x
                  z: MULLO_INT ____, R1.x, KC0[1].x
                  w: MULLO_INT ____, R1.x, KC0[1].x
                  1 x: ADD_INT R1.x, R0.x, PV0.y
                  2 w: ADD_INT R0.w, PV1.x, KC0[6].x
                  3 x: LSHL R0.x, PV2.w, (0x00000005, 7.006492322e-45f).x
                  z: LSHL R0.z, PV2.w, (0x00000004, 5.605193857e-45f).y
                  4 x: ADD_INT R0.x, KC1[0].x, PV3.z
                  y: ADD_INT R0.y, KC1[1].x, PV3.z
                  z: ADD_INT R0.z, KC1[1].x, PV3.x
                  5 x: LSHR R0.x, PV4.y, (0x00000002, 2.802596929e-45f).x
                  w: LSHR R0.w, PV4.x, (0x00000004, 5.605193857e-45f).y
                  01 TEX: ADDR(64) CNT(1)
                  6 VFETCH R1, R0.w, fc175 FORMAT(32_32_32_32_FLOAT)
                  FETCH_TYPE(NO_INDEX_OFFSET)
                  02 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(11)[R0], R1, ARRAY_SIZE(4) MARK VPM
                  03 ALU: ADDR(47) CNT(2)
                  7 x: LSHR R0.x, R0.z, (0x00000002, 2.802596929e-45f).x
                  04 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(11)[R0], R1, ARRAY_SIZE(4) MARK VPM
                  05 END
                  END_OF_PROGRAM


                  The reason is that there is a bug in 2.4 which wasn't fix in time for release, so expect it with a driver update.
                    • Failed optimization ?
                      hazeman

                      Thanks Micah for answer. But i think you posted disassembly of second kernel ( with only 1 read ).

                      If you could please post result of compilation of first kernel ( out[2*gid] = in[gid]; out[gid] = in[2*gid]; ). I hope new driver doesn't generate WAIT_ACK there ( it's really killing performance even if the reads use cache ).

                      • Failed optimization ?
                        omion
                        @Micah

                        I have a function which has a number of different fetch constants. Is there any list of which constants represent caching? I have seen fc130, fc131, fc153, and I think a few others.
                      • Failed optimization ?
                        MicahVillmow
                        hazeman,
                        The ack still exists, this is outside of caching issue. The problem is a software limitation that isn't expected to be fixed anytime soon. A workaround is to not interleave reads and writes. This code does not have any ack's.
                        __kernel void test( __global float4 const* restrict in , __global float4* out)
                        {
                        unsigned int gid = get_global_id(0);
                        float4 a = in[gid];
                        float4 b = in[2*gid];
                        out[gid] = a;
                        out[2*gid] = b;
                        }
                          • Failed optimization ?
                            hazeman

                            You know "fixing" example which was designed only to show the problem is quite easy ...

                            But in real world usually you can't move writes to the end of kernel. For example, one of my kernels multiplies 1 matrix with >200 vectors. WAIT_ACK simply kills the performance ( >10x slower than version using TU ). And this kind of bugs really make working with AMD cards frustrating.

                            With the state of OpenCL implementation I really wonder why AMD/ATI thinks of dropping CAL support. The only efficient kernels are written on top of CAL. AMD OpenCL is simply useless now for any high performance code.

                             

                          • Failed optimization ?
                            MicahVillmow
                            hazeman,
                            I understand the frustration. The problem is that AMD IL has a maximum of 12 UAV's and OpenCL allows pretty much unlimited pointers, and the mapping for that causes information to be lost in the translation between OpenCL to IL. The reason why you see the ACK's is because both pointers go to the same UAV and the shader compiler does not know that they point to different memory objects. This is why the 'ack's are introduced because the compiler has to keep memory coherent. There are workarounds for this problem for specific cases, but not a general solution, as it is a fundamental issue with the AMDIL language.