
    Unnecessary (LSHR x, y, 2) instructions for 32-bit global memory reads

    stgatilov

      The OpenCL kernel compiler generates one LSHR instruction for each 32-bit global memory load, and this instruction appears to be unnecessary: it is almost never optimized out of the code. The attached code shows the redundant operations clearly. Ideally, a single LSHR operation should be enough for the whole main1 kernel.

      As far as we understand, the VFETCH instruction takes its address not in bytes but in 32-bit words; that is why the (LSHR ?, ?, 2) instruction is generated. However, the buffer address is usually known to be 4-byte aligned, so instead of dividing the byte address of every element by 4, the compiler could divide the base address of the buffer by 4 once and then add the element index, as sketched below.
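
      To make the proposed transformation concrete, here is a minimal plain-C sketch (not from the original post; the function names are hypothetical) of the two addressing forms. They agree for every in-bounds index as long as the base is 4-byte aligned and the byte address does not wrap past 2^32:

      #include <stdint.h>

      /* Word address computed per element, as the compiler emits it now:
         one LSHR for every 32-bit load. */
      static uint32_t word_addr_per_element(uint32_t base, uint32_t i) {
          return (base + 4u * i) >> 2;
      }

      /* Proposed form: shift the 4-byte-aligned base once, then add the index.
         Matches the form above provided base % 4 == 0 and base + 4*i does not
         wrap around 2^32. */
      static uint32_t word_addr_base_plus_index(uint32_t base, uint32_t i) {
          return (base >> 2) + i;
      }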

      It seems that the compiler does not perform this optimization because it wants to compute bits 30 and 31 of the word address precisely. For example:

      Let A = 0xFFFFFFFC, i = 1;

      Then:

         word_addr(A + 4*i) = (0xFFFFFFFC + 4*1) / 4 = 0x00000000   (the 32-bit byte address wraps around)

         word_addr(A) + i = (0xFFFFFFFC / 4) + 1 = 0x40000000

      So the resulting word addresses can differ in bits 30 and 31, and the compiler cannot prove the transformation is safe; a small host-side check of this example is sketched below.
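
      For completeness, here is a small plain-C check (not part of the original post) that reproduces the example above, and also shows that the two results agree once bits 30 and 31 are masked off — which is exactly what the AND_INT with 0x3FFFFFFF does in the main2 disassembly below:

      #include <stdint.h>
      #include <stdio.h>

      int main(void) {
          uint32_t A = 0xFFFFFFFCu;  /* 4-byte-aligned base address */
          uint32_t i = 1u;

          /* Exact word address: the 32-bit byte address wraps to 0. */
          uint32_t exact = (uint32_t)(A + 4u * i) >> 2;   /* 0x00000000 */

          /* Reassociated form: shift the base once, then add the index. */
          uint32_t reassoc = (A >> 2) + i;                /* 0x40000000 */

          printf("exact = 0x%08X, reassociated = 0x%08X\n", exact, reassoc);
          printf("equal after masking bits 30-31: %s\n",
                 ((exact & 0x3FFFFFFFu) == (reassoc & 0x3FFFFFFFu)) ? "yes" : "no");
          return 0;
      }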

      However, the GPU global address space is 32-bit and no more than 4 GB of memory is available. So bits 30 and 31 of the word address passed to the VFETCH instruction can be ignored in hardware (and I suppose they already are), and there is no need to preserve those high-order bits.

      The original discussion on this matter can be seen on the TopCoder forums:

      http://apps.topcoder.com/forums/?module=Thread&threadID=724338&start=0

      (AMD APP 2.5, Radeon 6xxx series)



      __kernel void main1(__global float *I, __global float *O) {
          __private float Buffer[4];
          for (int n = 0; n != 4; n++)
              Buffer[n] = I[n + 1];
          for (int n = 0; n != 4; n++)
              O[n] = Buffer[n];
      }

      __kernel void main2(__global float *I, __global float *O) {
          __private float Buffer[4];
          for (int n = 0; n != 4; n++) {
              volatile uint Address = ((uint)I >> 2) + n + 1;
              Buffer[n] = *(__global float*)(Address << 2);
          }
          for (int n = 0; n != 4; n++)
              O[n] = Buffer[n];
      }

      /*
      ; -------- Disassembly (main1) --------------------
      00 ALU: ADDR(32) CNT(12) KCACHE0(CB1:0-15)
            0  x: ADD_INT     ____, KC0[0].x, 4
               y: ADD_INT     ____, KC0[0].x, 16
               z: ADD_INT     ____, KC0[0].x, 12
               w: ADD_INT     ____, KC0[0].x, 8
            1  x: LSHR        R0.x, PV0.x, 2
               y: LSHR        R0.y, PV0.z, 2
               z: LSHR        R0.z, PV0.w, 2
               w: LSHR        R0.w, PV0.y, 2
               t: LSHR        R4.x, KC0[1].x, 2
      01 TEX: ADDR(48) CNT(4)
            2  VFETCH R3.x___, R0.w, fc153 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET)
            3  VFETCH R2.x___, R0.y, fc153 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET)
            4  VFETCH R1.x___, R0.z, fc153 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET)
            5  VFETCH R0.x___, R0.x, fc153 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET)
      02 ALU: ADDR(44) CNT(3)
            6  y: MOV         R0.y, R1.x
               z: MOV         R0.z, R2.x   VEC_120
               w: MOV         R0.w, R3.x   VEC_201
      03 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R4], R0, ARRAY_SIZE(4) MARK VPM
      END_OF_PROGRAM

      ; -------- Disassembly (main2) --------------------
      00 ALU: ADDR(32) CNT(14) KCACHE0(CB1:0-15)
            0  x: LSHR        R4.x, KC0[1].x, 2
               z: LSHR        ____, KC0[0].x, 2
            1  x: ADD_INT     ____, PV0.z, 1
               y: ADD_INT     ____, PV0.z, 4
               z: ADD_INT     ____, PV0.z, 2
               w: ADD_INT     ____, PV0.z, 3
            2  x: AND_INT     R0.x, PV1.y, 0x3FFFFFFF
               y: AND_INT     R0.y, PV1.z, 0x3FFFFFFF
               z: AND_INT     R0.z, PV1.w, 0x3FFFFFFF
               w: AND_INT     R0.w, PV1.x, 0x3FFFFFFF
      01 TEX: ADDR(64) CNT(4)
            3  VFETCH R1.x___, R0.y, fc153 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET)
            4  VFETCH R2.x___, R0.z, fc153 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET)
            5  VFETCH R0.x___, R0.x, fc153 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET)
            6  VFETCH R3.x___, R0.w, fc153 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET)
      02 ALU: ADDR(46) CNT(3)
            7  y: MOV         R3.y, R1.x
               z: MOV         R3.z, R2.x   VEC_120
               w: MOV         R3.w, R0.x   VEC_201
      03 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R4], R3, ARRAY_SIZE(4) MARK VPM
      END_OF_PROGRAM
      */
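
      In case it helps with reproduction, here is a minimal host-side sketch (not from the original post; error checking omitted) that builds the main1 kernel and dumps the compiled program binary with the standard clGetProgramInfo query, so it can be inspected offline (for example with the AMD APP tools):

      #include <stdio.h>
      #include <stdlib.h>
      #include <CL/cl.h>

      static const char *kernel_source =
          "__kernel void main1(__global float *I, __global float *O) {\n"
          "    __private float Buffer[4];\n"
          "    for (int n = 0; n != 4; n++) Buffer[n] = I[n + 1];\n"
          "    for (int n = 0; n != 4; n++) O[n] = Buffer[n];\n"
          "}\n";

      int main(void) {
          cl_platform_id platform;
          cl_device_id device;
          clGetPlatformIDs(1, &platform, NULL);
          clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
          cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

          cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_source, NULL, NULL);
          clBuildProgram(prog, 1, &device, "", NULL, NULL);

          /* Query the size of the binary for the single device, then fetch it. */
          size_t size = 0;
          clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES, sizeof(size), &size, NULL);
          unsigned char *binary = malloc(size);
          clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof(binary), &binary, NULL);

          FILE *f = fopen("program.bin", "wb");
          fwrite(binary, 1, size, f);
          fclose(f);
          free(binary);
          return 0;
      }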