stgatilov

Unnecessary (LSHR x, y, 2) instructions for 32-bit global memory reads

Discussion created by stgatilov on Oct 25, 2011

The OpenCL kernel compiler generates one LSHR instruction per 32-bit memory load, which seems to be unnecessary. It is almost never optimized out of the code. The attached code shows the redundant operations clearly. Ideally, a single LSHR operation should be enough for the main1 kernel.

As far as we understand, the VFETCH instruction takes its address not in bytes but in 32-bit words. That's why the (LSHR ?, ?, 2) instruction is generated. However, it is often known that the buffer address is aligned to 4 bytes. So instead of dividing the address of each element by 4, the compiler could divide the buffer address by 4 once and then add the index.

It seems that the compiler does not perform this optimization because it wants to compute bits 30 and 31 of the word address precisely. For example:

Let A = 0xFFFFFFFC, i = 1;

Then:

   word_addr(A + 4*i) = (0xFFFFFFFC + 4*1) / 4 = 0x00000000 / 4 = 0   (the byte address wraps around in 32 bits)

   word_addr(A) + i = (0xFFFFFFFC / 4) + 1 = 0x3FFFFFFF + 1 = 0x40000000

So the two forms of the resulting word address can differ in bits 30 and 31, and the compiler cannot perform the optimization.

However, the GPU global address space is 32-bit, and there is no more than 4 GB of memory available. So bits 30 and 31 of the word address passed to the VFETCH instruction can be ignored in hardware (and I suppose they are ignored now). So there is no need to preserve those high-order bits.

The original discussion on this matter can be seen on the TopCoder forums:

http://apps.topcoder.com/forums/?module=Thread&threadID=724338&start=0

(AMD APP 2.5, Radeon 6xxx series)



__kernel void main1(__global float *I, __global float *O) {
    __private float Buffer[4];
    for (int n = 0; n != 4; n++)
        Buffer[n] = I[n + 1];
    for (int n = 0; n != 4; n++)
        O[n] = Buffer[n];
}

__kernel void main2(__global float *I, __global float *O) {
    __private float Buffer[4];
    for (int n = 0; n != 4; n++) {
        volatile uint Address = ((uint)I >> 2) + n + 1;
        Buffer[n] = *(__global float*)(Address << 2);
    }
    for (int n = 0; n != 4; n++)
        O[n] = Buffer[n];
}

/*
; -------- Disassembly --------------------
00 ALU: ADDR(32) CNT(12) KCACHE0(CB1:0-15)
      0  x: ADD_INT     ____,  KC0[0].x,  4
         y: ADD_INT     ____,  KC0[0].x,  16
         z: ADD_INT     ____,  KC0[0].x,  12
         w: ADD_INT     ____,  KC0[0].x,  8
      1  x: LSHR        R0.x,  PV0.x,  2
         y: LSHR        R0.y,  PV0.z,  2
         z: LSHR        R0.z,  PV0.w,  2
         w: LSHR        R0.w,  PV0.y,  2
         t: LSHR        R4.x,  KC0[1].x,  2
01 TEX: ADDR(48) CNT(4)
      2  VFETCH R3.x___, R0.w, fc153  MEGA(4)  FETCH_TYPE(NO_INDEX_OFFSET)
      3  VFETCH R2.x___, R0.y, fc153  MEGA(4)  FETCH_TYPE(NO_INDEX_OFFSET)
      4  VFETCH R1.x___, R0.z, fc153  MEGA(4)  FETCH_TYPE(NO_INDEX_OFFSET)
      5  VFETCH R0.x___, R0.x, fc153  MEGA(4)  FETCH_TYPE(NO_INDEX_OFFSET)
02 ALU: ADDR(44) CNT(3)
      6  y: MOV         R0.y,  R1.x
         z: MOV         R0.z,  R2.x    VEC_120
         w: MOV         R0.w,  R3.x    VEC_201
03 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R4], R0, ARRAY_SIZE(4) MARK VPM
END_OF_PROGRAM

; -------- Disassembly --------------------
00 ALU: ADDR(32) CNT(14) KCACHE0(CB1:0-15)
      0  x: LSHR        R4.x,  KC0[1].x,  2
         z: LSHR        ____,  KC0[0].x,  2
      1  x: ADD_INT     ____,  PV0.z,  1
         y: ADD_INT     ____,  PV0.z,  4
         z: ADD_INT     ____,  PV0.z,  2
         w: ADD_INT     ____,  PV0.z,  3
      2  x: AND_INT     R0.x,  PV1.y,  0x3FFFFFFF
         y: AND_INT     R0.y,  PV1.z,  0x3FFFFFFF
         z: AND_INT     R0.z,  PV1.w,  0x3FFFFFFF
         w: AND_INT     R0.w,  PV1.x,  0x3FFFFFFF
01 TEX: ADDR(64) CNT(4)
      3  VFETCH R1.x___, R0.y, fc153  MEGA(4)  FETCH_TYPE(NO_INDEX_OFFSET)
      4  VFETCH R2.x___, R0.z, fc153  MEGA(4)  FETCH_TYPE(NO_INDEX_OFFSET)
      5  VFETCH R0.x___, R0.x, fc153  MEGA(4)  FETCH_TYPE(NO_INDEX_OFFSET)
      6  VFETCH R3.x___, R0.w, fc153  MEGA(4)  FETCH_TYPE(NO_INDEX_OFFSET)
02 ALU: ADDR(46) CNT(3)
      7  y: MOV         R3.y,  R1.x
         z: MOV         R3.z,  R2.x    VEC_120
         w: MOV         R3.w,  R0.x    VEC_201
03 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R4], R3, ARRAY_SIZE(4) MARK VPM
END_OF_PROGRAM
*/
