Higher-end GPUs have 32 banks for LDS memory, meaning one work item can read or write 8 bytes of data instead of 4, basically doubling LDS bandwidth. But the current OpenCL compiler makes it hard to take advantage of this feature. One would think that accessing a float2 array in local memory would make the compiler generate LDS_READ2_RET/LDS_WRITE_REL instructions, but most of the time it generates two LDS_READ_RET/LDS_WRITE instructions instead.
For example, the piece of code below resulted in one LDS_WRITE_REL and 4 LDS_WRITE instructions. So instead of 3 store instructions I got 5, almost halving the maximum bandwidth. The same happens when I read from local memory: some 8-byte read instructions are split into two 4-byte read instructions.
__local float2 localBuf;
localBuf[localId] = (float2)(vMin.x, vMax.x);
localBuf[localId] = (float2)(vMin.y, vMax.y);
localBuf[localId] = (float2)(vMin.z, vMax.z);
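To make the instruction count explicit, here is a small sketch in plain C (not OpenCL) of the arithmetic behind "5 instead of 3". The function name and its parameters are made up for illustration. It only assumes what is described above: a float2 store either stays as one 8-byte instruction or gets split into two 4-byte instructions.

```c
#include <assert.h>

/* Hypothetical helper, for illustration only.
 * n_stores: number of float2 (8-byte) stores in the source code.
 * n_fused:  how many of them the compiler emitted as a single 8-byte
 *           instruction; the remaining ones are each split into two
 *           4-byte instructions.
 * Returns the total number of LDS store instructions emitted. */
int lds_store_instructions(int n_stores, int n_fused)
{
    assert(n_fused >= 0 && n_fused <= n_stores);
    return n_fused + 2 * (n_stores - n_fused);
}
```

With 3 float2 stores of which only 1 stays fused, this gives 1 + 2 * 2 = 5 instructions for 24 bytes, so 4.8 bytes per instruction on average instead of the 8 bytes the hardware could move.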
Since the LDS_READ2_RET instruction can take two separate addresses, the compiler could optimize almost all pairs of reads from local memory into that instruction. For example:
float sum = localBuf[ndx] + localBuf[ndx + stride]; // 1 LDS_READ2_RET instead of 2 LDS_READ_RET
float val = min(localBuf[i], localBuf[j]); // same here, even if indices are unrelated
As for LDS_WRITE_REL, it accepts one index and a constant offset, so the compiler could optimize expressions like:
localBuf[ndx] = val0;
localBuf[ndx + 1] = val1;
into one LDS_WRITE_REL, instead of two LDS_WRITE instructions.
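The pairing rule for writes can be sketched as a tiny peephole pass. This is plain C, not the actual compiler code, and the `lds_store` struct and `count_fused_stores` function are hypothetical names. It assumes, as described above, that two neighbouring 4-byte stores can share one LDS_WRITE_REL when they use the same index and their constant offsets differ by exactly one element.

```c
#include <assert.h>
#include <stddef.h>

/* One 4-byte local-memory store, addressed as index + constant offset.
 * Hypothetical representation, for illustration only. */
typedef struct {
    int base;   /* id of the index expression, e.g. ndx */
    int offset; /* constant element offset added to the index */
} lds_store;

/* Counts the store instructions left after greedily fusing each pair of
 * neighbouring stores that share the same base index and whose offsets
 * are consecutive (offset, offset + 1). */
size_t count_fused_stores(const lds_store *ops, size_t n)
{
    assert(ops != NULL || n == 0);
    size_t instructions = 0;
    size_t i = 0;
    while (i < n) {
        if (i + 1 < n &&
            ops[i].base == ops[i + 1].base &&
            ops[i + 1].offset == ops[i].offset + 1) {
            i += 2; /* both stores fit into one fused instruction */
        } else {
            i += 1; /* store stays a single 4-byte instruction */
        }
        instructions++;
    }
    return instructions;
}
```

For the two stores to localBuf[ndx] and localBuf[ndx + 1] this reports one instruction, while two stores with unrelated indices stay at two.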