
aisesal
Journeyman III

Using all 32 banks of LDS.

Higher-end GPUs have 32 banks of LDS memory, meaning one work item can read/write 8 bytes of data instead of 4, essentially doubling LDS bandwidth. But the current OpenCL compiler makes it hard to take advantage of this feature. One would expect that access to a float2 array in local memory would make the compiler generate LDS_READ2_RET/LDS_WRITE_REL instructions, but most of the time it generates two LDS_READ_RET/LDS_WRITE instructions instead.

For example, the piece of code below resulted in one LDS_WRITE_REL and four LDS_WRITE instructions. So instead of three store instructions I got five, almost halving the maximum bandwidth. The same happens when I read from local memory: some 8-byte read instructions are split into two 4-byte read instructions.

__local float2 localBuf[3][256];

localBuf[0][localId] = (float2)(vMin.x, vMax.x);
localBuf[1][localId] = (float2)(vMin.y, vMax.y);
localBuf[2][localId] = (float2)(vMin.z, vMax.z);
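
For reference, a minimal self-contained kernel around that snippet might look like the sketch below; everything except the three stores (the kernel signature, the placeholder vMin/vMax computation, the final read-back) is illustrative scaffolding, not part of the original code:

__kernel void storePattern(__global const float4* points, __global float2* out)
{
    __local float2 localBuf[3][256];
    uint localId = get_local_id(0);

    // Placeholder values: in real code vMin/vMax would be a running min/max.
    float4 p = points[get_global_id(0)];
    float3 vMin = p.xyz;
    float3 vMax = p.xyz + 1.0f;

    // The three stores in question: each writes 8 bytes to LDS.
    localBuf[0][localId] = (float2)(vMin.x, vMax.x);
    localBuf[1][localId] = (float2)(vMin.y, vMax.y);
    localBuf[2][localId] = (float2)(vMin.z, vMax.z);

    barrier(CLK_LOCAL_MEM_FENCE);
    out[get_global_id(0)] = localBuf[0][localId];  // keep the stores observable
}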

Since the LDS_READ2_RET instruction can take two separate addresses, the compiler could combine almost any pair of reads from local memory into one such instruction. For example:

float sum = localBuf[ndx] + localBuf[ndx + stride]; // 1 LDS_READ2_RET instead of 2 LDS_READ_RET
float val = min(localBuf[ndxA], localBuf[ndxB]); // same here, even if the two indices are unrelated
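
The first pattern is exactly what a standard work-group reduction does on every iteration; a sketch (hypothetical kernel, assuming a work-group size of 256):

__kernel void reduceSum(__global const float* in, __global float* out)
{
    __local float localBuf[256];
    uint lid = get_local_id(0);

    localBuf[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Each iteration reads localBuf[lid] and localBuf[lid + stride]:
    // an ideal candidate for a single LDS_READ2_RET.
    for (uint stride = 128; stride > 0; stride >>= 1) {
        if (lid < stride)
            localBuf[lid] += localBuf[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        out[get_group_id(0)] = localBuf[0];
}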

As for LDS_WRITE_REL, it accepts one address and a constant offset, so the compiler could optimize expressions like:

localBuf[ndx] = val0;
localBuf[ndx + 1] = val1;

into one LDS_WRITE_REL instead of two LDS_WRITE instructions.
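
Such consecutive-index writes show up whenever a work item stores an interleaved pair; a sketch (hypothetical names, assuming a work-group size of 256):

__kernel void writePairs(__global const float2* in, __global float* out)
{
    __local float localBuf[512];
    uint lid = get_local_id(0);
    float2 v = in[get_global_id(0)];

    uint ndx = 2 * lid;
    localBuf[ndx] = v.x;     // val0
    localBuf[ndx + 1] = v.y; // val1: same base address, constant offset 1,
                             // so both stores could fuse into one LDS_WRITE_REL
    barrier(CLK_LOCAL_MEM_FENCE);
    out[get_global_id(0)] = localBuf[lid];
}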

0 Likes
2 Replies

Yes, that can be done, except that you need to compute two addresses prior to the READ2. In other words, the compiler would schedule the code as:

compute address1
compute address2
LDS_READ2_RET address1, address2
schedule work
get results

instead of

compute address1
LDS_READ_RET address1
compute address2
LDS_READ_RET address2

This would provide no real savings, as you need to schedule work before consuming the first request anyway. If you perform several LDS reads back-to-back, you'll see a pattern like this:

compute address1
LDS_READ_RET address1
compute address2
compute address3
LDS_READ2_RET address2, address3
compute address4
compute address5
LDS_READ2_RET address4, address5
compute address6
LDS_READ_RET address6
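
At the source level, that schedule corresponds to something like six back-to-back reads (a sketch with hypothetical names; localBuf here is a 1-D __local float buffer):

float sum = localBuf[ndx]
          + localBuf[ndx + 1 * stride]
          + localBuf[ndx + 2 * stride]
          + localBuf[ndx + 3 * stride]
          + localBuf[ndx + 4 * stride]
          + localBuf[ndx + 5 * stride];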

0 Likes

Thanks for the detailed reply. I understand that computing two addresses and then scheduling work might provide no real savings, but there are more cases where using LDS_READ2_RET would increase performance. For example, this kind of pattern:

float result = localBuf[ndx * stride] + localBuf[ndx * stride + offset]; // instead of addition it can be any operation that takes two inputs and produces one output

could compile into:

R0.x = ndx
R0.y = stride
R0.z = offset
R0.w = pointer to localBuf
R1.x = result

0 x: MUL_UINT24 ____, R0.x, R0.y
  y: MULADD_UINT24 ____, R0.x, R0.y, R0.z
1 x: MULADD_UINT24 ____, PV0.x, 4, R0.w
  y: MULADD_UINT24 ____, PV0.y, 4, R0.w
2 x: LDS_READ2_RET QAB, PV1.x, PV1.y
3 x: ADD R1.x, QA[2].pop, QB[2].pop

In cases where stride and offset are known at compile time and the pointer to localBuf is 0, it could be optimized even more:

R0.x = ndx
R1.x = result

0 x: MUL_UINT24 ____, R0.x, <stride * 4>
  y: MULADD_UINT24 ____, R0.x, <stride * 4>, <offset * 4>
1 x: LDS_READ2_RET QAB, PV0.x, PV0.y
2 x: ADD R1.x, QA[2].pop, QB[2].pop
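
At the source level that's the same expression with compile-time constants (hypothetical values):

#define STRIDE 256 // hypothetical compile-time constants
#define OFFSET 1
float result = localBuf[ndx * STRIDE] + localBuf[ndx * STRIDE + OFFSET];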

It's safe to use 24-bit integer mul/muladd because LDS is 32 KB in size, so every variable used in the address computation fits in 15 bits; with ndx and stride both below 2^15, the product ndx * stride is below 2^30, well within what MUL_UINT24 computes exactly. Also, ndx and stride have to be unsigned, as there is no signed 24-bit integer mul/muladd.

As for LDS_WRITE_REL, it takes only one address, so there's no need to compute two addresses; the compiler shouldn't split it into two LDS_WRITE instructions unless there's a good reason to.

0 Likes