2 Replies Latest reply on Apr 21, 2012 2:44 AM by aisesal@gmail.com

    Using all 32-banks of LDS.


      Higher-end GPUs have 32 banks of LDS memory, meaning one work-item can read/write 8 bytes of data instead of 4, essentially doubling LDS bandwidth. But the current OpenCL compiler makes it hard to take advantage of this feature. One would expect an access to a float2 array in local memory to make the compiler generate LDS_READ2_RET/LDS_WRITE_REL instructions, but most of the time it generates two LDS_READ_RET/LDS_WRITE instructions instead.


      For example, the piece of code below resulted in one LDS_WRITE_REL and four LDS_WRITE instructions. So instead of 3 store instructions I got 5, almost halving the maximum bandwidth. The same happens when I read from local memory: some 8-byte read instructions are split into two 4-byte read instructions.

      __local float2 localBuf[3][256];
      localBuf[0][localId] = (float2)(vMin.x, vMax.x);
      localBuf[1][localId] = (float2)(vMin.y, vMax.y);
      localBuf[2][localId] = (float2)(vMin.z, vMax.z);


      Since the LDS_READ2_RET instruction can take two separate addresses, the compiler could combine almost any pair of reads from local memory into that instruction. For example:

      float sum = localBuf[ndx] + localBuf[ndx + stride]; // 1 LDS_READ2_RET instead of 2 LDS_READ_RET
      float val = min(localBuf[i], localBuf[j]); // same here, even if the indices are unrelated


      As for LDS_WRITE_REL, it accepts one index plus a constant offset, so the compiler could optimize expressions like:

      localBuf[ndx] = val0;
      localBuf[ndx + 1] = val1;

      into one LDS_WRITE_REL instead of two LDS_WRITE instructions.

        • Re: Using all 32-banks of LDS.

          Yes, that can be done, except that you need to compute both addresses prior to the READ2.  In other words, the compiler would schedule the code as:

          compute address1

          compute address2

          LDS_READ2_RET address1, address2

          schedule work

          get results


          instead of

          compute address1

          LDS_READ_RET address1

          compute address2

          LDS_READ_RET address2


          This would provide no real savings, as you need to schedule work before consuming the first request anyway.  If you perform several LDS reads back-to-back, you'll see a pattern like this:

          compute address1

          LDS_READ_RET address1

          compute address2

          compute address3

          LDS_READ2_RET address2, address3

          compute address4

          compute address5

          LDS_READ2_RET address4, address5

          compute address6

          LDS_READ_RET address6

            • Re: Using all 32-banks of LDS.

              Thanks for the detailed reply. I understand that computing two addresses and then scheduling work might provide no real savings, but there are other cases where using LDS_READ2_RET would increase performance. For example, this kind of pattern:

              float result = localBuf[ndx * stride] + localBuf[ndx * stride + offset]; // instead of addition, it can be any operation that takes two inputs and produces one output

              could compile into:

              R0.x = ndx
              R0.y = stride
              R0.z = offset
              R0.w = pointer to localBuf
              R1.x = result
              0 x: MUL_UINT24 ____, R0.x, R0.y
                y: MULADD_UINT24 ____, R0.x, R0.y, R0.z
              1 x: MULADD_UINT24 ____, PV0.x, 4, R0.w
                y: MULADD_UINT24 ____, PV0.y, 4, R0.w
              2 x: LDS_READ2_RET QAB, PV1.x, PV1.y
              3 x: ADD R1.x, QA[2].pop, QB[2].pop


              In cases where stride and offset are known at compile time and the pointer to localBuf is 0, it could be optimized even further:

              R0.x = ndx
              R1.x = result
              0 x: MUL_UINT24 ____, R0.x, <stride * 4>
                y: MULADD_UINT24 ____, R0.x, <stride * 4>, <offset * 4>
              1 x: LDS_READ2_RET QAB, PV0.x, PV0.y
              2 x: ADD R1.x, QA[2].pop, QB[2].pop


              It's safe to use 24-bit integer mul/muladd because LDS is 32 KB in size, so every variable used in the address computation needs no more than 15 bits. Also, ndx and stride should be unsigned, as there is no signed 24-bit integer mul/muladd.


              As for LDS_WRITE_REL, it takes only one address, so there's no need to compute two addresses, and the compiler shouldn't split it into two LDS_WRITE instructions unless there's a good reason to.