1 Reply Latest reply on Feb 7, 2016 9:26 AM by realhet

    GCN LDS Bank Optimization 4-byte vs 8-byte Memory Access Patterns

    optimiz3

      From the AMD Accelerated Parallel Processing OpenCL Programming Guide, section 6.2, page 6-10:

       

      The LDS contains 32-banks, each bank is four bytes wide and 256 bytes deep; the bank address is determined by bits 6:2 in the address.

       

      and:

       

      Bank conflicts are determined by what addresses are accessed on each half wavefront boundary. Threads 0 through 31 are checked for conflicts as are threads 32 through 63 within a wavefront.

       

      This would imply that each bank is 4 bytes wide, meaning the optimal access pattern is for each thread to access a consecutive uint.
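To make the bank mapping concrete, here is a minimal sketch in Python. It models only the rule quoted above (bank index = address bits 6:2, 32 banks of 4 bytes each). The helper name `bank_of` and the constants are illustrative assumptions, not anything from the AMD guide or an OpenCL API.

```python
NUM_BANKS = 32   # from the quoted guide text
BANK_WIDTH = 4   # bytes per bank, from the quoted guide text

def bank_of(byte_addr):
    """Hypothetical helper: bank index taken from address bits 6:2."""
    return (byte_addr >> 2) & (NUM_BANKS - 1)

# Threads 0..31 (one half-wavefront) each read one consecutive uint.
half_wave_banks = [bank_of(tid * BANK_WIDTH) for tid in range(32)]

# Every thread lands on a different bank, so this pattern has no conflicts.
distinct_banks = len(set(half_wave_banks))
print(distinct_banks)
```

Under this model the 32 threads of a half-wavefront touch 32 distinct banks, and the mapping wraps around every 128 bytes.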

       

      So far so good, but then this comes up:

       

      Ensure, as much as possible, that the memory requests generated from a quarter-wavefront avoid bank conflicts by using unique address bits 6:2. A simple sequential address pattern, where each work-item reads a float2 value from LDS, generates a conflict-free access pattern on the AMD Radeon™ HD 7XXX GPU.

      This seems to contradict the earlier quotes: with sequential float2 reads, each half-wavefront covers 256 bytes and therefore accesses addresses with the same 6:2 bits twice, which according to those quotes should cause bank conflicts.

       

      Which is it? Do sequential uint2s cause bank conflicts? Or is it that, while the first two quotes are technically accurate, it would be better to say "Threads 0 through 15, 16 through 31, 32 through 47, and 48 through 63 are checked for conflicts within a wavefront", since each wavefront is executed in quarter-wavefront units?
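The two readings in the question can be compared with a small sketch. It counts how many dword accesses land on the busiest bank when the sequential float2 pattern is checked over a half-wavefront (32 threads) versus a quarter-wavefront (16 threads). The function names are hypothetical, and the model covers only the bits 6:2 rule from the quotes; it says nothing about how the hardware actually schedules the requests.

```python
from collections import Counter

def bank_of(byte_addr):
    """Hypothetical helper: bank index taken from address bits 6:2."""
    return (byte_addr >> 2) & 31

def max_hits_per_bank(thread_ids, bytes_per_thread):
    """Largest number of dword accesses that fall on a single bank."""
    hits = Counter()
    for tid in thread_ids:
        base = tid * bytes_per_thread          # sequential addressing
        for offset in range(0, bytes_per_thread, 4):
            hits[bank_of(base + offset)] += 1  # one access per dword
    return max(hits.values())

# float2 = 8 bytes per work-item
half_wave = max_hits_per_bank(range(32), 8)
quarter_wave = max_hits_per_bank(range(16), 8)
print(half_wave, quarter_wave)
```

In this model a half-wavefront of float2 reads hits every bank twice, while a quarter-wavefront hits every bank exactly once, which is consistent with the guide describing the float2 pattern as conflict-free per quarter-wavefront.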

        • Re: GCN LDS Bank Optimization 4-byte vs 8-byte Memory Access Patterns
          realhet

          Hi,

          As far as I know, there are 32 dword-sized banks. Period.

           

          If you read 64 consecutive dwords, it will take 2 cycles to process. Every bank is used and there are no conflicts. That's the fastest speed of the LDS.

           

          In the second example you read 64 consecutive float2s. These are first split into dwords (128 in total), so every LDS bank handles 4 reads. It takes 4 cycles, and because there are no bank conflicts, all the banks stay busy the whole time.

           

          Both examples use the LDS at maximum utilization. The latter simply has 2x as much data to work with.
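The cycle counts in this reply can be reproduced with a simplified model, assuming each of the 32 banks serves one dword per cycle and every wider access is split into dwords first. The function `lds_cycles` and its parameters are hypothetical names for this sketch; real hardware behaviour (broadcasts, scheduling, instruction issue) is not modelled.

```python
from collections import Counter

NUM_BANKS = 32

def lds_cycles(num_threads, bytes_per_thread, stride_bytes):
    """Return (cycles, bank utilization) under the one-dword-per-bank-per-cycle assumption."""
    hits = Counter()
    for tid in range(num_threads):
        base = tid * stride_bytes
        for offset in range(0, bytes_per_thread, 4):
            hits[((base + offset) >> 2) % NUM_BANKS] += 1
    cycles = max(hits.values())                  # busiest bank sets the time
    dwords = num_threads * bytes_per_thread // 4
    utilization = dwords / (NUM_BANKS * cycles)  # 1.0 means no bank ever idles
    return cycles, utilization

dword_case = lds_cycles(64, 4, 4)       # 64 consecutive dwords
float2_case = lds_cycles(64, 8, 8)      # 64 consecutive float2s
conflict_case = lds_cycles(64, 4, 128)  # every thread on the same bank
print(dword_case, float2_case, conflict_case)
```

Under these assumptions the dword and float2 cases both reach a utilization of 1.0 (2 and 4 cycles respectively), whereas a 128-byte stride puts all 64 accesses on one bank and serializes them.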