4 Replies Latest reply on Oct 7, 2010 2:16 PM by Raistmer

    LDS banks on HD5870

      From manual:
      All AMD Evergreen GPUs contain a 32K LDS for each compute unit. On high-end
      GPUs, the LDS contains 32 banks, each bank is four bytes long.
      If each bank holds 4 bytes and there are 32 banks, where are the other 32*1024 - 32*4 bytes of LDS located?

      And later:
      Note that a sequential access pattern, where each work-item reads a float4 value
      from LDS, uses only half the banks on each cycle on the ATI Radeon™ HD 5870
      GPU and delivers half the performance of the float2 access pattern.
      How can that be?
      Bytes 0x00-0x03 go to bank 1, 0x04-0x07 to bank 2, and so on, right? That is, 16 (quarter-wavefront) * 4 (float4) * 4 (bytes per float) = 256 bytes are read, which should cover each bank twice as often as the float2 accesses, no?
        • LDS banks on HD5870


          LDS is 32 KB and is organized in 4-byte memory elements, so there are 8K such elements in all.

          We divide these 8K elements into 32 banks of 256 elements each (8192 / 32 = 256), so every bank has a depth of 256 memory elements. The "four bytes" in the manual is the width of a bank, not its total size.

          At any time, only one memory element (4 bytes) can be accessed from a given bank; the bank is selected by the address bits.
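To make the layout concrete, here is a minimal Python sketch of the address-to-bank mapping. It assumes the usual word-interleaved scheme (consecutive 4-byte words go to consecutive banks), which is what the discussion here implies but is not spelled out in the quoted manual text. The helper name `locate` is made up for illustration.

```python
# Sketch of a word-interleaved LDS layout (an assumption, see lead-in).
LDS_BYTES = 32 * 1024   # 32 KB of LDS per compute unit
WORD_BYTES = 4          # each bank is 4 bytes wide
NUM_BANKS = 32          # banks on a high-end Evergreen GPU

TOTAL_WORDS = LDS_BYTES // WORD_BYTES       # 8192 four-byte words
WORDS_PER_BANK = TOTAL_WORDS // NUM_BANKS   # 256 words deep per bank


def locate(byte_addr):
    """Return (bank, row) for a byte address in LDS."""
    word = byte_addr // WORD_BYTES
    return word % NUM_BANKS, word // NUM_BANKS


if __name__ == "__main__":
    for addr in (0x00, 0x04, 0x7C, 0x80, LDS_BYTES - 4):
        print(hex(addr), locate(addr))
```

With this mapping, address 0x7C is the last word of row 0 (bank 31), address 0x80 wraps around to bank 0 in row 1, and the final word of LDS sits in bank 31, row 255. So the "missing" bytes are simply the deeper rows of the same 32 banks.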

          If each thread of a quarter-wavefront reads a float4, it reads from 4 banks per thread, so only 8 threads can read at one time (8*4 = 32 banks) and the rest of the threads have to wait. When each thread reads a float2, it reads from only 2 banks, so 16 threads read from all the available banks (16*2 = 32 banks). No thread waits, and hence performance increases.

            • LDS banks on HD5870

              To clarify on that, each compute unit (5-way VLIW block) has a 64-bit read port to LDS that carries a single address. It can read two consecutive 32-bit words from LDS on each cycle. As a result if you attempt to read a 128-bit entry, say a float4, it will read half of it on each cycle. All well and good.

              However, this is basically a 64-bit read from each lane with a stride of 128 bits. If you work out the addressing on that, you'll see that you have 2-way conflicts. The first 8 lanes will read; the second 8 will conflict and have to wait. You only get 1/8 of the wavefront doing its reads on a given cycle and hence are executing at 50% of the bandwidth.
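The addressing can be worked out with a short Python sketch. It assumes 32 banks of 4-byte words with word-interleaved mapping and a quarter-wavefront of 16 lanes, each lane reading two consecutive words per cycle. The function name `bank_pressure` is hypothetical, and this is a simplified model rather than a description of the actual hardware arbitration.

```python
from collections import Counter

NUM_BANKS = 32   # assumed bank count (high-end Evergreen)
WORD_BYTES = 4   # bank width in bytes
LANES = 16       # one quarter-wavefront


def bank_pressure(stride_bytes, words_per_read=2):
    """Return (worst-case lanes per bank, number of distinct banks used)
    when every lane reads `words_per_read` consecutive words and lanes
    are `stride_bytes` apart."""
    hits = Counter()
    for lane in range(LANES):
        first_word = (lane * stride_bytes) // WORD_BYTES
        for w in range(words_per_read):
            hits[(first_word + w) % NUM_BANKS] += 1
    return max(hits.values()), len(hits)


if __name__ == "__main__":
    print("float2 (64-bit stride): ", bank_pressure(8))    # (1, 32)
    print("float4 (128-bit stride):", bank_pressure(16))   # (2, 16)
```

In this model the float2 pattern touches all 32 banks exactly once, while the 128-bit stride touches only 16 banks, each of them twice. That is the 2-way conflict, and it matches the manual's remark that float4 uses only half the banks on each cycle.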

              ETA: It's been pointed out that my interpretation of diagrams was a little inaccurate. The LDS read2 instruction actually does take two addresses according to the ISA docs.


            • LDS banks on HD5870
              I'll have the wording for this section clarified to make it easier to understand. The problem with the float4 access pattern is that threads 0-7 cover all 32 banks, but only 2 reads per thread can occur per cycle. So in cycle 1, banks 0/1, 4/5, 8/9, 12/13, etc. are read, and threads 8-15 attempt to read from the same banks, causing bank conflicts. In the second cycle, banks 2/3, 6/7, 10/11, etc. are read, and again threads 8-15 attempt to read from the same banks, causing more bank conflicts. So if reading a float2 takes N cycles for a wavefront, reading a float4 takes 4N cycles instead of 2N.
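The 4N figure can be reproduced with a small counting model in Python. This is only a sketch under stated assumptions: 32 word-interleaved banks, 16 threads per quarter-wavefront, two words read per thread per pass, and a pass that costs one extra step whenever threads 0-7 and threads 8-15 want the same banks. The names `float4_banks`, `float4_steps` and `float2_steps` are made up for illustration.

```python
NUM_BANKS = 32   # assumed bank count
THREADS = 16     # one quarter-wavefront


def float4_banks(thread, pass_index):
    """Banks touched by one thread in pass 0 (.xy) or pass 1 (.zw)."""
    first_word = thread * 4 + pass_index * 2
    return first_word % NUM_BANKS, (first_word + 1) % NUM_BANKS


def float4_steps():
    """Steps for a quarter-wavefront to read one float4 per thread."""
    steps = 0
    for pass_index in range(2):
        low = {b for t in range(0, 8) for b in float4_banks(t, pass_index)}
        high = {b for t in range(8, 16) for b in float4_banks(t, pass_index)}
        # Overlapping banks force the two groups to be serialized.
        steps += 2 if low & high else 1
    return steps


def float2_steps():
    """Steps for a quarter-wavefront to read one float2 per thread."""
    banks = [(t * 2 + w) % NUM_BANKS for t in range(THREADS) for w in range(2)]
    return 1 if len(set(banks)) == len(banks) else 2


if __name__ == "__main__":
    print(sorted({b for t in range(8) for b in float4_banks(t, 0)}))
    print(sorted({b for t in range(8) for b in float4_banks(t, 1)}))
    print("float2 steps:", float2_steps())   # 1
    print("float4 steps:", float4_steps())   # 4
```

The two printed bank lists are 0/1, 4/5, ... for the first pass and 2/3, 6/7, ... for the second. Threads 8-15 land on the same banks in both passes, so each pass is serialized and the float4 read costs four steps against one for float2.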
              • LDS banks on HD5870
                Thanks a lot to all! It's a pleasure to receive such detailed and fast answers!