From the manual:
All AMD Evergreen GPUs contain a 32K LDS for each compute unit. On high-
end GPUs, the LDS contains 32-banks, each bank is four bytes long
If each bank holds 4 bytes and there are 32 banks, that accounts for only 32*4 = 128 bytes. Where are the other 32*1024 - 32*4 bytes of the LDS located?
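To make the arithmetic behind this question concrete, here is a small Python sketch. The constants are the numbers quoted from the manual. The last two lines show one possible reading that I have not confirmed: that "four bytes" is the width of a bank entry rather than the bank's total capacity. The variable names are my own.

```python
LDS_BYTES = 32 * 1024   # 32K LDS per compute unit, as quoted
BANKS = 32              # number of banks, as quoted
BANK_WIDTH = 4          # "four bytes" per bank, as quoted

# Reading 1: each bank stores only 4 bytes in total.
one_row = BANKS * BANK_WIDTH          # bytes covered by one 4-byte entry per bank
unaccounted = LDS_BYTES - one_row     # bytes this reading cannot explain

# Reading 2 (assumption): 4 bytes is the entry width and each bank is many entries deep.
rows_per_bank = LDS_BYTES // one_row          # entries each bank would need
bytes_per_bank = rows_per_bank * BANK_WIDTH   # capacity of one bank under this reading

print(one_row, unaccounted, rows_per_bank, bytes_per_bank)
```

Under the second reading every byte of the 32K is accounted for, but whether that is what the manual means is exactly what I am asking.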

And later:
Note that a sequential access pattern, where each work-item reads a float4 value
from LDS, uses only half the banks on each cycle on the ATI Radeon™ HD 5870
GPU and delivers half the performance of the float2 access pattern
How can that be?
Addresses 0x00-0x03 go to bank 1, 0x04-0x07 to bank 2, and so on, right? That is, 16 (quarter-wavefront) * 4 (float4) * 4 (bytes per float) = 256 bytes are read, which should cover each bank twice as many times as the float2 accesses, no?
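Here is the mapping I have in mind, written as a short Python sketch. It assumes that consecutive 4-byte words go to consecutive banks and that 16 work-items (a quarter-wavefront) access the LDS together. Both are my assumptions, not statements from the manual, and the sketch only counts bank hits over the whole access. It does not model how the hardware splits the access across cycles. Banks are numbered from 0 here.

```python
BANKS = 32          # number of LDS banks, as quoted from the manual
BANK_WIDTH = 4      # bytes per bank entry, as quoted
QUARTER_WAVE = 16   # work-items assumed to access the LDS together

def bank_of(byte_addr):
    # Assumed mapping: consecutive 4-byte words go to consecutive banks.
    return (byte_addr // BANK_WIDTH) % BANKS

def bank_hits(floats_per_item):
    """Count the 4-byte words landing in each bank for a sequential access."""
    hits = [0] * BANKS
    for item in range(QUARTER_WAVE):
        base = item * floats_per_item * 4      # start address of this work-item's vector
        for k in range(floats_per_item):
            hits[bank_of(base + 4 * k)] += 1   # one hit per float component
    return hits

float2_hits = bank_hits(2)   # 16 * 2 * 4 = 128 bytes
float4_hits = bank_hits(4)   # 16 * 4 * 4 = 256 bytes
print("float2:", float2_hits)
print("float4:", float4_hits)
```

In this model the float2 pattern touches every bank once and the float4 pattern touches every bank twice, which is why I do not see where "only half the banks" comes from.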