The LDS is 32 kB and is organized in 4-byte memory blocks, so there are 8K such blocks in all.
These 8K memory blocks are divided into 32 segments (banks) of 256 memory elements each (8192 / 32 = 256), so each memory bank has a depth of 256 memory elements.
At any time, only one memory element (4 bytes) can be accessed from a given bank; which bank is selected depends on the address bits.
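To make the layout concrete, here is a minimal sketch of the address-to-bank mapping. It assumes consecutive 4-byte words land in consecutive banks (word index modulo 32), which is the usual interleaving but is not stated explicitly above; the function name `bank_and_row` is made up for illustration.

```python
LDS_BYTES = 32 * 1024   # 32 kB of LDS, as quoted above
WORD_BYTES = 4          # one memory element is 4 bytes
NUM_BANKS = 32

NUM_WORDS = LDS_BYTES // WORD_BYTES    # total 4-byte elements in LDS
BANK_DEPTH = NUM_WORDS // NUM_BANKS    # elements held by each bank


def bank_and_row(byte_addr):
    """Return (bank, row) for a byte address.

    Assumption: words are interleaved across banks, so the low bits of
    the word index pick the bank and the remaining bits pick the row.
    """
    word = byte_addr // WORD_BYTES
    return word % NUM_BANKS, word // NUM_BANKS


if __name__ == "__main__":
    print("words:", NUM_WORDS, "depth per bank:", BANK_DEPTH)
    for addr in (0, 4, 124, 128):
        print(addr, "->", bank_and_row(addr))
```

Under this assumption, two addresses that are a multiple of 128 bytes apart fall in the same bank, which is what makes strided access patterns collide.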
If each thread of a quarter-wavefront tries to read a float4, each thread reads from 4 banks, so only 8 threads can read at one time (8*4 = 32 banks) and the rest of the threads have to wait. When each thread reads a float2 instead, it reads from only 2 banks, so 16 threads read from all the available banks (16*2 = 32 banks). No thread has to wait, and hence performance increases.
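The counting argument above can be checked with a few lines of Python. This is only a sketch: it assumes thread t reads a contiguous vector starting at word t * (vector width) and that words are interleaved across the 32 banks; the helper names are hypothetical.

```python
NUM_BANKS = 32


def banks_touched(thread_id, words_per_thread):
    """Banks hit by one thread reading a contiguous vector of 32-bit words."""
    first_word = thread_id * words_per_thread
    return [(first_word + i) % NUM_BANKS for i in range(words_per_thread)]


def threads_before_first_repeat(words_per_thread):
    """Count how many consecutive threads fit before a bank is hit twice."""
    seen = set()
    thread_id = 0
    while True:
        banks = banks_touched(thread_id, words_per_thread)
        if any(b in seen for b in banks):
            return thread_id
        seen.update(banks)
        thread_id += 1


if __name__ == "__main__":
    for name, width in (("float", 1), ("float2", 2), ("float4", 4)):
        print(name, threads_before_first_repeat(width))
```

With a width of 4 words the banks wrap around after 8 threads, and with a width of 2 words after 16 threads, matching the 8*4 and 16*2 figures.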
To clarify on that, each compute unit (5-way VLIW block) has a 64-bit read port to LDS that carries a single address. It can read two consecutive 32-bit words from LDS on each cycle. As a result if you attempt to read a 128-bit entry, say a float4, it will read half of it on each cycle. All well and good.
However this is basically a 64-bit read from each lane with a stride of 128-bits. If you work out the addressing on that you'll see that you have 2-way conflicts. The first 8 will read, the second 8 will conflict and have to wait. You only get 1/8 of the wavefront doing its reads on a given cycle and hence are executing at 50% of the bandwidth.
ETA: It's been pointed out that my interpretation of diagrams was a little inaccurate. The LDS read2 instruction actually does take two addresses according to the ISA docs.
I'll have the wording for this section clarified to make it easier to understand. The problem with the float4 access pattern is that threads 0-7 cover all 32 banks, but only 2 reads per thread can occur per cycle. So in cycle 1, banks 0/1, 4/5, 8/9, 12/13, etc. are read, and threads 8-15 attempt to read from the same banks, causing bank conflicts. In the second cycle, banks 2/3, 6/7, 10/11, etc. are read, and again threads 8-15 attempt to read from the same banks, causing more bank conflicts. So if reading a float2 takes N cycles for a wavefront, reading a float4 takes 4N cycles instead of 2N.
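A rough way to reproduce the 4N figure is to count bank hits per cycle. The sketch below is a simplified model, not a description of the actual hardware: it assumes 16 threads issue together, each reads two consecutive 32-bit words per cycle, and a k-way conflict on a bank costs k cycles. All names are hypothetical.

```python
from collections import Counter

NUM_BANKS = 32
THREADS_PER_GROUP = 16   # assumption: 16 threads issue their reads together
WORDS_PER_READ = 2       # each thread reads two 32-bit words per cycle


def conflict_degree(stride_words, offset_words):
    """Worst-case number of threads hitting the same bank in one pass.

    stride_words: distance between consecutive threads' vectors, in words.
    offset_words: which 2-word half of the vector is read in this pass.
    Returns (max hits on any bank, number of distinct banks used).
    """
    hits = Counter()
    for t in range(THREADS_PER_GROUP):
        base = t * stride_words + offset_words
        for i in range(WORDS_PER_READ):
            hits[(base + i) % NUM_BANKS] += 1
    return max(hits.values()), len(hits)


def cycles_for_vector(vector_words):
    """Cycles for 16 threads to read one vector each, under this model."""
    total = 0
    for offset in range(0, vector_words, WORDS_PER_READ):
        degree, _ = conflict_degree(vector_words, offset)
        total += degree   # a k-way conflict is serialized into k cycles
    return total


if __name__ == "__main__":
    print("float2 pass:", conflict_degree(2, 0))
    print("float4 first half:", conflict_degree(4, 0))
    print("float4 second half:", conflict_degree(4, 2))
    print("cycles float2:", cycles_for_vector(2))
    print("cycles float4:", cycles_for_vector(4))
```

In this model a float2 pass uses all 32 banks exactly once, while each half of a float4 read uses only 16 banks with two threads on each, so the float4 read costs four times as many cycles as the float2 read rather than twice as many.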
Thanks a lot to all! It's a pleasure to receive such detailed and fast answers!