I want to exchange data between the threads of a wavefront via LDS, but I'm not sure I'm getting maximum performance. Unfortunately, Stream Profiler does not work on my platform (Windows 7 64-bit).
I have read that the LDS is composed of 32 banks, each 32 bits wide, and that a bank cannot service more than one access per clock. So I'm trying to give each thread a DWORD offset based on its thread ID.
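
To make the question concrete, here is a minimal Python sketch of how I picture the bank mapping. It is only a model under assumptions, not a description of the real hardware: I assume consecutive DWORDs map to consecutive banks (bank = DWORD index mod 32), that accesses are issued in groups of 16 threads (which is exactly what I am unsure about), and that two accesses to the same bank in one group always conflict. The names `bank_of` and `worst_conflict` are made up for illustration.

```python
from collections import Counter

NUM_BANKS = 32        # from what I have read: 32 banks
BANK_WIDTH_BYTES = 4  # each bank is 32 bits wide

def bank_of(byte_offset):
    """Bank hit by a byte offset, ASSUMING bank = DWORD index mod 32."""
    return (byte_offset // BANK_WIDTH_BYTES) % NUM_BANKS

def worst_conflict(byte_offsets):
    """Largest number of accesses in one group that land on the same bank."""
    counts = Counter(bank_of(offset) for offset in byte_offsets)
    return max(counts.values())

GROUP_SIZE = 16  # ASSUMED number of threads issued together
thread_ids = range(GROUP_SIZE)

# One DWORD per thread: offset = thread ID * 4 bytes.
unit_stride = [tid * 4 for tid in thread_ids]

# A stride of 32 DWORDs: every thread ends up on bank 0.
stride_32 = [tid * 32 * 4 for tid in thread_ids]

print(worst_conflict(unit_stride))  # 1  (no two threads share a bank)
print(worst_conflict(stride_32))    # 16 (all threads hit the same bank)
```

If this model is right, a plain thread-ID-based DWORD offset would be conflict-free within each group, but that depends on how the threads are actually grouped.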
But what exactly happens when my IL kernel executes an lds_store instruction? Will there be four sets of 16 write accesses, grouped by thread ID like this:
First set: IDs 15..0
Second set: IDs 31..16, and so forth? Or does it work completely differently?
Any help greatly appreciated.