These questions are based on the document posted in another topic: http://forums.amd.com/devforum/messageview.cfm?catid=328&threadid=111092&enterthread=y
1) Is the read throughput of LDS and global memory the same? I'm asking because, if so, I can give up on trying to use LDS for broadcasting...
2) Is it possible to use burst reads and broadcast directly from global memory, or only from LDS?
3) What's the throughput of the burst mode?
4) Can Brook+ or CAL generate non-waterfall or broadcast reads with a "wavefront-id" index? If not, please consider this a feature request.
5) In your opinion, if I need a broadcast in a pattern similar to matrix multiplication, would it work to load the data into several registers and then have each thread select one of them?
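To make question 5 concrete, here is a rough CPU-side sketch in C of the pattern I have in mind. It is not Brook+ or CAL code. The names `WAVEFRONT_SIZE`, `NUM_REGS` and `select_from_registers` are made up for illustration, and the wavefront width of 64 is an assumption.

```c
#include <assert.h>

#define WAVEFRONT_SIZE 64  /* assumed wavefront width */
#define NUM_REGS 4         /* hypothetical number of preloaded registers */

/* Emulates a single thread of the wavefront.
   Every thread reads the same NUM_REGS addresses (the "broadcast" part),
   then keeps only the value selected by its own thread index. */
static float select_from_registers(const float *src, int thread_id)
{
    float r[NUM_REGS];
    int i;

    assert(thread_id >= 0 && thread_id < WAVEFRONT_SIZE);

    for (i = 0; i < NUM_REGS; ++i)
        r[i] = src[i];               /* same address for all threads */

    return r[thread_id % NUM_REGS];  /* per-thread selection */
}
```

Calling `select_from_registers(src, t)` for `t` from 0 to 63 emulates one wavefront. What I don't know is whether the per-thread selection stays cheap on the real hardware or turns into a waterfall access.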
Thank you in advance.