Does anybody know whether the hardware is able to "burst" global memory reads as well as writes (if this is a meaningful idea) and, if so, how to write IL to do this? My code
seems to generate 4 MEM_GLOBAL_READ_IND gpuisa instructions, whereas the same code with the src/dst's interchanged generates 1 MEM_GLOBAL_WRITE_IND with a BRSTCNT(3). I am concerned about memory bandwidth.
Relatedly, can I check that the theoretical memory bandwidth of a 3870, say, is about 70GB/s? Is "all" of this accessible for global buffer reads only, writes only, or reads and writes together? If not, I am worried that any code I write using mainly a global buffer will be doomed to be slow from the start, especially as some of the SDK examples seem to give numbers of the order of only 9GB/s (e.g. bursting_IL). Or will this change for the new cards?
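For reference, here is the back-of-envelope arithmetic behind that figure, as a rough sketch only. The 256-bit bus width and the 1125 MHz GDDR4 memory clock (two transfers per clock) are my assumptions taken from the published 3870 specs, not anything I have measured, and the function name is just for illustration.

```python
def theoretical_bandwidth_gb_s(bus_width_bits, memory_clock_mhz, transfers_per_clock=2):
    """Peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = memory_clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9

# Assumed 3870 figures: 256-bit bus, 1125 MHz GDDR4, double data rate.
hd3870 = theoretical_bandwidth_gb_s(256, 1125)
print(f"theoretical peak: {hd3870:.1f} GB/s")

# Fraction of that peak represented by the ~9GB/s seen in the SDK example.
observed_fraction = 9 / hd3870
print(f"9 GB/s is {observed_fraction:.1%} of peak")
```

If those assumed figures are right, the peak comes out slightly above 70GB/s, and 9GB/s would be only about an eighth of it.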
Are there any other tips one can give for achieving maximum global buffer bandwidth? (One thing I have mooted, for example, is having "tall and thin" domains, e.g. (2,512), so that if a buffer is basically accessed by vObjIndex0.x, each quad should be accessing sequential memory. I haven't had a chance to test this in any way. Does it make sense, though, and might it help?)
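To make the "tall and thin" idea concrete, here is a small sketch of what I have in mind. It assumes that threads are grouped into 2x2 quads and that the buffer index is the row-major linear position y * width + x; both are assumptions on my part about how the hardware and my indexing behave, and the helper names are hypothetical.

```python
def quad_linear_addresses(x0, y0, width):
    """Linear buffer indices touched by the 2x2 quad whose top-left thread is (x0, y0).

    Assumes a row-major mapping: index = y * width + x.
    """
    return [y * width + x for y in (y0, y0 + 1) for x in (x0, x0 + 1)]

def is_sequential(addresses):
    """True if the indices form one contiguous run with no gaps."""
    ordered = sorted(addresses)
    return all(b - a == 1 for a, b in zip(ordered, ordered[1:]))

# "Tall and thin" domain (2,512): width 2, height 512.
tall_thin = quad_linear_addresses(0, 0, width=2)
# Conventional wide domain for comparison, width 512.
wide = quad_linear_addresses(0, 0, width=512)

print(tall_thin, is_sequential(tall_thin))
print(wide, is_sequential(wide))
```

Under those assumptions, a quad in the width-2 domain touches four consecutive indices, whereas in a width-512 domain the two rows of the quad are 512 elements apart. Whether the memory system actually rewards this is exactly what I am unsure about.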
Thanks a lot,