I have written a simulation that used __local memory to hold its data. To lift the limit on simulation size imposed by the size of local (shared) memory, I moved all data to __global memory. The slowdown was only about 10%, because many operations are done in registers before the result is written back to memory. I tried the prefetch() function to make reads faster, but it made no noticeable difference.
The work-group size is 256. In every iteration, 3 uint4 vectors are loaded from different buffers into registers, 16 float4 random numbers are generated, 32 nested select statements (each containing a single logical operation) are evaluated, and roughly 64 bit shifts plus roughly as many other bitwise operations are performed. At the end, the result is written back to __global memory. Are all these operations really enough to hide the __global read latency without the prefetch() function?
As far as I could see, nowhere is it explained what the prefetch function is actually good for, but I would think it works somewhat like an async copy. If one knows in advance what data will be read in the next iteration (and that data is not changed in the current iteration), then it can be prefetched at the beginning of the current iteration, so that by the time the next iteration starts, the read is served from the global read cache and not from VRAM.
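To make the pattern concrete, here is a minimal sketch of what I have in mind. It is not my actual kernel: the kernel name, the buffer names, the data layout, and the shift/add stand-in for the real per-iteration work are all made up for illustration. prefetch(p, n) is the OpenCL C built-in that hints n elements starting at p into the global cache; as far as I understand, it does not change the kernel's results and the implementation is free to ignore it. The #ifndef block is only a host-side shim (my own assumption, including the hypothetical shim_gid variable) so that the same source also compiles as plain C for a quick CPU check.

```c
/* Host-side shim (illustrative): an OpenCL compiler defines
   __OPENCL_VERSION__ and skips this part entirely. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
static size_t shim_gid;  /* hypothetical: emulated work-item id */
static size_t get_global_id(unsigned int dim) { (void)dim; return shim_gid; }
/* On the host the hint is a no-op, just like an ignored hint on a device. */
static void prefetch(const void *p, size_t n) { (void)p; (void)n; }
#endif

/* Hypothetical layout: input holds `steps` consecutive slices of `items`
   elements each; work-item gid reads element gid of every slice. */
__kernel void step_with_prefetch(__global const unsigned int *input,
                                 __global unsigned int *output,
                                 unsigned int items,
                                 unsigned int steps)
{
    size_t gid = get_global_id(0);
    unsigned int acc = 0;

    for (unsigned int s = 0; s < steps; ++s) {
        /* Read this iteration's element; the hope is that the hint issued
           one iteration earlier has already brought it into the cache. */
        unsigned int v = input[s * items + gid];

        /* Hint that the next iteration's element will be needed soon.
           The data must not be modified in the meantime. */
        if (s + 1 < steps)
            prefetch(&input[(s + 1) * items + gid], 1);

        /* Stand-in for the register-only work that should overlap
           with the memory access. */
        acc ^= (v << 3) | (v >> 29);
        acc += v;
    }

    /* Single write back to global memory at the end. */
    output[gid] = acc;
}
```

The hint is issued right after the current read so that the arithmetic of the current iteration has a chance to overlap with the fetch for the next one; whether this helps at all presumably depends on the device and driver.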
Correct me if I'm wrong.