From the performance guide:
"The GPU memory subsystem can coalesce multiple concurrent accesses to global memory, provided the memory addresses increase sequentially across the work-items in the wavefront and start on a 128-byte alignment boundary."
So, as I understand it, code like the following should coalesce well, since consecutive work-items access consecutive 4-byte floats:
__global float* data = ...;
data[get_global_id(0)] = ...;
... = data[get_global_id(0)];
However, does this also apply to vector data, where each work-item accesses a 16-byte float4 instead of a single float?
__global float4* data = ...;
data[get_global_id(0)] = ...;
... = data[get_global_id(0)];
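To make the question concrete, here is a small host-side C sketch of the address arithmetic for both cases. It is only an illustration, not a statement of what the hardware does: the wavefront size of 64 is my assumption, the helper names are made up, and the buffer base is assumed to sit on the 128-byte boundary the guide mentions. It shows that the addresses stay sequential in both cases, but that one wavefront touches four times as many bytes with float4.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption for illustration only: 64 work-items per wavefront. */
#define WAVEFRONT_SIZE 64u
/* Alignment boundary quoted from the performance guide. */
#define COALESCE_ALIGNMENT 128u

/* Byte offset (from the buffer base) touched by work-item `gid`
 * when it accesses data[gid] with elements of `elem_size` bytes. */
static size_t byte_offset(size_t gid, size_t elem_size) {
    return gid * elem_size;
}

/* Number of contiguous bytes touched by one wavefront in which
 * every work-item accesses data[get_global_id(0)]. */
static size_t wavefront_span_bytes(size_t elem_size) {
    assert(elem_size > 0);
    size_t first = byte_offset(0, elem_size);
    size_t last  = byte_offset(WAVEFRONT_SIZE - 1, elem_size);
    return last + elem_size - first;
}

/* Nonzero if wavefront number `wave` begins on the alignment boundary,
 * assuming the buffer base itself is 128-byte aligned. */
static int wavefront_start_is_aligned(size_t wave, size_t elem_size) {
    return byte_offset(wave * WAVEFRONT_SIZE, elem_size) % COALESCE_ALIGNMENT == 0;
}
```

Under these assumptions, wavefront_span_bytes(4) is 256 for float and wavefront_span_bytes(16) is 1024 for float4, and every wavefront starts on a 128-byte boundary in both cases. What I don't know is whether the memory subsystem still coalesces the wider 16-byte per-work-item accesses.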
Regards,
- Tom