I played with OpenCL on a 5870 and measured 118 GB/s of bandwidth when copying between two arrays in global memory.
118 GB/s was the best result, obtained with float4; with plain 32-bit floats the copy reached 98 GB/s.
The code is similar to the "float4 vs float1" code in the OpenCL programming guide, just moving one float4 per work item.
That's a bit low compared to the peak of 154 GB/s: it is only about 77% of peak, and I had hoped to see something closer to 130 GB/s. Is this number typical? I'm running on Linux. Does Windows give higher numbers?
What can I expect with the latest cards, such as the 6970?
Any ideas on how I can improve this number?