Please find attached the clinfo output from our AMD machine, which has an AMD FirePro V4800 as its discrete GPU.
The OpenCL rectangular copy function crashes when the CPU is used as the OpenCL device on the AMD Fusion APU.
Is there any way to work around this?
Rectangular-copy from GPU to CPU is very slow when the data is not contiguous in memory. For a rectangle of size 4096x4096, copying the data when it is not contiguous in memory takes 6.7 times as long as copying it when it is contiguous. On our NVIDIA Tesla C2050 machine, the same ratio is 1.34.
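To make clear what we mean by contiguous and non-contiguous here, the following is a minimal host-side C sketch of the addressing involved. It is only a model: `rect_copy_2d` and its parameters are hypothetical names, not part of the OpenCL API, and it says nothing about how any driver actually implements the transfer. The row pitch is assumed to be the number of bytes from the start of one row to the start of the next.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical host-side model of a 2D rectangular copy (not an OpenCL call).
 * Copies `rows` rows of `width_bytes` bytes each from src to dst.
 * Returns the number of separate contiguous runs that had to be copied. */
size_t rect_copy_2d(unsigned char *dst, size_t dst_row_pitch,
                    const unsigned char *src, size_t src_row_pitch,
                    size_t src_x_bytes, size_t src_y,
                    size_t width_bytes, size_t rows)
{
    /* A row of the rectangle cannot be wider than the buffer rows. */
    assert(width_bytes <= src_row_pitch && width_bytes <= dst_row_pitch);

    /* First byte of the rectangle inside the source buffer. */
    const unsigned char *s = src + src_y * src_row_pitch + src_x_bytes;

    /* Contiguous case: the rectangle spans full rows in both buffers,
     * so all rows sit back to back and one copy is enough. */
    if (width_bytes == src_row_pitch && width_bytes == dst_row_pitch) {
        memcpy(dst, s, width_bytes * rows);
        return 1;
    }

    /* Non-contiguous case: there is a gap after every row,
     * so each row is its own contiguous run. */
    for (size_t r = 0; r < rows; ++r)
        memcpy(dst + r * dst_row_pitch, s + r * src_row_pitch, width_bytes);
    return rows;
}
```

In this model, a rectangle with 4096 rows cut out of a wider buffer consists of 4096 separate runs, while the same rectangle covering full rows is a single run. That is the difference between the two cases we measured.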
The results on the NVIDIA Tesla C2050 and the AMD FirePro V4800, comparing the performance of rectangular-copy from GPU to CPU for different rectangle sizes with contiguous and non-contiguous data, can be found here:
https://docs.google.com/spreadsheet/ccc?key=0AjF_xyN9QxOBdE5JS2x4ZzN1MllVMGFWVzIzdnJ1RGc#gid=0
The performance of rectangular-copy from CPU to GPU is similar.
What is the reason for such a huge slowdown in rectangular-copy from GPU to CPU when the data to be copied is not contiguous in memory? The whole point of using a rectangular-copy is to avoid this kind of slowdown.
Is there any way to overcome this? Can we somehow improve the performance of copying non-contiguous memory from GPU to CPU (and from CPU to GPU)?