thejascr
Adept II

OpenCL rectangular-copy function is either slow or crashes on an AMD machine

Please find attached the clinfo output from our AMD machine, which contains an AMD Firepro V4800 as a discrete GPU.

The OpenCL rectangular-copy function crashes when the CPU is used as an OpenCL device on the AMD Fusion APU.

Is there any way to work around this?

Rectangular-copy from GPU to CPU of data which is not contiguous in memory is very slow. For a rectangle of size 4096x4096, copying non-contiguous data takes 6.7 times as long as copying contiguous data. The same ratio on our NVIDIA Tesla C2050 machine is 1.34.

The results (on NVIDIA Tesla C2050 and AMD Firepro V4800) comparing the performance of rectangular-copy from GPU to CPU for different rectangle sizes, when the data to be copied is contiguous in memory and when it is not, can be found here:

https://docs.google.com/spreadsheet/ccc?key=0AjF_xyN9QxOBdE5JS2x4ZzN1MllVMGFWVzIzdnJ1RGc#gid=0

The performance of rectangular-copy from CPU to GPU is similar.

What is the reason for such a huge slowdown in rectangular-copy from GPU to CPU when the data to be copied is not contiguous in memory? The motivation behind using a rectangular-copy is to avoid such a huge slowdown.

Are there any ways in which we can overcome this? Can we improve the performance of copying non-contiguous memory from GPU to CPU (and from CPU to GPU) in some way?

2 Replies
himanshu_gautam
Grandmaster

The results would be affected by the PCI bandwidth in general. Are the two systems equivalent in this regard?

Also, it would be nice if you could share the code. I am not sure what you mean by non-contiguous rectangular copy.

clinfo shows it is a dual-GPU V4800; won't the two GPUs share PCI bandwidth? You should probably give more information about the procedure you followed.

0 Likes

We are not comparing the absolute numbers. We are comparing the ratios, i.e., the relative performance of copying contiguous data and of copying non-contiguous data. Why would the PCI bandwidth have an effect?

The code used is straightforward. Here is an example with more details:

  1. For copying data which is contiguous in memory: An array A of size 512 x 512 was used, and the entire array was copied using OpenCL rectangular-copy; i.e., the rectangle within A from (0,0) to (511,511) was copied. In this case, the 512 x 512 elements to be copied are contiguous in memory.
  2. For copying data which is not contiguous in memory: An array A of size 1024 x 1024 was used, and the first 512 elements of each of the first 512 rows were copied using OpenCL rectangular-copy; i.e., the rectangle within A from (0,0) to (511,511) was copied. In this case, the 512 x 512 elements to be copied are not contiguous in memory.
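
To make the difference between the two cases concrete, here is a minimal host-side sketch in plain C. It is not the benchmark code and makes no OpenCL calls. The names `rect_desc`, `make_rect`, `rect_is_contiguous` and `copy_rect_host` are hypothetical; the fields only mirror the `buffer_origin`, `region` and `buffer_row_pitch` arguments of `clEnqueueReadBufferRect`, where the first component of the origin and region is given in bytes. Whether a particular driver really issues one transfer per row is an assumption, not something measured here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor; fields mirror the buffer_origin, region and
   buffer_row_pitch arguments of clEnqueueReadBufferRect (2-D case). */
typedef struct {
    size_t origin[3];  /* {x offset in bytes, y offset in rows, z offset in slices} */
    size_t region[3];  /* {width in bytes, height in rows, depth in slices} */
    size_t row_pitch;  /* bytes from the start of one source row to the next */
} rect_desc;

/* Describe the top-left rect_rows x rect_cols rectangle of a row-major
   array that has array_cols elements per row. */
rect_desc make_rect(size_t array_cols, size_t rect_cols,
                    size_t rect_rows, size_t elem_size)
{
    rect_desc r = { {0, 0, 0},
                    {rect_cols * elem_size, rect_rows, 1},
                    array_cols * elem_size };
    return r;
}

/* The rectangle is one contiguous block if it has at most one row or if
   every row spans the full pitch of the source array. */
int rect_is_contiguous(const rect_desc *r)
{
    return r->region[1] <= 1 || r->region[0] == r->row_pitch;
}

/* Host-side reference copy into a tightly packed destination.
   Returns the number of memcpy calls that were issued. */
size_t copy_rect_host(const void *src, void *dst, const rect_desc *r)
{
    assert(src != NULL && dst != NULL && r != NULL);

    const unsigned char *s = (const unsigned char *)src
                           + r->origin[1] * r->row_pitch + r->origin[0];
    unsigned char *d = (unsigned char *)dst;

    if (rect_is_contiguous(r)) {
        /* Case 1: the whole rectangle is a single block. */
        memcpy(d, s, r->region[0] * r->region[1]);
        return 1;
    }

    /* Case 2: each row is a separate block, row_pitch bytes apart. */
    for (size_t row = 0; row < r->region[1]; ++row)
        memcpy(d + row * r->region[0], s + row * r->row_pitch, r->region[0]);
    return r->region[1];
}
```

Assuming 4-byte elements (the element type is an assumption), `make_rect(512, 512, 512, 4)` describes case 1 and is copied as one block, while `make_rect(1024, 512, 512, 4)` describes case 2 and becomes 512 separate row copies of 2048 bytes each, spaced 4096 bytes apart in the source.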

The performance of these two cases on the same GPU is compared.

We are using only one discrete GPU; the other discrete GPU and the APU are idle.
