Dear OpenCL users, I recently ported a kernel from CUDA to OpenCL.
This kernel processes a 2D image (~512²) and, for each pixel, fetches ~8000 coordinates from global memory.
It then reads the 2D image ~8000 times using these coordinates.
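
To make the access pattern concrete, here is a minimal CPU-side sketch in plain C. It is not the actual kernel. It assumes the coordinates are offsets relative to the current pixel, that the fetched values are simply summed, and that out-of-range reads are clamped to the edge (similar to CLK_ADDRESS_CLAMP_TO_EDGE). The names offset2d, clamp_int and gather_naive are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 2D offset; stands in for one entry of the coordinate array. */
typedef struct { int dx, dy; } offset2d;

/* Clamp v to [lo, hi] (assumed edge-clamping addressing). */
static int clamp_int(int v, int lo, int hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/* For every pixel, gather n_offsets samples from the image and sum them.
   The sum is an assumption; the real kernel may combine samples differently. */
void gather_naive(const int32_t *image, int width, int height,
                  const offset2d *offsets, size_t n_offsets,
                  int64_t *out)
{
    assert(width > 0 && height > 0);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {   /* one work-item per pixel */
            int64_t acc = 0;
            for (size_t k = 0; k < n_offsets; ++k) {
                /* first fetch: the coordinate (global memory in the kernel) */
                int sx = clamp_int(x + offsets[k].dx, 0, width - 1);
                int sy = clamp_int(y + offsets[k].dy, 0, height - 1);
                /* second fetch: the image sample (read_imagei in the kernel) */
                acc += image[(size_t)sy * width + sx];
            }
            out[(size_t)y * width + x] = acc;
        }
    }
}
```

With ~512² pixels and ~8000 coordinates, the inner loop body runs roughly 2 billion times, and each iteration performs two dependent memory reads.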
The kernel analyzer says the bottleneck is memory fetches, not ALU operations.
On an Nvidia GTX 570, the kernel performs identically in CUDA and OpenCL.
On a Radeon 7850 (which I expected to perform close to the GTX 570), the code is 5 times slower.
I changed my code to use shared memory (local memory in OpenCL terms) to reduce the number of global memory fetches.
Now the kernel analyzer says the bottleneck is ALU ops.
But the 7850 is still 2.5 times slower than the GTX 570.
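
For readers unfamiliar with the local-memory staging idea, here is a minimal CPU-side emulation in plain C. It is an assumption about the approach, not the actual kernel. It assumes the coordinate array is what gets staged, that coordinates are offsets relative to the pixel, and that samples are summed. TILE, GROUP, offset2d and gather_tiled are hypothetical names, and the tiny TILE and GROUP values are only there to keep the sketch easy to trace.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TILE  4   /* hypothetical number of coordinates staged at once */
#define GROUP 2   /* hypothetical work-group size (pixels per group) */

typedef struct { int dx, dy; } offset2d;

static int clamp_int(int v, int lo, int hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Emulates one work-group at a time: each tile of coordinates is copied
   once per group into a small buffer, then reused by every pixel in it. */
void gather_tiled(const int32_t *image, int width, int height,
                  const offset2d *offsets, size_t n_offsets,
                  int64_t *out)
{
    assert(width > 0 && height > 0);
    size_t n_pixels = (size_t)width * (size_t)height;
    offset2d tile[TILE];                 /* stands in for __local memory */

    for (size_t g = 0; g < n_pixels; g += GROUP) {
        size_t g_end = (g + GROUP < n_pixels) ? g + GROUP : n_pixels;
        for (size_t p = g; p < g_end; ++p) out[p] = 0;

        for (size_t base = 0; base < n_offsets; base += TILE) {
            size_t len = (n_offsets - base < TILE) ? n_offsets - base : TILE;
            /* cooperative load: one global read per coordinate per group */
            memcpy(tile, offsets + base, len * sizeof tile[0]);
            /* a real kernel would need barrier(CLK_LOCAL_MEM_FENCE) here */
            for (size_t p = g; p < g_end; ++p) {
                int x = (int)(p % (size_t)width);
                int y = (int)(p / (size_t)width);
                for (size_t k = 0; k < len; ++k) {
                    int sx = clamp_int(x + tile[k].dx, 0, width - 1);
                    int sy = clamp_int(y + tile[k].dy, 0, height - 1);
                    out[p] += image[(size_t)sy * width + sx];
                }
            }
            /* and a second barrier here before the tile is overwritten */
        }
    }
}
```

In this scheme the coordinate reads from global memory drop by a factor of GROUP, but the image reads are unchanged, so the per-sample address arithmetic and image fetches still dominate.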
Any tips regarding:
- the reason why ATI is slower for this kind of kernel
- optimization of this kernel for ATI (my coordinate array is constant across all kernel launches)
PS: the 2D image is in fact a 32-bit greyscale picture.
I'm currently using a CL_R / CL_SIGNED_INT32 image format.
Could this explain the poor performance of my read_imagei() calls?
Thanks a lot for your help.