Dear OpenCL users, I recently ported a kernel from CUDA to OpenCL.
This kernel processes a 2D image (~512²) and, for each pixel, fetches ~8000 coordinates from global memory.
It then fetches from the 2D image ~8000 times using these coordinates.
The kernel analyzer says the bottleneck is memory fetches, not ALU ops.
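To put those figures in perspective, here is a back-of-the-envelope count of fetches per kernel launch. It only uses the approximate numbers quoted above (a 512x512 image and ~8000 coordinates per pixel); the variable names are made up for this sketch, and the real kernel may differ.

```python
# Rough count of memory fetches per kernel launch, using the
# approximate figures from the post (not measured values).
WIDTH = 512
HEIGHT = 512
COORDS_PER_PIXEL = 8000  # "~8000" in the post

pixels = WIDTH * HEIGHT
coord_fetches = pixels * COORDS_PER_PIXEL  # coordinate reads from global memory
image_fetches = pixels * COORDS_PER_PIXEL  # read_imagei() calls on the 2D image
total_fetches = coord_fetches + image_fetches

print(f"pixels:            {pixels:,}")
print(f"coordinate reads:  {coord_fetches:,}")
print(f"image reads:       {image_fetches:,}")
print(f"total per launch:  {total_fetches:,}")
```

With roughly four billion fetches per launch, it is not surprising that the analyzer flags memory fetches first.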
On an Nvidia GTX 570, the kernel performs identically in CUDA and OpenCL.
When running on a Radeon 7850 (which I think should perform close to the GTX 570), the code is 5 times slower.
I changed my code to use shared (local) memory to reduce the number of global memory fetches.
Now the kernel analyzer says the bottleneck is ALU Ops.
But the 7850 is still 2.5 times slower than the GTX 570.
Any tips regarding:
- the reason why ATI is slower for this kind of kernel
- optimization of this kernel for ATI (my coordinate array is constant for all kernel launches)
PS: the 2D image is in fact a 32-bit greyscale picture.
I'm currently using a CL_R - CL_SIGNED_INT32 image format.
Could this explain the poor performance of my read_imagei() calls?
PPS: I changed this to CL_ARGB and updated the kernel to handle 4 consecutive pixels. Performance is the same.
Thanks a lot for your help
I am already testing a few different work-group sizes and always taking the best score in my benchmarks.
I will try to profile for bank conflicts. I was wondering if the number of cycles taken per global read is public.
The GTX 570 has a 320-bit memory bus; the 7850 has a 256-bit memory bus. Given that your image is 512x512, you are more likely to run into bank/channel conflicts on the 7850 than on the GTX 570.
One experiment you can try is to reduce your image to something small, like 32x32, but keep your kernel workload the same. This way the image will fit into cache and bank/channel conflicts won't be an issue. Then you can see how performance is affected.
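One way to set up that experiment without touching the kernel is to fold the existing coordinates into the small image on the host, so the number of fetches stays the same. A minimal sketch, assuming the coordinates are absolute integer (x, y) pixel positions; `wrap_coords` and the sample values are hypothetical and not taken from the original code.

```python
def wrap_coords(coords, size=32):
    """Fold (x, y) coordinates into a size x size image.

    The number of coordinates (and so the number of fetches) is unchanged;
    only the addressed area shrinks so that it can fit in cache.
    """
    return [(x % size, y % size) for (x, y) in coords]


# Hypothetical sample of coordinates addressing a 512x512 image.
coords_512 = [(0, 0), (511, 511), (100, 300), (33, 64)]
coords_32 = wrap_coords(coords_512, 32)

print(coords_32)  # [(0, 0), (31, 31), (4, 12), (1, 0)]
```

If the kernel time drops sharply with the wrapped coordinates, that points at cache misses or channel conflicts rather than the ALU work.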
When you tried local memory, did you manage to avoid LDS bank conflicts?
Thanks for your help Jeff.
Here is the result of the profiler when using local memory.
I can't get the profiler to output occupancy. This option is checked, but the column is not present in the result table!
I was advised to avoid interleaving ALU computation and memory fetches. Packing my read_imagei() calls together provided a notable speedup.
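For anyone wondering what "packing" means structurally, here is a host-side illustration of the two loop shapes. It is only a sketch: `read_pixel` is a hypothetical stand-in for read_imagei(), the squared-sum is placeholder ALU work, and plain Python will not show the speedup. It only shows that the two orderings compute the same result.

```python
def read_pixel(image, x, y):
    """Hypothetical stand-in for read_imagei(); image is a list of rows."""
    return image[y][x]


def interleaved(image, coords):
    # fetch, compute, fetch, compute, ...
    acc = 0
    for (x, y) in coords:
        v = read_pixel(image, x, y)
        acc += v * v  # placeholder ALU work
    return acc


def batched(image, coords, batch=4):
    # issue `batch` fetches back to back, then do the ALU work on all of them
    acc = 0
    for i in range(0, len(coords), batch):
        values = [read_pixel(image, x, y) for (x, y) in coords[i:i + batch]]
        for v in values:
            acc += v * v  # same placeholder ALU work
    return acc


image = [[x + 10 * y for x in range(4)] for y in range(4)]
coords = [(0, 0), (1, 2), (3, 3), (2, 1), (1, 1)]
print(interleaved(image, coords), batched(image, coords))  # 1795 1795
```

In the kernel, the batched shape groups the fetches into one clause so the ALU work is not interleaved with them.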
Regarding the bus size, should I expect a ~50% speedup with a 320-bit bus?
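As a rough check on that question: bus width only sets how many bytes move per transfer, so at an equal data rate a 320-bit bus gives at most 320/256 = 1.25x the peak bandwidth of a 256-bit one. The sketch below works this through; the data rate is a placeholder, since the actual memory clocks of the two cards are not given in this thread.

```python
def peak_bandwidth_gbs(bus_width_bits, data_rate_gtps):
    """Peak bandwidth in GB/s: bytes per transfer times transfers per second (GT/s)."""
    return bus_width_bits / 8 * data_rate_gtps


ratio = 320 / 256
print(f"bus-width ratio: {ratio:.2f}")  # 1.25

# Hypothetical data rate, identical on both buses, just to isolate bus width.
SAME_RATE = 4.0  # GT/s (placeholder, not a spec value)
wide = peak_bandwidth_gbs(320, SAME_RATE)
narrow = peak_bandwidth_gbs(256, SAME_RATE)
print(f"320-bit: {wide} GB/s, 256-bit: {narrow} GB/s")  # 160.0 and 128.0
```

So bus width alone would account for about 25%, not 50%, and a higher memory clock on the narrower bus can close that gap entirely.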
Kernel Occupancy for HD7000 series will be supported in the APP Profiler 2.5.
Based on your GPR, LDS, and work-group size numbers, the compute unit occupancy won't be high. Reducing VGPR usage may increase occupancy and help hide memory latency.
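Until the profiler reports occupancy, it can be estimated by hand. The sketch below is a simplified estimate, not the profiler's formula. The hardware limits are assumptions based on AMD's published GCN figures (10 wavefronts per SIMD, 256 VGPRs per lane, 4 SIMDs and 64 KB of LDS per compute unit) and should be checked against the documentation. The kernel numbers are hypothetical, since the profiler output is not reproduced here, and other limits (SGPRs, work-groups per CU) are ignored.

```python
import math

# Assumed GCN (HD 7000 series) limits; verify against AMD's documentation.
MAX_WAVES_PER_SIMD = 10
VGPRS_PER_SIMD_LANE = 256
SIMDS_PER_CU = 4
LDS_BYTES_PER_CU = 64 * 1024
WAVEFRONT_SIZE = 64


def waves_per_simd(vgprs, lds_bytes_per_group, group_size):
    """Estimate resident wavefronts per SIMD from VGPR and LDS usage only."""
    by_vgpr = VGPRS_PER_SIMD_LANE // max(vgprs, 1)
    waves_per_group = math.ceil(group_size / WAVEFRONT_SIZE)
    if lds_bytes_per_group > 0:
        groups_per_cu = LDS_BYTES_PER_CU // lds_bytes_per_group
        by_lds = groups_per_cu * waves_per_group // SIMDS_PER_CU
    else:
        by_lds = MAX_WAVES_PER_SIMD
    return min(MAX_WAVES_PER_SIMD, by_vgpr, by_lds)


# Hypothetical kernel: 84 VGPRs, 16 KB of LDS per group, 256 work-items per group.
print(waves_per_simd(84, 16 * 1024, 256))  # 3
# Same kernel trimmed to 64 VGPRs.
print(waves_per_simd(64, 16 * 1024, 256))  # 4
```

In this made-up case, dropping from 84 to 64 VGPRs raises the estimate from 3 to 4 wavefronts per SIMD, which gives the scheduler more work to hide memory latency with.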