

Journeyman III

Optimisation tips for a fetch-intensive kernel on ATI

Dear OpenCL users, I recently ported a kernel from CUDA to OpenCL.

This kernel processes a 2D image (~512x512) and, for each pixel, fetches ~8000 coordinates from global memory.

It then fetches from the 2D image ~8000 times using these coordinates.

The kernel analyzer says the bottleneck is memory fetches, not ALU ops.

On an Nvidia GTX 570, the kernel has identical performance in CUDA and OpenCL.

When running on a Radeon 7850 (which I think should perform close to the GTX 570), the code is 5 times slower.

I changed my code to use shared memory and reduce the amount of global memory fetches.

Now the kernel analyzer says the bottleneck is ALU Ops.

But the 7850 is still 2.5 times slower than the GTX 570.

Any tips regarding:

- the reason why ATI is slower for this kind of kernel

- optimization of this kernel for ATI (my coordinates array is constant for all kernel launches)

PS: the 2D image is in fact a 32-bit greyscale picture.

I'm currently using a CL_R - CL_SIGNED_INT32 image format.

Could this explain the bad performance of my read_imagei() calls?

PPS: I changed this to CL_ARGB and updated the kernel to handle 4 consecutive pixels. Same performance :(

Thanks a lot for your help

9 Replies

Hi P.M,

Did you find any bank conflicts in the profiler results? Maybe you can change the work-group size and test again.


Journeyman III

I am already testing a few different work-group sizes and always taking the best score in my benchmarks.

I will try to profile for bank conflicts. I was wondering if the number of cycles taken per global read is public.


The GTX 570 has a 320-bit memory bus; the 7850 has a 256-bit memory bus. Given that your image is 512x512, you are more likely to run into bank/channel conflicts on the 7850 than on the GTX 570.

One experiment you can try is to reduce your image to something small, like 32x32, but keep your kernel workload the same.  This way the image will fit into cache and bank/channel conflicts won't be an issue.  Then you can see how performance is affected.

When you tried local memory, did you manage to avoid LDS bank conflicts?


Thanks for your help Jeff.

Here is the result of the profiler when using local memory.

I can't get the profiler to output occupancy. This option is checked, but the column is not present in the result table!


I was advised to pack my fetches together and avoid interleaving ALU computation with memory fetches. Packing my read_imagei() calls provided a notable speedup.
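For anyone wondering what "packing" means here, this is a rough sketch of the restructuring, written in plain C rather than OpenCL C so it compiles anywhere. The names `sum_interleaved`, `sum_packed` and `BATCH` are made up for illustration, the `img[...]` loads stand in for read_imagei() calls, and squaring the value is only a placeholder for the real per-sample arithmetic. Whether a given compiler keeps this ordering is not guaranteed, so it needs measuring.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH 4  /* hypothetical batch size, tune by measurement */

/* Interleaved: every fetch is followed immediately by the ALU work that uses it. */
long sum_interleaved(const int *img, const int *coords, size_t n)
{
    assert(img != NULL && coords != NULL);
    long acc = 0;
    for (size_t i = 0; i < n; i++) {
        int v = img[coords[i]];      /* fetch (read_imagei() in the kernel) */
        acc += (long)v * v;          /* placeholder ALU work */
    }
    return acc;
}

/* Packed: issue BATCH fetches back to back, then do the ALU work on all of them. */
long sum_packed(const int *img, const int *coords, size_t n)
{
    assert(img != NULL && coords != NULL);
    long acc = 0;
    size_t i = 0;
    for (; i + BATCH <= n; i += BATCH) {
        int v[BATCH];
        for (int k = 0; k < BATCH; k++)
            v[k] = img[coords[i + k]];   /* fetches grouped together */
        for (int k = 0; k < BATCH; k++)
            acc += (long)v[k] * v[k];    /* ALU work grouped together */
    }
    for (; i < n; i++) {                 /* leftover samples */
        int v = img[coords[i]];
        acc += (long)v * v;
    }
    return acc;
}
```

Both functions compute the same result; only the order of loads and arithmetic differs.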

Regarding the bus size, should I expect a ~50% speedup with a 320-bit bus?


Not really. First, 320 is only 25% larger than 256. Second, the clock rate of the bus matters as well.
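To put numbers on that, here is a minimal sketch of the usual peak-bandwidth arithmetic: bytes per transfer times effective transfers per second. The function name `peak_bandwidth_gbs` is made up, and the data rates mentioned afterwards are reference-board figures as I recall them, so treat them as assumptions and check the specs of your own cards.

```c
#include <assert.h>

/* Theoretical peak bandwidth in GB/s.
   bus_width_bits : memory bus width in bits
   effective_mtps : effective data rate in mega-transfers per second
                    (an assumed input, taken from the board's spec sheet) */
double peak_bandwidth_gbs(int bus_width_bits, double effective_mtps)
{
    assert(bus_width_bits > 0 && effective_mtps > 0.0);
    double bytes_per_transfer = bus_width_bits / 8.0;
    return bytes_per_transfer * effective_mtps * 1e6 / 1e9;
}
```

With an assumed 3800 MT/s on the 320-bit bus and 4800 MT/s on the 256-bit bus, this gives roughly 152 GB/s and 153.6 GB/s, so the wider bus alone does not imply more bandwidth.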

Check your code for LDS bank conflicts.



Now that I use shared memory and pack my read_imagei() calls, the profiler says that I have 0.0 bank conflicts.


It's not related to fetch optimization, but your VALUUtilization number is very low. It means there is heavy divergence within a wavefront.
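As a rough illustration of what that counter measures, the sketch below treats each vector instruction as a 64-bit mask of active lanes and averages the active fraction. This is a simplified model, not the profiler's actual implementation; the function names and the mask representation are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WAVEFRONT_SIZE 64  /* lanes per wavefront on this hardware */

/* Count set bits in a lane mask (one bit per active work-item). */
static int count_active_lanes(uint64_t mask)
{
    int n = 0;
    while (mask) {
        mask &= mask - 1;  /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Percentage of lanes doing useful work, averaged over the given
   vector instructions. exec_masks[i] is the active-lane mask for
   instruction i (hypothetical input for this sketch). */
double valu_utilization_percent(const uint64_t *exec_masks, size_t num_instructions)
{
    assert(exec_masks != NULL && num_instructions > 0);
    unsigned long long active = 0;
    for (size_t i = 0; i < num_instructions; i++)
        active += (unsigned long long)count_active_lanes(exec_masks[i]);
    return 100.0 * (double)active
         / ((double)WAVEFRONT_SIZE * (double)num_instructions);
}
```

For example, if one instruction runs with all 64 lanes active and the next runs with only 16 lanes inside a divergent branch, the model reports 62.5%.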


Kernel Occupancy for HD7000 series will be supported in the APP Profiler 2.5.

Based on your GPR, LDS and work-group size numbers, the compute unit occupancy won't be high. Reducing VGPR usage may increase occupancy and help hide memory latency.
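Until the profiler reports it, you can estimate the limits by hand. The sketch below is a simplified model that assumes the commonly documented HD7000 figures: 256 VGPRs per lane per SIMD, at most 10 wavefronts per SIMD, 4 SIMDs and 64 KB of LDS per compute unit, VGPRs allocated in multiples of 4, and at most 32 KB of LDS per work-group. It ignores SGPRs and other limits, and the function names are made up for the example.

```c
#include <assert.h>

/* Assumed HD7000-series limits; check them against your hardware docs. */
#define VGPRS_PER_SIMD      256
#define MAX_WAVES_PER_SIMD   10
#define SIMDS_PER_CU          4
#define LDS_BYTES_PER_CU  65536
#define MAX_LDS_PER_GROUP 32768
#define WAVEFRONT_SIZE       64

/* Wavefronts per SIMD allowed by VGPR usage alone. */
int waves_per_simd_by_vgpr(int vgprs_per_work_item)
{
    assert(vgprs_per_work_item > 0 && vgprs_per_work_item <= VGPRS_PER_SIMD);
    int allocated = (vgprs_per_work_item + 3) / 4 * 4;  /* round up to 4 */
    int waves = VGPRS_PER_SIMD / allocated;
    return waves > MAX_WAVES_PER_SIMD ? MAX_WAVES_PER_SIMD : waves;
}

/* Wavefronts per compute unit allowed by LDS usage alone. */
int waves_per_cu_by_lds(int lds_bytes_per_group, int work_group_size)
{
    assert(lds_bytes_per_group >= 0 && lds_bytes_per_group <= MAX_LDS_PER_GROUP);
    assert(work_group_size > 0);
    int max_waves = MAX_WAVES_PER_SIMD * SIMDS_PER_CU;
    if (lds_bytes_per_group == 0)
        return max_waves;  /* LDS is not a limit */
    int waves_per_group = (work_group_size + WAVEFRONT_SIZE - 1) / WAVEFRONT_SIZE;
    int groups = LDS_BYTES_PER_CU / lds_bytes_per_group;
    int waves = groups * waves_per_group;
    return waves > max_waves ? max_waves : waves;
}
```

Under these assumptions, a kernel using 64 VGPRs can keep at most 4 wavefronts per SIMD in flight, while getting down to 24 or fewer reaches the cap of 10. A 256-item work-group using 16 KB of LDS is limited to 16 wavefronts per compute unit.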

Journeyman III

Thanks again. How would you reduce VGPR? Any tips on that topic?