Hi, I'm trying to optimize an image correlation algorithm that computes dot products over the different intersections (overlapping regions) of two float4 images. On the CPU, OpenCL reaches ~90% of L1 bandwidth. On an HD 4850 I'm seeing 70% of L1 bandwidth with GLSL and around 50% with OpenCL (I suppose it's slower because the OpenCL array fetches are uncached and I have to use terrible caching hacks). I'm wondering how the HD 5xxx series handles this algorithm, and whether it's possible to reach 90% of L1 bandwidth (or more) on the GPU...
I have the code up on GitHub:
http://github.com/kig/correlate_opencl
correlate.cl is the GPU kernel, correlate2.cl is the CPU OpenCL kernel, correlate.fs is the GLSL kernel, correlate_naive.cl is a naive GPU kernel, and correlate_image.cl is a GPU kernel that uses images (untested, because I have no hardware with image support).
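To make the "dot products over intersections" part concrete, here is a rough CPU reference sketch in plain C. It is not taken from the repository: it assumes both images have the same size and are stored row-major, and that the correlation for an offset (ox, oy) is the sum of float4 dot products over the region where the second image, shifted by that offset, overlaps the first. The names float4_t, dot4 and correlate_at are made up for this illustration.

```c
#include <stddef.h>

/* Stand-in for OpenCL's float4 (hypothetical name, for illustration only). */
typedef struct { float x, y, z, w; } float4_t;

/* Four-component dot product of two pixels. */
static float dot4(float4_t a, float4_t b) {
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

/*
 * Assumed definition: shift image b by (ox, oy) relative to image a and
 * sum the dot products of the pixels that overlap.
 * Both images are w x h pixels, row-major.
 */
float correlate_at(const float4_t *a, const float4_t *b,
                   int w, int h, int ox, int oy) {
    /* Bounds of the overlap, in a's coordinates. */
    int x0 = ox > 0 ? ox : 0;
    int x1 = ox < 0 ? w + ox : w;
    int y0 = oy > 0 ? oy : 0;
    int y1 = oy < 0 ? h + oy : h;

    float sum = 0.0f;
    for (int y = y0; y < y1; y++) {
        for (int x = x0; x < x1; x++) {
            /* Pixel (x, y) of a lines up with pixel (x - ox, y - oy) of b. */
            sum += dot4(a[(size_t)y * w + x],
                        b[(size_t)(y - oy) * w + (x - ox)]);
        }
    }
    return sum; /* 0 when the shifted images do not overlap at all */
}
```

Calling correlate_at once per offset gives the full correlation surface. Every pixel pair is read once per offset and used in a single dot product, so under these assumptions the work is dominated by memory fetches, which is why I'm measuring against L1 bandwidth.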