Optimizing a repetitive dot product between two float4 images

Discussion created by ilmarih on Aug 6, 2010


Hi, I'm trying to optimize an image correlation algorithm that computes dot products between the different intersections of two float4 images. On the CPU, OpenCL achieves ~90% of L1 bandwidth. On an HD 4850 I'm seeing 70% of L1 bandwidth using GLSL and around 50% with OpenCL (I suppose it's slower because the OpenCL array fetches are uncached and I have to use terrible caching hacks). I'm wondering how the 5xxx-series cards manage on this algorithm and whether it's possible to reach 90% of L1 bandwidth (or more) on the GPU.
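To make the computation concrete, here is a minimal plain-C sketch of a naive version. It rests on an assumed definition: for every offset (dx, dy) of image b relative to image a, sum the float4 dot products over the pixels where the two images overlap. The names `float4`, `dot4` and `correlate_naive`, the equal image sizes, and the output layout are all hypothetical choices for illustration and are not taken from the kernels in the repository.

```c
/* One RGBA pixel, mirroring OpenCL's float4 (hypothetical stand-in type). */
typedef struct { float x, y, z, w; } float4;

/* Four-component dot product of two pixels. */
static float dot4(float4 a, float4 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

/*
 * Assumed definition of the correlation:
 * both images are w x h, stored row-major. Image b is shifted by (dx, dy)
 * relative to image a, and the dot products of the overlapping pixels are
 * summed. out must hold (2*w - 1) * (2*h - 1) floats and is indexed by
 * (dy + h - 1) * (2*w - 1) + (dx + w - 1).
 */
void correlate_naive(const float4 *a, const float4 *b,
                     int w, int h, float *out) {
    int ow = 2 * w - 1; /* width of the output grid of offsets */
    for (int dy = -(h - 1); dy <= h - 1; dy++) {
        for (int dx = -(w - 1); dx <= w - 1; dx++) {
            /* Bounds of the overlap region in a's coordinates. */
            int x0 = dx > 0 ? dx : 0;
            int x1 = dx < 0 ? w + dx : w;
            int y0 = dy > 0 ? dy : 0;
            int y1 = dy < 0 ? h + dy : h;
            float sum = 0.0f;
            for (int y = y0; y < y1; y++) {
                for (int x = x0; x < x1; x++) {
                    sum += dot4(a[y * w + x],
                                b[(y - dy) * w + (x - dx)]);
                }
            }
            out[(dy + h - 1) * ow + (dx + w - 1)] = sum;
        }
    }
}
```

Under this definition every output value re-reads a large part of both images, which is presumably why the algorithm ends up limited by cache bandwidth rather than by arithmetic.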

I have the code up at github: is the GPU kernel, is the CPU OpenCL kernel, correlate.fs is the GLSL kernel, is a naive GPU kernel and is an untested GPU kernel that uses images (untested because I have no hardware with image support.)