0 Replies Latest reply on Aug 6, 2010 12:43 PM by ilmarih

    Optimizing a repetitive dot product between two float4 images

    ilmarih

       

      Hi, I'm trying to optimize an image correlation algorithm that computes dot products between the different intersections of two float4 images. On the CPU, OpenCL achieves ~90% of L1 bandwidth. On an HD 4850 I'm seeing 70% of L1 bandwidth with GLSL and around 50% with OpenCL (I suppose it's slower because the OpenCL array fetches are uncached, so I have to use terrible caching hacks). I'm wondering how the 5xxx series manages on this algorithm and whether it's possible to reach 90% of L1 bandwidth (or more) on the GPU...

      I have the code up on GitHub:

      http://github.com/kig/correlate_opencl

      correlate.cl is the GPU kernel, correlate2.cl is the CPU OpenCL kernel, correlate.fs is the GLSL kernel, correlate_naive.cl is a naive GPU kernel, and correlate_image.cl is a GPU kernel that uses images (untested, because I have no hardware with image support).
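
      For anyone who wants to see the computation without opening the repository, here is a minimal plain-C sketch of what I mean by a dot product over one intersection. It is not the code from the repository: the `pixel4` type, the `correlate_offset` name, the equal image sizes, and the row-major layout are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One four-channel pixel; stands in for OpenCL's float4 (hypothetical type). */
typedef struct { float x, y, z, w; } pixel4;

/* Four-component dot product of two pixels. */
static float dot4(pixel4 p, pixel4 q) {
    return p.x * q.x + p.y * q.y + p.z * q.z + p.w * q.w;
}

/*
 * Sum of per-pixel dot products over the region where image b,
 * shifted by (dx, dy), overlaps image a. Both images are assumed
 * to be w x h pixels, stored row-major. Returns 0 if they do not overlap.
 */
float correlate_offset(const pixel4 *a, const pixel4 *b,
                       int w, int h, int dx, int dy) {
    assert(a && b && w > 0 && h > 0);

    /* Bounds of the intersection, in a's coordinates. */
    int x0 = dx > 0 ? dx : 0;
    int y0 = dy > 0 ? dy : 0;
    int x1 = dx < 0 ? w + dx : w;
    int y1 = dy < 0 ? h + dy : h;

    float sum = 0.0f;
    for (int y = y0; y < y1; y++) {
        for (int x = x0; x < x1; x++) {
            /* Pixel (x, y) in a lines up with (x - dx, y - dy) in b. */
            sum += dot4(a[(size_t)y * w + x],
                        b[(size_t)(y - dy) * w + (x - dx)]);
        }
    }
    return sum;
}
```

      Calling `correlate_offset` once per (dx, dy) gives one correlation value per intersection. Every call re-reads most of both images, which is why the whole thing is bound by fetch bandwidth rather than arithmetic.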