
Image Convolution Throughput: OpenCL vs. OpenGL

Question asked by settle on May 3, 2012
Latest reply on May 3, 2012 by shaq

I've written a basic image convolution as both an OpenCL kernel and an OpenGL fragment shader, and analyzed them with the APP KernelAnalyzer (Version 1.12.1288) and the GPU ShaderAnalyzer (Version 1.59.3208), each at its default settings.


Basic Image Convolution in OpenCL:

__constant sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE
                             | CLK_ADDRESS_CLAMP
                             | CLK_FILTER_LINEAR;

__constant const int max_w_size = 25;

// h holds the w_size tap offsets, w the matching weights.
__kernel void convolution_kernel(const int w_size,
                                 read_only image2d_t src_image,
                                 __constant const int2 *h,
                                 __constant const float *w,
                                 write_only image2d_t dst_image)
{
    // One work-item per output pixel.
    int2 x = (int2)(get_global_id(0), get_global_id(1));

    // Accumulate the weighted sum of the source pixels at x + h[i].
    float4 convolution = (float4)(0);
    for (int i = 0; i < w_size; ++i)
        convolution += w[i] * read_imagef(src_image, sampler, x + h[i]);
    write_imagef(dst_image, x, convolution);
}
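
For reference, the per-pixel sum that the kernel above is meant to compute, dst[x] = sum over i of w[i] * src[x + h[i]], can be sketched on the CPU. This is only a minimal illustration in plain Python, not the code that was analyzed: the names `convolve_reference` and `fetch` are hypothetical, the image is a single-channel list of rows, fetches are nearest-neighbor, and out-of-range reads are assumed to return 0 (a zero border color).

```python
def convolve_reference(src, offsets, weights):
    """CPU sketch of dst[x] = sum_i weights[i] * src[x + offsets[i]]."""
    height, width = len(src), len(src[0])

    def fetch(col, row):
        # Assumption: reads outside the image return 0 (zero border color).
        if 0 <= col < width and 0 <= row < height:
            return src[row][col]
        return 0.0

    dst = [[0.0] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            acc = 0.0
            # One fetch and one multiply-add per tap, as in the kernel loop.
            for (dx, dy), w in zip(offsets, weights):
                acc += w * fetch(col + dx, row + dy)
            dst[row][col] = acc
    return dst


# Usage: a 3-tap horizontal box filter on a 1x3 image.
row_sum = convolve_reference([[1.0, 2.0, 3.0]],
                             [(-1, 0), (0, 0), (1, 0)],
                             [1.0, 1.0, 1.0])
```

Each output pixel costs `w_size` fetches, which is the work that both the kernel and the shader are expected to perform.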



Compiler Statistics (Using CAL 12.4)

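Name | Code | GPR | Scratch Reg | Min | Max | Avg | ALU | Fetch | Write | Est Cycles | ALU:Fetch | BottleNeck | Threads\Clock | Throughput
-----|------|-----|-------------|-----|-----|-----|-----|-------|-------|------------|-----------|------------|---------------|-----------
Radeon HD 6970 | Cayman | 7 | 0 | 1.00 | 357.00 | 43.69 | 15 | 3 | 1 | 38.55 | 0.30 | Global Fetch | 0.42 | 365 M Threads\Sec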


Basic Image Convolution in OpenGL:

const int max_w_size = 25;

uniform int w_size;
uniform sampler2D src_image;
uniform vec2 h[max_w_size];
uniform float w[max_w_size];

in vec2 x;

out vec4 convolution;

void main(void)
{
    vec4 convolution = vec4(0);
    for (int i = 0; i < w_size; ++i)
        convolution += w[i] * texture(src_image, x + h[i]);
}



Compiler Statistics (Using Catalyst 12.4)

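Name | Code | GPR | Scratch Reg | Min | Max | Avg | ALU | TEX | VEX | Est Cycles(Bi) | ALU:TEX(Bi) | BottleNeck(Bi) | Pixels\Clock(Bi) | Throughput(Bi)
-----|------|-----|-------------|-----|-----|-----|-----|-----|-----|----------------|-------------|----------------|------------------|---------------
Radeon HD 6970 | Cayman | 0 | 0 | 1.00 | 21.33 | 4.72 | 5 | 0 | 0 | 4.72 | 4.72 | ALU Ops | 3.39 | 2983 MPixels\Sec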
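
To put a number on the gap, the figures from the two tables above can be cross-checked with a few lines of Python. This is a rough sketch under two assumptions that do not come from the analyzers' documentation: that throughput is items per clock multiplied by the engine clock, and that the clock is the Radeon HD 6970's reference 880 MHz. The function name `throughput_m_per_sec` is hypothetical.

```python
# Figures copied from the two analyzer tables.
opencl_threads_per_clock = 0.42
opencl_throughput_m = 365   # M threads/sec, as reported
opengl_pixels_per_clock = 3.39
opengl_throughput_m = 2983  # M pixels/sec, as reported

# Assumption: reference engine clock of the Radeon HD 6970, in MHz.
engine_clock_mhz = 880


def throughput_m_per_sec(items_per_clock, clock_mhz):
    # Assumed model: throughput = items per clock * clock frequency.
    return items_per_clock * clock_mhz


opencl_estimate = throughput_m_per_sec(opencl_threads_per_clock, engine_clock_mhz)
opengl_estimate = throughput_m_per_sec(opengl_pixels_per_clock, engine_clock_mhz)

# Ratio of the reported throughputs, OpenGL over OpenCL.
ratio = opengl_throughput_m / opencl_throughput_m
print(f"OpenCL estimate: {opencl_estimate:.1f} M/s, "
      f"OpenGL estimate: {opengl_estimate:.1f} M/s, ratio: {ratio:.2f}x")
```

Under those assumptions the OpenGL estimate lands on the reported 2983 M, and the OpenCL estimate is close to the reported 365 M (the 0.42 in the table is rounded). The reported figures differ by a factor of roughly 8.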


Why does the same algorithm appear to run an order of magnitude slower in OpenCL than in OpenGL? Is there something inherent in the two compute models that accounts for the difference?