I've written a basic image-convolution OpenCL kernel and an equivalent OpenGL fragment shader, and analyzed them with the APP KernelAnalyzer (version 1.12.1288) and the GPU ShaderAnalyzer (version 1.59.3208) using their default settings.
Basic Image Convolution in OpenCL:
__constant sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE |
                               CLK_ADDRESS_CLAMP_TO_EDGE |   // address/filter modes assumed;
                               CLK_FILTER_NEAREST;           // the original flags were truncated

__constant const int max_w_size = 25;

__kernel void convolution_kernel(const int w_size,
                                 read_only image2d_t src_image,
                                 __constant const int2 *h,
                                 __constant const float *w,
                                 write_only image2d_t dst_image)
{
    int2 x = (int2)(get_global_id(0), get_global_id(1));
    float4 convolution = (float4)(0);
    for (int i = 0; i < w_size; ++i)
        convolution += w[i] * read_imagef(src_image, sampler, x + h[i]);
    write_imagef(dst_image, x, convolution);
}
Compiler Statistics (Using CAL 12.4)
| Name | Code | GPR | Scratch Reg | Min | Max | Avg | ALU | Fetch | Write | Est Cycles | ALU:Fetch | Bottleneck | Threads/Clock | Throughput |
|------|------|-----|-------------|-----|-----|-----|-----|-------|-------|------------|-----------|------------|---------------|------------|
| Radeon HD 6970 | Cayman | 7 | 0 | 1.00 | 357.00 | 43.69 | 15 | 3 | 1 | 38.55 | 0.30 | Global Fetch | 0.42 | 365 M Threads/Sec |
Basic Image Convolution in OpenGL:
const int max_w_size = 25;
uniform int w_size;
uniform sampler2D src_image;
uniform vec2 h[max_w_size];
uniform float w[max_w_size];
in vec2 x;
out vec4 convolution;
void main()
{
    convolution = vec4(0);
    for (int i = 0; i < w_size; ++i)
        convolution += w[i] * texture(src_image, x + h[i]);
}
Compiler Statistics (Using Catalyst 12.4)
| Name | Code | GPR | Scratch Reg | Min | Max | Avg | ALU | TEX | VEX | Est Cycles (Bi) | ALU:TEX (Bi) | Bottleneck (Bi) | Pixels/Clock (Bi) | Throughput (Bi) |
|------|------|-----|-------------|-----|-----|-----|-----|-----|-----|-----------------|--------------|-----------------|-------------------|-----------------|
| Radeon HD 6970 | Cayman | 0 | 0 | 1.00 | 21.33 | 4.72 | 5 | 0 | 0 | 4.72 | 4.72 | ALU Ops | 3.39 | 2983 MPixels/Sec |
Why is it that the same algorithm seems to work an order of magnitude slower in OpenCL than in OpenGL? Is there something inherent to the different compute models that accounts for this difference?
Images are stored in memory in a tiled format. When you use pixel-shader mode (OpenGL), work-items in a work-group are indexed in that same tiled format, so you get the best cache reuse. In your OpenCL code you index work-items in the classical "linear" way, which causes L1 cache thrashing and much slower code.
PS. Usually it's not enough to make the image format linear to achieve high performance in OpenCL. Much depends on the kernel.
That brings up a good point. Nowhere in the APP KernelAnalyzer can you specify global or local sizes. What settings is it using as the basis for its performance estimates?
Is there any way to query an image's tiling information and use it to transform linear indices into the image's tiled-format indices? If not, what is the benefit of OpenCL/OpenGL interoperability if the two can't share images without taking big performance hits from mismatched formats and indexing?
In my experience, the local work size can have a dramatic effect on performance. If you pass just NULL, the OpenCL implementation will determine how to break work-items into work-groups, and if it sticks to e.g. 256x1, that could be bad for your image memory access, cache behavior, etc.
From the APP SDK Guide:
Work-Group Dimensions vs Size
The total number of work-items in the work-group is typically the most important parameter to consider, in particular when optimizing to hide latency by increasing wavefronts/compute unit. However, the choice of XYZ dimensions for the same overall work-group size can have the following second-order effects.
• Work-items in the same quarter-wavefront execute on the same cycle in the processing engine. Thus, global memory coalescing and local memory bank conflicts can be impacted by dimension, particularly if the fast-moving X dimension is small. Typically, it is best to choose an X dimension of at least 16, then optimize the memory patterns for a block of 16 work-items which differ by 1 in the X dimension.
• Work-items in the same wavefront have the same program counter and execute the same instruction on each cycle. The packing order can be important if the kernel contains divergent branches. If possible, pack together work-items that are likely to follow the same direction when control-flow is encountered. For example, consider an image-processing kernel where each work-item processes one pixel, and the control-flow depends on the color of the pixel. It might be more likely that a square of 8x8 pixels is the same color than a 64x1 strip; thus, the 8x8 would see less divergence and higher performance.
• When in doubt, a square 16x16 work-group size is a good start.
I guess you should try measuring the runtime of the CL code with different work-group sizes yourself, since the APP KernelAnalyzer can't handle the local work size parameter.