
Speed-up OpenCL CPU->GPU

Question asked by christolb29 on Apr 25, 2012
Latest reply on Apr 27, 2012 by christolb29

Hello,

 

I am using OpenCL to perform basic picture analysis for H.264 encoding. The idea is to run these operations on the GPU, because the computation seems well suited to it. However, even after trying to optimize my kernel for the GPU, I measure the same performance for OpenCL on the CPU and OpenCL on the GPU, and both are much slower than the OpenMP version.

 

The CPU is a Sandy Bridge i7-2600 @ 3.4 GHz (quite strong), and the GPU is an ATI HD 5570.

 

Here is how I perform the computation in the GPU version of my kernel:

 

__kernel void ker1_MIX_c_img2d (

const int stride,

read_only image2d_t pix,

read_only image2d_t pix1,

read_only image2d_t pix2,

__global uint* p_fladIntra_sum,

__global uint* p_fladInter1_sum,

__global uint* p_fladInter2_sum,

__global uint* p_fldc_sum,

__local int4* localmem_fladIntra,

__local int4* localmem_fladInter1,

__local int4* localmem_fladInter2,

__local int4* localmem_fldc,

__global int* nb_workgroup,

const int rest_x,

const int rest_y)

{

 

const sampler_t samplerA = CLK_NORMALIZED_COORDS_FALSE |

                              CLK_ADDRESS_CLAMP |

                              CLK_FILTER_NEAREST;

....

 

//Load data and perform FLADInter and FLDC

localmem_fladInter1[l_j*l_size_i+l_i]=convert_int4(abs(read_imagei(pix, samplerA, (int2)(g_i, g_j)) - read_imagei(pix1, samplerA, (int2)(g_i, g_j))));

localmem_fladInter2[l_j*l_size_i+l_i]=convert_int4(abs(read_imagei(pix, samplerA, (int2)(g_i, g_j)) - read_imagei(pix2, samplerA, (int2)(g_i, g_j))));

localmem_fldc[l_j*l_size_i+l_i]=convert_int4(abs(read_imagei(pix, samplerA, (int2)(g_i, g_j)) ));

 

barrier(CLK_LOCAL_MEM_FENCE);

 

/*Then make the reduction for each work group....*/

...

 

As you can see, I just compute differences between pixels over HD images (1920x1080) stored in image2d objects. According to my measurements, this section is the most expensive one.

I tried different work-group sizes and kept the one that gave the best performance (4 picture rows per work-group).
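
For comparison, the three sums accumulated by the kernel excerpt above can be written as a plain scalar C loop, which may help when checking the GPU output against a known-good result. This is only a minimal sketch: it assumes 8-bit single-channel frames stored row-major with `stride` bytes per row, and the names `frame_sums` and `reference_sums` are hypothetical. The FLADIntra sum is not covered because its computation is not shown in the excerpt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical container for the three sums seen in the kernel excerpt. */
typedef struct {
    uint64_t flad_inter1; /* sum of |pix - pix1| over the frame */
    uint64_t flad_inter2; /* sum of |pix - pix2| over the frame */
    uint64_t fldc;        /* sum of |pix| over the frame */
} frame_sums;

/* Scalar reference implementation.
 * Assumption: 8-bit single-channel frames, row-major layout,
 * `stride` bytes between the starts of consecutive rows. */
static frame_sums reference_sums(const uint8_t *pix,
                                 const uint8_t *pix1,
                                 const uint8_t *pix2,
                                 int width, int height, int stride)
{
    frame_sums s = {0, 0, 0};

    assert(width > 0 && height > 0 && stride >= width);

    for (int y = 0; y < height; ++y) {
        const uint8_t *row  = pix  + (size_t)y * (size_t)stride;
        const uint8_t *row1 = pix1 + (size_t)y * (size_t)stride;
        const uint8_t *row2 = pix2 + (size_t)y * (size_t)stride;

        for (int x = 0; x < width; ++x) {
            /* The current pixel is read once and reused for all three sums. */
            int p = row[x];
            s.flad_inter1 += (uint64_t)abs(p - (int)row1[x]);
            s.flad_inter2 += (uint64_t)abs(p - (int)row2[x]);
            s.fldc        += (uint64_t)p; /* |p| == p for unsigned 8-bit data */
        }
    }
    return s;
}
```

For a full HD frame this would be called as `reference_sums(cur, ref1, ref2, 1920, 1080, stride)`, and the returned totals compared with the per-work-group partial sums added together on the host.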

 

The only difference from the CPU version is that on the CPU I use int16 vectors and vload, which give poor performance on the GPU.

 

Is my result normal, or could I get much more speed with some optimization or trick? Should I plug my screen into a different graphics card on the machine where I take my measurements, so that the display does not disturb the GPU while it is working? (I measured that a basic printf to stdout from the host program while the kernel is running on the GPU greatly affects performance!)

 

Thank you for your help.
