

christolb29
Journeyman III

Speed-up OpenCL CPU->GPU

Hello,

I am using OpenCL to perform basic picture analysis for H.264 encoding. The idea is to run these operations on the GPU, since the computation seems well suited to it. However, even after trying to optimize my kernel for the GPU, I measure the same performance with OpenCL on the CPU and on the GPU, and both are much slower than my OpenMP version.

The CPU is a Sandy Bridge i7 2600 @ 3.4 GHz (quite powerful), and the GPU is an ATI HD 5570.

Here is how I perform the computation in my GPU kernel:

__kernel void ker1_MIX_c_img2d(const int stride,
                               read_only image2d_t pix,
                               read_only image2d_t pix1,
                               read_only image2d_t pix2,
                               __global uint* p_fladIntra_sum,
                               __global uint* p_fladInter1_sum,
                               __global uint* p_fladInter2_sum,
                               __global uint* p_fldc_sum,
                               __local int4* localmem_fladIntra,
                               __local int4* localmem_fladInter1,
                               __local int4* localmem_fladInter2,
                               __local int4* localmem_fldc,
                               __global int* nb_workgroup,
                               const int rest_x,
                               const int rest_y)
{
    const sampler_t samplerA = CLK_NORMALIZED_COORDS_FALSE |
                               CLK_ADDRESS_CLAMP |
                               CLK_FILTER_NEAREST;
    ....

    // Load data and perform FLADInter and FLDC
    localmem_fladInter1[l_j*l_size_i + l_i] =
        convert_int4(abs(read_imagei(pix, samplerA, (int2)(g_i, g_j)) -
                         read_imagei(pix1, samplerA, (int2)(g_i, g_j))));
    localmem_fladInter2[l_j*l_size_i + l_i] =
        convert_int4(abs(read_imagei(pix, samplerA, (int2)(g_i, g_j)) -
                         read_imagei(pix2, samplerA, (int2)(g_i, g_j))));
    localmem_fldc[l_j*l_size_i + l_i] =
        convert_int4(abs(read_imagei(pix, samplerA, (int2)(g_i, g_j))));

    barrier(CLK_LOCAL_MEM_FENCE);

    /* Then make the reduction for each work group... */
    ...

As you can see, I just compute differences between pixels over HD images (1920x1080) stored in image2d_t objects. I have measured that this section is the one that costs the most.

I tried different work-group sizes and kept the one that gives the best performance (4 picture rows per work group).

The only difference in my CPU version is that I use int16 vectors and vload, which give poor performance on the GPU.

Is my result normal, or could I get much more speed with some optimization or trick? Should I use another graphics card to drive my screen on the machine where I take my measurements, so as not to disturb the GPU while it is working? (I measured that a basic stdout printf from the host program while the kernel runs on the GPU greatly affects performance!)

Thank you for your help.

0 Likes
1 Solution

Greyscale, eh? Well, you're doing way more work than necessary. I presume your image format is CL_UNSIGNED_INT8 and not INT32.

You're keeping 4 independent partial sums until the end, when you sum them up; but you could sum them before storing them into local memory and thus need 1/4 of the local memory and 1/4 of the ALU ops for the reduction. It should also remove the local bank conflicts (but see the end).
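Something like this, as a sketch (localmem_sum here is a hypothetical single __local uint* standing in for one of your four int4 arrays; the rest follows your kernel):

// Assumption: localmem_sum is a __local uint* replacing localmem_fladInter1.
// Sum the four components per work-item before the store: one uint per
// work-item in local memory instead of one int4.
int4 d1 = convert_int4(abs(read_imagei(pix,  samplerA, (int2)(g_i, g_j)) -
                           read_imagei(pix1, samplerA, (int2)(g_i, g_j))));
localmem_sum[l_j*l_size_i + l_i] = d1.x + d1.y + d1.z + d1.w;
barrier(CLK_LOCAL_MEM_FENCE);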

If you need to get it working cross-platform, I would start with that, and hard-code the local work-group size (e.g. use #define TILE_WIDTH 128, etc.), as this allows the compiler to pre-calculate most of the addressing, which could be significant here. I would also try a simpler kernel that only does one difference at a time; though since changing the internal loop to track 3 rather than 12 values already relieves register pressure significantly, splitting might not help, as it would require more memory bandwidth. I'd probably try floats too, using UNORM_INT8 as the data format and read_imagef(): GPUs are good at floats, and I don't think rounding errors will matter here.
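A rough sketch of the float variant (the kernel and buffer names are made up, and it assumes the images were created with CL_UNORM_INT8):

#define TILE_W 128   /* hard-coded local size, so addressing folds at compile time */
#define TILE_H 2

__kernel __attribute__((reqd_work_group_size(TILE_W, TILE_H, 1)))
void diff_f(read_only image2d_t pix,
            read_only image2d_t pix1,
            __global float* diff,   /* one value per pixel, width*height long */
            const int width)
{
    const sampler_t s = CLK_NORMALIZED_COORDS_FALSE |
                        CLK_ADDRESS_CLAMP |
                        CLK_FILTER_NEAREST;
    int2 g = (int2)(get_global_id(0), get_global_id(1));

    /* With CL_UNORM_INT8, read_imagef returns each byte as a float in
       [0,1]; multiply by 255.0f if you need the original scale. */
    diff[g.y * width + g.x] = fabs(read_imagef(pix,  s, g).x -
                                   read_imagef(pix1, s, g).x);
}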

(On your other question regarding wider int vectors: usually just choose the native size for the device, which is normally 128 bits wide, i.e. int4, float4, char16, etc. The driver can be queried for this, but all current(?) devices are like this; going any wider is just the same as unrolling the loop once, which may be worse, since you're halving the parallelism.)
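For reference, the query is (assuming a valid cl_device_id in device):

cl_uint w;
clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT,
                sizeof(w), &w, NULL);
/* w is in elements: 4 means int4 (128 bits) is the preferred width */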

On AMD only, I would definitely consider using byte arrays rather than images, and then using amd_sad for the summation; it handles the overflow for you. If you try amd_sad, use uchar16 as the interface size (obviously this requires the data to be a multiple of 16 bytes wide, but it should give you peak performance). If you use arrays you have to do all the range checking yourself. I would probably at least try a work size of 64x1 for this, as it lets the compiler remove barriers (use the reqd_work_group_size annotation).
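A sketch of that approach (the buffer names and the grid-stride loop are my own; it needs the cl_amd_media_ops extension):

#pragma OPENCL EXTENSION cl_amd_media_ops : enable

/* 16 bytes per work-item: reinterpret each uchar16 load as uint4 and let
   amd_sad accumulate |a-b| over the 4 packed bytes per lane, overflow-free. */
__kernel __attribute__((reqd_work_group_size(64, 1, 1)))
void sad_rows(__global const uchar16* cur,
              __global const uchar16* ref,
              __global uint* partial,   /* one partial sum per work-item */
              const int n)              /* number of uchar16 elements */
{
    int gid = get_global_id(0);
    uint4 acc = (uint4)(0);
    for (int i = gid; i < n; i += get_global_size(0)) {
        acc = amd_sad(as_uint4(cur[i]), as_uint4(ref[i]), acc);
    }
    partial[gid] = acc.x + acc.y + acc.z + acc.w;
}

(The per-work-item partials can then be reduced in a second pass or on the host.)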

I would definitely hard-code the local work-group size, which lets you simplify the interface and the coding as well (e.g. you can just define the reduction array inline). The compiler will automatically unroll a loop if you give it a constant loop specification. I've had a great deal of problems using #pragma unroll with AMD's compiler in the past (I don't know if it's been fixed now), but it generally isn't required for small, compile-time-decipherable loops like the ones you have here.

Finally, for your reduction, just do it with simple flat addressing. It doesn't matter in what order you read the data when you sum it, so long as each item is read once; e.g. use len = width*height and just iterate over that.
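Putting the last two suggestions together, a sketch (the sizes and names are illustrative; it assumes one input value per work-item):

#define TILE_W 64
#define TILE_H 2
#define TILE_LEN (TILE_W * TILE_H)

__kernel __attribute__((reqd_work_group_size(TILE_W, TILE_H, 1)))
void reduce_flat(__global const uint* in, __global uint* group_sums)
{
    /* size is a compile-time constant, so the array can be declared inline */
    __local uint sums[TILE_LEN];

    unsigned int lid = get_local_id(1) * TILE_W + get_local_id(0);
    sums[lid] = in[get_global_id(1) * get_global_size(0) + get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    /* flat reduction: order doesn't matter as long as each item is read
       once, and the constant trip count lets the compiler unroll it */
    for (int step = TILE_LEN / 2; step > 0; step >>= 1) {
        if (lid < step)
            sums[lid] += sums[lid + step];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        group_sums[get_group_id(1) * get_num_groups(0) + get_group_id(0)] = sums[0];
}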

And I noticed you have 2 separate data sizes here; intra frames are smaller. I would definitely consider moving that to a separate routine, since it would simplify the execution paths.

Oh, and a really big thing I didn't notice (and needed the full source to see!): NEVER EVER use column-major addressing unless you absolutely have to and know what you're doing. If you get the numbers wrong you're going to totally kill performance; this alone could be the source of your abysmal results. You can easily get 1/32(!) of the LDS performance if your data width is a power of 2 >= 32 (which yours is). Always use row-major order, in the order of the dimension number (i.e. 0, 1, 2).

e.g. these (and all the rest) must be swapped (1 <-> 0):

int l_i = get_local_id(1);
int l_j = get_local_id(0);

So your local (2d) addressing should always look something like:

unsigned int lid = get_local_id(1) * get_local_size(0) + get_local_id(0);

(where get_local_size(0) should be hard-coded to a define if you can)
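For instance, with a hypothetical TILE_W define:

#define TILE_W 128   /* local work-group width, fixed at compile time */

unsigned int lid = get_local_id(1) * TILE_W + get_local_id(0);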

This would also adversely affect your OpenCL CPU results, as the data won't be accessed serially but with a (very) cache-unfriendly stride.

This might also be why 128x2 works out better than 16x16 (which is counter-intuitive, but might still just be a coincidence). I'd also try 8x8 with reqd_work_group_size set, as removing the barriers might be a win (I'm presuming your hardware supports this).

