
christolb29
Journeyman III

Speed-up OpenCL CPU->GPU

Hello,

I am using OpenCL to perform basic picture analysis used for H.264 encoding. The idea is to run these operations on the GPU, because the computation seems well suited to it. However, even after trying to optimize my kernel for the GPU, I measure the same performance with OpenCL on the CPU and on the GPU, and both are much slower than the OpenMP version.

The CPU is a Sandy Bridge i7 2600 @ 3.4 GHz (quite powerful), and the GPU is an ATI HD 5570.

Here is how I perform the computation in my GPU kernel:

__kernel void ker1_MIX_c_img2d (
        const int stride,
        read_only image2d_t pix,
        read_only image2d_t pix1,
        read_only image2d_t pix2,
        __global uint* p_fladIntra_sum,
        __global uint* p_fladInter1_sum,
        __global uint* p_fladInter2_sum,
        __global uint* p_fldc_sum,
        __local int4* localmem_fladIntra,
        __local int4* localmem_fladInter1,
        __local int4* localmem_fladInter2,
        __local int4* localmem_fldc,
        __global int* nb_workgroup,
        const int rest_x,
        const int rest_y)
{
    const sampler_t samplerA = CLK_NORMALIZED_COORDS_FALSE |
                               CLK_ADDRESS_CLAMP |
                               CLK_FILTER_NEAREST;

    ....

    //Load data and perform FLADInter and FLDC
    localmem_fladInter1[l_j*l_size_i+l_i] = convert_int4(abs(read_imagei(pix, samplerA, (int2)(g_i, g_j)) - read_imagei(pix1, samplerA, (int2)(g_i, g_j))));
    localmem_fladInter2[l_j*l_size_i+l_i] = convert_int4(abs(read_imagei(pix, samplerA, (int2)(g_i, g_j)) - read_imagei(pix2, samplerA, (int2)(g_i, g_j))));
    localmem_fldc[l_j*l_size_i+l_i]       = convert_int4(abs(read_imagei(pix, samplerA, (int2)(g_i, g_j))));

    barrier(CLK_LOCAL_MEM_FENCE);

    /* Then make the reduction for each work group.... */
    ...

As you can see, I just compute differences between pixels over HD images (1920x1080) stored in image2d_t objects. My measurements show that this section is the most expensive one.

I tried different work-group sizes and kept the one that gives the best performance (4 picture rows per work group).

The only difference in the CPU version is that I use int16 vectors and vload on the CPU, which give poor performance on the GPU.

Is my result normal, or could I get much more speed with some optimization or trick? Should I plug my screen into another graphics card on the computer where I take my measurements, so that I do not disturb the GPU while it is working? (I measured that a basic stdout printf from the host program while the kernel is running on the GPU greatly affects performance!)

Thank you for your help.

0 Likes
12 Replies
Wenju
Elite

Hi christolb29,

I think you should change to another graphics card, maybe a 5800 or better.

0 Likes

Thank you for your answer.

Yes, I plan to test with another GPU to check whether the problem (if there is one) comes from the driver or something else. Using a stronger GPU should indeed improve performance.

However, the question is more about optimizing the kernel for a "small" GPU, and whether it is normal to get such a "bad" result.

In this case I need 3 ms for the kernel execution on the GPU, whereas I do the same in 150 µs with OpenMP on the CPU (not taking the memory transfer time into account).

0 Likes
dmeiser
Elite

If your performance bottleneck is really the CPU-GPU transfers, there are a couple of strategies you could use to speed things up. A simple optimization would be to send data for several frames at a time. This would reduce the impact of PCIe latency, but not bandwidth. Another optimization is to employ double buffering and asynchronous data transfers: you process one buffer while the data for the second buffer is being transferred to the GPU. After that you switch: process the second buffer and transfer into the first. This way you can overlap computation with data transfers. Of course you can combine the two.
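In rough host-side code, the double buffering could look something like this (all names are illustrative, error checking is omitted, and a real version must also make sure the kernel that reads img[b] has finished before that buffer is overwritten two frames later):

#include <CL/cl.h>

/* Sketch: two images, a transfer queue and a compute queue, with an event
 * ordering each frame's upload before the kernel that consumes it. */
static void process_frames(cl_command_queue xfer_q, cl_command_queue comp_q,
                           cl_kernel kernel, cl_mem img[2],
                           unsigned char **frame_ptr, int num_frames,
                           size_t img_w, size_t img_h,
                           const size_t *global_size, const size_t *local_size)
{
    cl_event up_ev;
    size_t origin[3] = {0, 0, 0};
    size_t region[3] = {img_w, img_h, 1};   /* image size in texels */

    for (int f = 0; f < num_frames; ++f) {
        int b = f & 1;                      /* which of the two buffers */

        /* Non-blocking upload of frame f on the transfer queue. */
        clEnqueueWriteImage(xfer_q, img[b], CL_FALSE, origin, region,
                            0, 0, frame_ptr[f], 0, NULL, &up_ev);

        /* The kernel waits only for its own upload, so the upload of
         * frame f+1 can overlap with the kernel processing frame f. */
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &img[b]);
        clEnqueueNDRangeKernel(comp_q, kernel, 2, NULL, global_size, local_size,
                               1, &up_ev, NULL);
        clReleaseEvent(up_ev);
    }
    clFinish(xfer_q);
    clFinish(comp_q);
}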

Have you benchmarked the various parts of your program, in particular kernel execution and data transfer? You should be able to transfer the image data for one frame in a millisecond or two [(10 MB per frame)/(5 GB/s PCIe bandwidth) = 2 ms]. I'd guess the data processing takes much less than that. If your OpenMP implementation is much faster than that, you'll have a hard time beating it with the GPU.

Thank you for your answer.

For now I am focusing on optimizing the kernel execution, because it is slower than the data transfer (clEnqueueWriteImage). The kernel execution takes about 3 ms, whereas clEnqueueWriteImage takes 1 or 2 ms.

Once the kernel execution becomes faster than the data transfer (and I hope it will ^^), I will try to optimize the transfer and apply your pipelining idea. And indeed, I don't know whether it will be possible to get below 1 ms.

0 Likes
thesmileman
Journeyman III

Even though it should hit the cache, why are you reading the same pixel three times? You are also taking the abs value of it three times. So try storing "abs(read_imagei(pix, samplerA, (int2)(g_i, g_j)))" in a variable and don't compute the abs two extra times.

Another thought might be to store "(int2)(g_i, g_j)" in an int2 variable once and then reuse it. Maybe the compiler will optimize that out, but maybe not, and you have used it 5 times - see the sketch at the end of this post.

Also, read_imagei already returns an int4, so there should be no need to convert it.

Also how do you know that this section is causing you the biggest performance issue?
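Something like this (a sketch using the names from your kernel):

int2 pos = (int2)(g_i, g_j);                  /* coordinate built once   */
int4 cur = read_imagei(pix, samplerA, pos);   /* current frame read once */

localmem_fladInter1[l_j*l_size_i+l_i] = convert_int4(abs(cur - read_imagei(pix1, samplerA, pos)));
localmem_fladInter2[l_j*l_size_i+l_i] = convert_int4(abs(cur - read_imagei(pix2, samplerA, pos)));
localmem_fldc[l_j*l_size_i+l_i]       = convert_int4(abs(cur));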

0 Likes

The compiler will optimise repeated reads from read-only sources down to a single request - this is a big reason such memory exists, as it is easier to optimise.

0 Likes

Thank you for your answer.

Indeed, I thought about it, and I made a kernel which first reads all the data I need and stores it into local memory. But I did not measure any improvement, and I did not understand why. I measure the execution time of my kernel with clGetEventProfilingInfo, so I just throw out everything but the part of the code I want to measure.
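For reference, here is roughly how I get the kernel-only time (the command queue is created with CL_QUEUE_PROFILING_ENABLE; names simplified, error checking omitted):

cl_event ev;
cl_ulong t_start, t_end;

clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_size, local_size, 0, NULL, &ev);
clWaitForEvents(1, &ev);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t_start), &t_start, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,   sizeof(t_end),   &t_end,   NULL);
printf("kernel time: %.3f ms\n", (t_end - t_start) * 1e-6);   /* times are in ns */
clReleaseEvent(ev);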

Your tips are welcome; I was able to improve the execution by a few microseconds by removing the convert.

0 Likes
notzed
Challenger

What is the source data format?  int4 RGBA?  Or ubyte RGBA?  AMD has helper functions for 8-bit SAD (AMD media ops: amd_sad) which might be a better fit if you're using bytes, although I haven't used them.

For image data, a 2d workgroup size is best as that is how the image memory is organised.  e.g. start at 16x16 and fiddle till you get a faster result.  Also for local memory a 4-element 32-bit data access will cause guaranteed bank conflicts which can be quite expensive. A poorly implemented reduction step could also be costly (how did you determine this fragment was the slow bit?).

If your data doesn't need to be in an image, using an array might be faster for something like this with such a regular access pattern.

Are you timing a whole image load/return?  This might not be 'fair' in every case since you can often pipeline the processing to hide the memory transfers, and ideally intermediate results don't need to go back to the cpu.  And presumably this is only one part of the process; even if you can't pipeline the transfers, the data only needs to be transferred once, so the transfer time can be amortised over all the times you use it.  But you're not really doing much work, so it probably won't be possible to beat a cpu if you have to time the memory transfers and you aren't doing other stuff with the data once it's there.  You could do both too.

I don't know much specifically about h264: is there any relationship between those images or are they all independent frames?  It may be counter-intuitive, but sometimes things are faster if you run smaller kernels N times rather than try to do N things in one (less resource requirements == more alu parallelism).

Do you have any more complete code to look at?   And just what times are you getting?

0 Likes

I am working with greyscale pictures, so 1 byte per pixel. I use RGBA pixels to store 4 pixels, and I use the int4 type because I make a reduction on the same data, and using bytes would lead to overflow. I am going to write another version of the kernel that does the reduction differently, to test the efficiency of a __local char4* localmem.

I did not think about using intrinsics directly, as I thought the compiler would do the same, but it could indeed be a field of investigation.

Here is my full kernel:

//#define SKIP_REDUCTION

/* WorkGroup size must be a power of 2 */
__kernel void ker1_MIX_c_img2d (
        const int stride,
        read_only image2d_t pix,
        read_only image2d_t pix1,
        read_only image2d_t pix2,
        __global uint* p_fladIntra_sum,
        __global uint* p_fladInter1_sum,
        __global uint* p_fladInter2_sum,
        __global uint* p_fldc_sum,
        __local int4* localmem_fladIntra,
        __local int4* localmem_fladInter1,
        __local int4* localmem_fladInter2,
        __local int4* localmem_fldc,
        __global int* nb_workgroup,
        const int rest_x,
        const int rest_y)
{
    const sampler_t samplerA = CLK_NORMALIZED_COORDS_FALSE |
                               CLK_ADDRESS_CLAMP |
                               CLK_FILTER_NEAREST;

    int4 tmp_Intra;
    int4 tmp_Inter1;
    int4 tmp_Inter2;
    int4 tmp_Fldc;
    int4 vect_zero = (int4)(0);

    int g_i = get_global_id(1);
    int g_j = get_global_id(0);
    int l_i = get_local_id(1);
    int l_j = get_local_id(0);

    int gid_i = get_group_id(1);
    int gid_j = get_group_id(0);

    int wg_i = get_num_groups(1);
    int wg_j = get_num_groups(0);

    int l_size_i = get_local_size(1);
    int l_size_j = get_local_size(0);
    int g_size_i = get_global_size(1);
    int g_size_j = get_global_size(0);

    int localSize  = get_local_size(0)*get_local_size(1);
    int globalSize = get_global_size(0)*get_global_size(1);

    //Load data and perform FLAD
    if(g_j%2==0 && g_j<g_size_j-1)
        localmem_fladIntra[(l_j/2)*l_size_i+l_i] = convert_int4(abs(read_imagei(pix, samplerA, (int2)(g_i, g_j)) - read_imagei(pix, samplerA, (int2)(g_i, g_j+1))));

    localmem_fladInter1[l_j*l_size_i+l_i] = convert_int4(abs(read_imagei(pix, samplerA, (int2)(g_i, g_j)) - read_imagei(pix1, samplerA, (int2)(g_i, g_j))));
    localmem_fladInter2[l_j*l_size_i+l_i] = convert_int4(abs(read_imagei(pix, samplerA, (int2)(g_i, g_j)) - read_imagei(pix2, samplerA, (int2)(g_i, g_j))));
    localmem_fldc[l_j*l_size_i+l_i]       = convert_int4(abs(read_imagei(pix, samplerA, (int2)(g_i, g_j))));

#ifndef SKIP_REDUCTION

    /* ####################################################################### */
    barrier(CLK_LOCAL_MEM_FENCE);
    /* ####################################################################### */

    // repeat reduction in local memory
    for(int s = localSize/2; s > 1; s >>= 1)
    {
        // skip non-valid values from partially filled workgroups (last WG of each dimension)
        if(l_j*l_size_i+l_i < s)
        {
            if(s <= localSize/4)
                localmem_fladIntra[l_j*l_size_i+l_i] += localmem_fladIntra[l_j*l_size_i+l_i + s];

            localmem_fladInter1[l_j*l_size_i+l_i] += localmem_fladInter1[l_j*l_size_i+l_i + s];
            localmem_fladInter2[l_j*l_size_i+l_i] += localmem_fladInter2[l_j*l_size_i+l_i + s];
            localmem_fldc[l_j*l_size_i+l_i]       += localmem_fldc[l_j*l_size_i+l_i + s];
        }
        // keep barrier outside conditional
        /* ####################################################################### */
        barrier(CLK_LOCAL_MEM_FENCE);
        /* ####################################################################### */
    }

    // write result to global memory
    if (l_i == 0 && l_j == 0)
    {
        tmp_Intra  = localmem_fladIntra[0]  + localmem_fladIntra[1];
        tmp_Inter1 = localmem_fladInter1[0] + localmem_fladInter1[1];
        tmp_Inter2 = localmem_fladInter2[0] + localmem_fladInter2[1];
        tmp_Fldc   = localmem_fldc[0]       + localmem_fldc[1];

        p_fladIntra_sum[gid_j*wg_i + gid_i]  = tmp_Intra.s0  + tmp_Intra.s1  + tmp_Intra.s2  + tmp_Intra.s3;
        p_fladInter1_sum[gid_j*wg_i + gid_i] = tmp_Inter1.s0 + tmp_Inter1.s1 + tmp_Inter1.s2 + tmp_Inter1.s3;
        p_fladInter2_sum[gid_j*wg_i + gid_i] = tmp_Inter2.s0 + tmp_Inter2.s1 + tmp_Inter2.s2 + tmp_Inter2.s3;
        p_fldc_sum[gid_j*wg_i + gid_i]       = tmp_Fldc.s0   + tmp_Fldc.s1   + tmp_Fldc.s2   + tmp_Fldc.s3;
    }

#endif
}

For now I measure the execution time of the kernel with clGetEventProfilingInfo, so it does not take clEnqueueWriteImage and clEnqueueReadBuffer into account. The best performance I get with this kernel is around 3 ms (kernel execution only), and the data transfer of one HD image is about 1 or 2 ms. This result is already good enough for real-time encoding, but I want to test whether this GPU is able to do better (and more computation).

Frames are independent. This analysis aims to detect "fade to black" or scene changes (scene cuts) in a video, to help the encoder work better. This is why I compute differences between pixels.

As you can see, I tried to throw out the reduction, and measured that the execution time was almost the same.

It is weird, but I get better performance with a work-group size like 128x2 (columns x rows) than with a square one such as 16x16.

So you think it could be better to run one kernel per function?

Does the compiler vectorize better with vectors longer than 4 elements? Could it be better to read 4 pixels and store them into an int16, like this:

(localmem_pix[l_j*l_size_i+l_i]).s0123 = convert_int4(read_imagei(pix, samplerA, (int2)(g_i,   g_j)));
(localmem_pix[l_j*l_size_i+l_i]).s4567 = convert_int4(read_imagei(pix, samplerA, (int2)(g_i+1, g_j)));
(localmem_pix[l_j*l_size_i+l_i]).s89ab = convert_int4(read_imagei(pix, samplerA, (int2)(g_i+2, g_j)));
(localmem_pix[l_j*l_size_i+l_i]).scdef = convert_int4(read_imagei(pix, samplerA, (int2)(g_i+3, g_j)));

(My kernel using this does not work for now, so I cannot measure it.)

Thank you

0 Likes

Have you tried hard-coding your loop parameters and then using "#pragma unroll X", where X is the loop trip count? I have found that even for relatively small loops this can give performance increases of 5-15x in some cases. Note that if the loop is too big it might cause registers to spill. The difficult part is that in your case you need different versions to support different local work sizes. I have found I simply have to deal with that in my code and keep lots of versions if I am going to support multiple architectures, but the performance gains are too large to ignore. I have a preprocessor that duplicates my code multiple times for this so that I don't end up with a typo in one version.
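As a rough illustration of the idea (the kernel and the 4-rows-per-item split are made up for the example; the point is that the trip count is a compile-time constant):

#define ROWS_PER_ITEM 4   /* hypothetical compile-time constant */

__kernel void sum_rows(read_only image2d_t pix, __global int *out)
{
    const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP |
                          CLK_FILTER_NEAREST;
    int x  = get_global_id(0);
    int y0 = get_global_id(1) * ROWS_PER_ITEM;
    int acc = 0;

    /* Constant trip count; the pragma asks the compiler to unroll it fully. */
    #pragma unroll 4
    for (int r = 0; r < ROWS_PER_ITEM; r++) {
        int4 p = read_imagei(pix, smp, (int2)(x, y0 + r));
        acc += p.s0 + p.s1 + p.s2 + p.s3;
    }
    out[get_global_id(1) * get_global_size(0) + x] = acc;
}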

I have not seen a performance increase with int16. Sometimes I get a better result with int8, but sometimes slightly worse, depending on what I am doing, so you might just need to try.

Also you might check your memory access pattern to make sure it is optimized (check the APP parallel programming guide for more info on how this works best for AMD).

0 Likes

Greyscale eh?  Well you're doing way more work than necessary.  I presume your image format is CL_UNSIGNED_INT8 and not INT32.

You're keeping 4 independent partial sums until the end when you sum them up, but you could sum them before storing them into local memory and thus require 1/4 the local memory and 1/4 of the alu ops for the reduction.  It should also remove the local bank conflicts (but see the end).
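e.g. something along these lines for one of the sums (a fragment, assuming the localmem_* buffers become __local int* rather than __local int4*):

/* Pre-sum the 4 byte channels of each texel before storing, so local memory
 * holds one int per work-item instead of an int4. */
uint4 d1 = abs(read_imagei(pix, samplerA, (int2)(g_i, g_j)) -
               read_imagei(pix1, samplerA, (int2)(g_i, g_j)));
localmem_fladInter1[l_j*l_size_i + l_i] = (int)(d1.s0 + d1.s1 + d1.s2 + d1.s3);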

If you need to get it working cross-platform, I would start working with that and hard-coding the local work-group size (e.g. use #define TILE_WIDTH 128, etc.), as this will allow the compiler to pre-calculate most of the addressing, which could be significant here.  I would also just try a simpler kernel that only does one difference at a time; changing the internal loop to track 3 rather than 12 values already relieves register pressure significantly, although it might not help overall since it requires more memory bandwidth.  I'd probably try floats too - using UNORM_INT8 as the data format and read_imagef(); GPUs are good at floats, and I don't think rounding errors will matter here.

(on your other question regarding wider int vectors: usually just choose the native size for the devices, which is normally 128 bits wide, i.e. int4, float4, char16, etc - the driver can be queried for this but all current(?) devices are like this, any wider and it's just the same as unrolling the loop once which may be worse since you're reducing the parallelism by half).
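For reference, the preferred width can be queried on the host like this (where device is your cl_device_id):

cl_uint w;
clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT, sizeof(w), &w, NULL);
/* w = preferred number of int elements per vector, e.g. 4 (128 bits) */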

On AMD only, I would definitely consider using byte arrays rather than images, and then use amd_sad for the summation, it handles the overflow for you.  If you try amd_sad, use uchar16 as the interface size (obviously this requires the data be 16-byte-multiple-wide, but should give you the peak performance).  If using arrays you have to do all the range checking yourself.  I would probably at least try a work-size of 64x1 for this, as it lets the compiler remove barriers (use the reqd_work_group_size  annotation)
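I haven't used it either, but roughly, a hypothetical SAD kernel using it might look like this (my understanding of the cl_amd_media_ops interface, so treat it as a sketch and check the extension spec):

#pragma OPENCL EXTENSION cl_amd_media_ops : enable

/* Per-work-item SAD over 16 greyscale pixels, reading the byte arrays as
 * packed uints (4 pixels per uint); cur/ref/out are illustrative names. */
__kernel void sad16(__global const uint *cur,
                    __global const uint *ref,
                    __global uint *out)
{
    size_t gid = get_global_id(0);
    uint4 a = vload4(gid, cur);
    uint4 b = vload4(gid, ref);
    /* amd_sad(a, b, acc): acc + sum of absolute byte differences of a and b */
    uint4 acc = amd_sad(a, b, (uint4)(0));
    out[gid] = acc.s0 + acc.s1 + acc.s2 + acc.s3;
}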

I would definitely hard-code the local workgroup size, which lets you simplify the interface and coding as well (e.g. you can just define the reduction array inline).  The compiler will automatically unroll the loop if you give it a constant loop specification - i've had a great deal of problems using #pragma unroll with amd's compiler in the past, but i don't know if it's been fixed now - but generally it isn't required for small hard-coded compile-time-decipherable loops which you have here.

Finally, for your reduction, just do it with simple flat addressing.  It doesn't matter what order you read the data when you sum it so long as each item is read once.   e.g. use len=width*height; then just iterate over that.
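e.g. putting the last few points together (hard-coded 8x8 work-group, reduction array defined inline, flat addressing) for one of the sums - purely illustrative:

#define TILE_W 8                 /* illustrative; must match the launch size */
#define TILE_H 8

__attribute__((reqd_work_group_size(TILE_W, TILE_H, 1)))
__kernel void fldc_sum(read_only image2d_t pix, __global int *out)
{
    const sampler_t s = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP |
                        CLK_FILTER_NEAREST;

    /* Reduction buffer defined inline: no __local argument, no
     * clSetKernelArg for it on the host side. */
    __local int sums[TILE_W * TILE_H];

    /* Row-major flat index: dimension 0 varies fastest. */
    int lid = get_local_id(1) * TILE_W + get_local_id(0);

    int4 p = read_imagei(pix, s, (int2)(get_global_id(0), get_global_id(1)));
    sums[lid] = p.s0 + p.s1 + p.s2 + p.s3;       /* pre-sum the 4 channels */
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Flat reduction over TILE_W*TILE_H elements; the trip count is a
     * compile-time constant, so the compiler can unroll it. */
    for (int step = (TILE_W * TILE_H) / 2; step > 0; step >>= 1) {
        if (lid < step)
            sums[lid] += sums[lid + step];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        out[get_group_id(1) * get_num_groups(0) + get_group_id(0)] = sums[0];
}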

And I noticed you have 2 separate data sizes here, intra-frames are smaller.  I would definitely consider moving that to a separate routine since it would simplify the execution paths.

Oh.

Oh ... a really big thing I didn't notice (and needed the full source to see!!): NEVER EVER use column-major addressing unless you absolutely have to and know what you're doing; if you get the numbers wrong you're going to totally kill performance - this alone could be the source of your abysmal performance.  You could easily get 1/32(!!!) of the LDS performance if your data width is a power of 2 >= 32 (as you have).  Always do row-major order in the order of the dimension number (i.e. 0, 1, 2).

e.g.these (and all the rest) must be swapped ( 1 <> 0)

int l_i= get_local_id(1);

int l_j = get_local_id(0);

So your local (2d) addressing should always look something like:

unsigned int lid = get_local_id(1) * get_local_size(0) + get_local_id(0);

(where get_local_size(0) should be hard-coded to a define if you can)

This would also adversely affect your OpenCL/CPU results, as the data won't be accessed serially but with a (very) cache-unfriendly stride.

This might also be why 128x2 works out better than 16x16 (which is counter-intuitive, but still might just be a coincidence).  I'd also try 8x8 with a reqd_work_group_size set as removing the barriers might be a win (I'm only presuming your hardware supports this).

0 Likes

Thank you very much for your help.

- I do use CL_UNSIGNED_INT8 data with RGBA to store my greyscale pixels. To my mind, the most important thing was to compute on a vector data type. But as you said, the reduction is faster when doing a "pre-reduction" on the vector. I gained 0.5 ms from this tip (out of 2.25 ms).

- Hardcoding the local workgroup size also gains a few microseconds; as I do not need that much flexibility, it is welcome.

- The indexing I was using was indeed the wrong way around, thanks a lot for pointing that out. It still looks more intuitive to me the other way x). But I gain 0.6 ms using the right way, with an 8x8 workgroup size, which turns out to be the best size.

- I am not sure I understand what you mean by "flat addressing":

Finally, for your reduction, just do it with simple flat addressing.

I am working on rewriting the reduction loop following your indications, and I will also try float data.

Anyway, I reduced my execution time by 55%, which is very good =); it's now just below 1 ms per frame.

0 Likes