

zhuzxy
Journeyman III

Can anyone give me some advice on optimizing this OpenCL code?

   My code needs to do some calculations for every point in the image. My OpenCL implementation is like the following: each work-item handles 1 point, and the variable 'pixel' is an int16 vector. It is first initialized from global memory (the image), and then the vector values are compared against 2 thresholds.

    // initialize the pixel from the image data.

        pixel.s0 = src_image[ x_value1 + (y_value1)* image_width];
        pixel.s1 = src_image[ x_value2 + 1 + (y_value2) * image_width];
        ...

        pixel.se = src_image[ x_valuee + (y_valuee)* image_width];
        pixel.sf =  src_image[x_valuef + (y_valuef)* image_width];

        corner_decision = false;

       //decide if there's continuous 9 points larger/smaller than the thresholds.

    // cb  , c_b are the thresholds

        int8 data=pixel.s01234567;

        corner_decision  = (corner_decision  | ( ( ( all(data > cb)  && ( (pixel.s8 > cb.s0) ||(pixel.sf > cb.s0))) ||
             ( all(data < c_b) && ( (pixel.s8 < c_b.s0)||(pixel.sf <  c_b.s0))) ) ));

        data=pixel.s23456789;

    ...

        data=pixel.s89abcdef;

        corner_decision  = (corner_decision  | ( ( ( all(data > cb)  && ( (pixel.s0 > cb.s0) ||(pixel.s7 > cb.s0))) ||
             ( all(data < c_b) && ( (pixel.s0 < c_b.s0)||(pixel.s7 <  c_b.s0))) ) ));

  ...

    if (corner_decision != false)

      final_res[pos] = 1;

  return;

The problem is that when I run it on the GPU, the performance is no better than a single CPU core on the A8-3850 platform. Could anyone give me some advice on which directions to optimize in?

5 Replies
maximmoroz
Journeyman III

Maybe because it is a memory-transfer-bound kernel? Are you using local memory? It might help.


I did not use local memory. I tried it once and got a worse result (maybe the local memory usage needs more optimization). But I have tried 2 kernels: in the first, each work-item handles 1 point, and in the other, each work-item handles 4 points. I assumed that if memory bandwidth were the problem, the second kernel would show an improvement, because the memory reads overlap in the 4-point case. After adjusting the work-group size, the performance of the 2 kernels is almost the same.

A question: how do I find the bottleneck in the kernel? How do I know which part of the kernel causes the bad performance? Can anyone give me some advice on this?


Have you read the AMD APP OpenCL Programming Guide? It is a must-read for anyone programming in OpenCL.


Originally posted by: zhuzxy I did not use local memory. I tried it once and got a worse result (maybe the local memory usage needs more optimization). But I have tried 2 kernels: in the first, each work-item handles 1 point, and in the other, each work-item handles 4 points. I assumed that if memory bandwidth were the problem, the second kernel would show an improvement, because the memory reads overlap in the 4-point case. After adjusting the work-group size, the performance of the 2 kernels is almost the same. A question: how do I find the bottleneck in the kernel? How do I know which part of the kernel causes the bad performance? Can anyone give me some advice on this?

Use the AMD APP Profiler that comes with the SDK.

notzed
Challenger

There are too many variables in the code you provided: it could be accessing a tight bunch of values adjacent to each other, or they could be 16 values from all over the place.  The difference matters, so without the actual code ... who knows.

Also - your code is really slow to start with.  && and || are short-circuit evaluations for scalar values, which add completely unnecessary branches:

     corner_decision  = (corner_decision  | ( ( ( all(data > cb)  && ( (pixel.s8 > cb.s0) ||(pixel.sf > cb.s0))) ||
             ( all(data < c_b) && ( (pixel.s8 < c_b.s0)||(pixel.sf <  c_b.s0))) ) ));

Will compile to a LOT of code.

Something like this might be better (note: this isn't quite correct, it only wraps at the end, not the start, and the 9-wide swizzles are shorthand, since OpenCL vectors only come in widths 2, 3, 4, 8 and 16)

 corner_decision |= all(pixel.s012345678 > cb.s012345670) | all(pixel.s012345678 < c_b.s012345670);
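To illustrate the point about `&&` and `||`, here is a minimal sketch in plain C (not OpenCL C, so it can be checked on the host) of one rotation of the test written both ways. The function names, the `p[]` array standing in for `pixel.s0` to `pixel.sf`, and the scalar thresholds `hi` and `lo` standing in for `cb` and `c_b` are all assumptions made for the example. Because each comparison already yields 0 or 1, `&` and `|` give the same truth value as `&&` and `||` without asking the compiler for short-circuit branches.

```c
#include <stdbool.h>

/* Short-circuit form: each && / || may be compiled to a branch. */
static bool segment_test_short_circuit(const int p[16], int hi, int lo)
{
    bool bright = true, dark = true;
    for (int i = 0; i < 8; ++i) {
        bright = bright && (p[i] > hi);
        dark   = dark   && (p[i] < lo);
    }
    /* The 9th point is either the one after the run (8) or before it (15). */
    return (bright && (p[8] > hi || p[15] > hi)) ||
           (dark   && (p[8] < lo || p[15] < lo));
}

/* Bitwise form: every comparison is evaluated, results are combined as 0/1. */
static bool segment_test_bitwise(const int p[16], int hi, int lo)
{
    int bright = 1, dark = 1;
    for (int i = 0; i < 8; ++i) {
        bright &= (p[i] > hi);
        dark   &= (p[i] < lo);
    }
    return (bright & ((p[8] > hi) | (p[15] > hi))) |
           (dark   & ((p[8] < lo) | (p[15] < lo)));
}
```

Whether the bitwise form is actually faster on a given GPU depends on the compiler, so it is worth confirming with a profiler.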

Although to be honest there are probably more efficient ways to perform this test rather than testing every possible combination.

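Something like the following only takes 24 steps rather than 10*18 (assuming cb, c_b are scalar, if they are not, then things are somewhat different)

 int corner = 0;
 int bright = 0;  // length of the current run of pixels above cb
 int dark = 0;    // length of the current run of pixels below c_b

// repeat the following for every index 0-f, and then for 0-7 again so that runs wrapping past f are caught (shown here for s0):
{
 bright = (pixel.s0 > cb) ? bright + 1 : 0;
 dark = (pixel.s0 < c_b) ? dark + 1 : 0;
 corner |= (bright >= 9) | (dark >= 9);
}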
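The run-counting idea can be sketched in plain C as follows (host-side C rather than OpenCL C, so it is easy to test). The function name `has_run_of_9`, the `ring[]` array standing in for `pixel.s0` to `pixel.sf`, and the scalar thresholds `hi` and `lo` standing in for `cb` and `c_b` are assumptions for the example. It keeps separate counters for brighter and darker runs and scans 16 + 8 = 24 positions, so a run of 9 that wraps past index f is still found.

```c
#include <stdbool.h>

/* Returns true if at least 9 contiguous ring pixels are all brighter than hi
   or all darker than lo. The ring is circular: index 15 is followed by 0. */
static bool has_run_of_9(const int ring[16], int hi, int lo)
{
    int bright = 0;  /* length of the current brighter-than-hi run */
    int dark = 0;    /* length of the current darker-than-lo run */
    int corner = 0;

    /* 24 steps: revisiting indices 0..7 catches runs that wrap past 15. */
    for (int step = 0; step < 24; ++step) {
        int v = ring[step & 15];
        bright = (v > hi) ? bright + 1 : 0;
        dark   = (v < lo) ? dark + 1 : 0;
        corner |= (bright >= 9) | (dark >= 9);
    }
    return corner != 0;
}
```

In a kernel the loop would normally be fully unrolled over the vector components, since OpenCL vector components cannot be indexed with a runtime value.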
