
digbug
Journeyman III

Kernel performance variability with the same code and same data

Platform:

OS: Windows 7 64-bit

SDK: APP SDK 2.8.1

CodeXL: latest

Driver: Catalyst 13.4

Problems:

When the kernel is executed 100 times with the same input data, 2-3 executions (not counting the first one) take significantly longer than the others. CodeXL reports almost identical GPU behavior for all of them.

I would like help fixing this. Is it a runtime problem, a memory access pattern issue, or something else?

Sample code:

"

__constant sampler_t imageSampler  = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
__constant sampler_t imageSampler2 = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_LINEAR;

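__kernel void calBilinearImage(__write_only image2d_t dstimg, __read_only image2d_t srcimg, int2 srcSize, float2 rate, int2 dstSize)
{
    // Integer pixel coordinates of this work-item in the destination image.
    int2 gid = (int2)((int)get_global_id(0), (int)get_global_id(1));

    // Skip work-items that fall outside the destination image.
    if (gid.x >= dstSize.x || gid.y >= dstSize.y)
        return;

    // Map the destination pixel into source coordinates and sample with linear filtering.
    float2 srcPos = (convert_float2(gid) + 0.4999999f) * rate;
    float4 temp = clamp(read_imagef(srcimg, imageSampler2, srcPos), 0.0f, 1.0f);
    write_imagef(dstimg, gid, temp);
}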

"

CodeXL reports:

Kernel                       Order  Thread ID  Call index  Global work size  Work-group size  Time (ms)  Remaining counters
calBilinearImage__k1_Turks1  6      4756       64          {1280 960 1}      {16 16 1}        0.83511    0 2 NA 0 1 19200 17
calBilinearImage__k1_Turks1  7      4756       73          {1280 960 1}      {16 16 1}        0.834      0 2 NA 0 1 19200 17
calBilinearImage__k1_Turks1  8      4756       82          {1280 960 1}      {16 16 1}        3.13944    0 2 NA 0 1 19200 17
calBilinearImage__k1_Turks1  9      4756       91          {1280 960 1}      {16 16 1}        0.81578    0 2 NA 0 1 19200 17
calBilinearImage__k1_Turks1  10     4756       100         {1280 960 1}      {16 16 1}        0.81344    0 2 NA 0 1 19200 17
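
For a longer run of 100 executions, slow outliers like the one in these rows can be picked out automatically. The following is only a minimal sketch in plain C, not part of the original test case: `find_slow_runs`, `median`, and the "slower than a multiple of the median" rule are illustrative assumptions, and the times would have to be copied from the profiler output by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for doubles (ascending). */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n samples; sorts a private copy so the input order is kept. */
static double median(const double *t, size_t n)
{
    assert(n > 0);
    double *s = malloc(n * sizeof *s);
    assert(s != NULL);
    memcpy(s, t, n * sizeof *s);
    qsort(s, n, sizeof *s, cmp_double);
    double m = (n % 2) ? s[n / 2] : 0.5 * (s[n / 2 - 1] + s[n / 2]);
    free(s);
    return m;
}

/* Hypothetical helper: stores in out (capacity n) the indices of all samples
 * slower than factor * median and returns how many were found. */
size_t find_slow_runs(const double *t, size_t n, double factor, size_t *out)
{
    double limit = factor * median(t, n);
    size_t count = 0;
    for (size_t i = 0; i < n; ++i)
        if (t[i] > limit)
            out[count++] = i;
    return count;
}
```

Reading the five kernel times above as 0.83511, 0.834, 3.13944, 0.81578 and 0.81344 (an interpretation of the pasted rows), a factor of 2.0 flags only the third execution.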
3 Replies
himanshu_gautam
Grandmaster

Hi,

Does this mean that some kernel invocations are taking up much more time than others?

Is it a CodeXL issue, or do you also see these performance drops when actually running the kernel in your application?

Could you share a complete test case that we can compile on our end?


I'm not sure whether the kernel time that CodeXL reports includes the kernel invocation time as well as the kernel execution time.

If it does, I'll focus on the kernel invocation time, since it may be an OpenCL runtime problem. If it doesn't, I'm at a loss.
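
One way to separate the two without relying on CodeXL is OpenCL's own event profiling: with a command queue created with `CL_QUEUE_PROFILING_ENABLE`, `clGetEventProfilingInfo` returns four nanosecond timestamps per kernel event (`CL_PROFILING_COMMAND_QUEUED`, `_SUBMIT`, `_START`, `_END`). The sketch below only shows the arithmetic on those four values; the struct and function names are hypothetical, and the timestamps are assumed to have been read from the event already.

```c
#include <assert.h>
#include <stdint.h>

/* The four per-command timestamps in nanoseconds, as returned by
 * clGetEventProfilingInfo. The struct itself is a hypothetical helper. */
typedef struct {
    uint64_t queued;  /* CL_PROFILING_COMMAND_QUEUED */
    uint64_t submit;  /* CL_PROFILING_COMMAND_SUBMIT */
    uint64_t start;   /* CL_PROFILING_COMMAND_START  */
    uint64_t end;     /* CL_PROFILING_COMMAND_END    */
} kernel_timestamps;

typedef struct {
    double queue_ms;   /* enqueued on the host -> submitted to the device */
    double launch_ms;  /* submitted -> started executing                  */
    double exec_ms;    /* started -> finished                             */
} kernel_breakdown;

/* Splits one kernel event into queueing, launch and execution time (ms). */
kernel_breakdown split_kernel_time(kernel_timestamps ts)
{
    assert(ts.queued <= ts.submit && ts.submit <= ts.start && ts.start <= ts.end);
    kernel_breakdown b;
    b.queue_ms  = (double)(ts.submit - ts.queued) / 1e6;
    b.launch_ms = (double)(ts.start - ts.submit) / 1e6;
    b.exec_ms   = (double)(ts.end - ts.start) / 1e6;
    return b;
}
```

If the slow runs show a large `queue_ms` or `launch_ms` while `exec_ms` stays flat, the extra time is being spent before the kernel starts rather than inside it.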

The test case is isolated from a real video post-processing application that sometimes doesn't run smoothly. The ISV located the root cause and then wrote a test case to reproduce it. I have attached the test case.

Another weird issue: on some cards (actually only on the HD 6670), the test case runs faster with 1440x1080 frames (2.25x enlarged from a 640x480 frame) than with 1280x960 frames (2x enlarged from the same 640x480 frame).
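
To put the two frame sizes side by side, here is a rough sketch of the work-group arithmetic. It assumes the host pads the global size up to a multiple of the 16x16 work-group size; the kernel's bounds check suggests this, but the host code is not shown here. `round_up` and `work_group_count` are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Rounds n up to the next multiple of m. */
size_t round_up(size_t n, size_t m)
{
    assert(m > 0);
    return (n + m - 1) / m * m;
}

/* Number of local x local work-groups needed to cover a w x h image
 * once each dimension has been padded to a multiple of local. */
size_t work_group_count(size_t w, size_t h, size_t local)
{
    return (round_up(w, local) / local) * (round_up(h, local) / local);
}
```

Under that assumption, 1280x960 needs 80 x 60 = 4800 work-groups, while 1440x1080 is padded to 1440x1088 and needs 90 x 68 = 6120. The larger frame therefore does more work, which is why its shorter execution time is surprising.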

Thanks for your help.


Well, I am running into a few issues here:

1. The project is in Chinese, so the comments and readme files are hard for me to understand.

2. The project does not compile for me at the moment. I am trying it in VS12 Ultimate, and it fails with an "mfc100d.lib not found" error. Probably some of the libraries used in the project were built with an older version of Visual Studio.

It would be helpful if you could send a minimal repro case without such dependencies. Do I have to compile your code with VS10?
