4 Replies Latest reply on Jul 30, 2013 1:30 AM by himanshu.gautam

    Kernel performance variability with the same code and same data

    digbug

      Platform:

      OS: Windows 7 64-bit

      SDK: APP SDK 2.8.1

      CodeXL: latest

      Driver: Catalyst 13.4

       

      Problems:

      When the kernel is executed 100 times with the same input data, 2-3 executions (besides the first one) take significantly longer than the others. CodeXL reports almost the same GPU behavior.

      I would like help fixing this. Is it a runtime problem, a memory access pattern issue, or something else?

       

      Sample code:

      "

      // Nearest-neighbor sampler (not used by the kernel below).
      __constant sampler_t imageSampler  = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
      // Bilinear sampler with unnormalized coordinates, clamped to the image edge.
      __constant sampler_t imageSampler2 = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_LINEAR;

      __kernel void calBilinearImage(__write_only image2d_t dstimg, __read_only image2d_t srcimg, int2 srcSize, float2 rate, int2 dstSize)
      {
          float2 gid = (float2)(get_global_id(0), get_global_id(1));

          // Skip work-items that fall outside the destination image.
          if (gid.x >= dstSize.x || gid.y >= dstSize.y)
              return;

          // Map the destination pixel center into source coordinates and sample bilinearly.
          float4 temp = clamp(read_imagef(srcimg, imageSampler2, (gid + 0.4999999f) * rate), 0.0f, 1.0f);
          write_imagef(dstimg, (int2)(gid.x, gid.y), temp);
      }

      "
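For readers who want to check the kernel's output on the CPU, here is a minimal single-channel sketch of what one work-item computes. It assumes the linear-filter weights and clamp-to-edge addressing described in the OpenCL specification for unnormalized coordinates; real hardware may use reduced-precision weights, so small differences are expected. The function names (`floor_to_int`, `clamp_index`, `sample_linear`, `resize_pixel`) are hypothetical and not part of the original test case.

```c
/* Floor of f as an int, written without <math.h>. */
int floor_to_int(float f)
{
    int i = (int)f;                 /* truncates toward zero */
    return ((float)i > f) ? i - 1 : i;
}

/* Clamp a texel index to [0, size - 1], as CLK_ADDRESS_CLAMP_TO_EDGE does. */
int clamp_index(int i, int size)
{
    if (i < 0) return 0;
    if (i > size - 1) return size - 1;
    return i;
}

/* Sample a row-major, single-channel float image at unnormalized
 * coordinates (u, v) using bilinear weights (CLK_FILTER_LINEAR). */
float sample_linear(const float *img, int width, int height, float u, float v)
{
    float fu = u - 0.5f, fv = v - 0.5f;
    int i0 = floor_to_int(fu), j0 = floor_to_int(fv);
    float a = fu - (float)i0;       /* horizontal weight, taken before clamping */
    float b = fv - (float)j0;       /* vertical weight, taken before clamping */
    int i1 = clamp_index(i0 + 1, width);
    int j1 = clamp_index(j0 + 1, height);
    i0 = clamp_index(i0, width);
    j0 = clamp_index(j0, height);
    return (1.0f - a) * (1.0f - b) * img[j0 * width + i0]
         + a * (1.0f - b) * img[j0 * width + i1]
         + (1.0f - a) * b * img[j1 * width + i0]
         + a * b * img[j1 * width + i1];
}

/* CPU counterpart of one calBilinearImage work-item for destination
 * pixel (x, y): scale into source coordinates, sample, clamp to [0, 1]. */
float resize_pixel(const float *src, int src_w, int src_h,
                   float rate_x, float rate_y, int x, int y)
{
    float u = ((float)x + 0.4999999f) * rate_x;
    float v = ((float)y + 0.4999999f) * rate_y;
    float t = sample_linear(src, src_w, src_h, u, v);
    if (t < 0.0f) t = 0.0f;
    if (t > 1.0f) t = 1.0f;
    return t;
}
```

Comparing `resize_pixel` against a few pixels read back from `dstimg` is one way to confirm that slow and fast runs produce the same result.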

       

      CodeXL reports:

      calBilinearImage__k1_Turks1   6  4756   64  {1280 960 1}  {16 16 1}  0.83511  0  2  NA  0  1  19200  17
      calBilinearImage__k1_Turks1   7  4756   73  {1280 960 1}  {16 16 1}  0.834    0  2  NA  0  1  19200  17
      calBilinearImage__k1_Turks1   8  4756   82  {1280 960 1}  {16 16 1}  3.13944  0  2  NA  0  1  19200  17
      calBilinearImage__k1_Turks1   9  4756   91  {1280 960 1}  {16 16 1}  0.81578  0  2  NA  0  1  19200  17
      calBilinearImage__k1_Turks1  10  4756  100  {1280 960 1}  {16 16 1}  0.81344  0  2  NA  0  1  19200  17
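
To pick out the slow runs automatically instead of reading them off the report, a simple approach is to compare each sample against the median. The sketch below is only an illustration: the function names and the "factor times median" rule are assumptions, not something CodeXL provides, and the sample values in the usage note are the kernel times as they appear to read from the fused columns above.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator for doubles. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n samples; returns -1.0 if n is 0 or allocation fails. */
double median_ms(const double *t, size_t n)
{
    if (n == 0) return -1.0;
    double *tmp = malloc(n * sizeof *tmp);
    if (!tmp) return -1.0;
    memcpy(tmp, t, n * sizeof *tmp);
    qsort(tmp, n, sizeof *tmp, cmp_double);
    double m = (n % 2) ? tmp[n / 2] : 0.5 * (tmp[n / 2 - 1] + tmp[n / 2]);
    free(tmp);
    return m;
}

/* Store the indices of samples slower than factor * median in out
 * (at most cap of them) and return how many such samples exist. */
size_t find_slow_runs(const double *t, size_t n, double factor,
                      size_t *out, size_t cap)
{
    double m = median_ms(t, n);
    size_t found = 0;
    if (m < 0.0) return 0;
    for (size_t i = 0; i < n; ++i) {
        if (t[i] > factor * m) {
            if (found < cap) out[found] = i;
            ++found;
        }
    }
    return found;
}
```

Called with the five times above (0.83511, 0.834, 3.13944, 0.81578, 0.81344) and a factor of 2.0, this flags only the third run. The median is used rather than the mean so that the outliers themselves do not shift the baseline.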
        • Re: Kernel performance variability with the same code and same data
          himanshu.gautam

          Hi,

          Does this mean that some kernel invocations are taking up much more time than others?

          Is it a CodeXL Issue or do you see such performance drops while actually running the kernel in your application?

          Could you share a complete test case that we can compile on our end?

            • Re: Kernel performance variability with the same code and same data
              digbug

              I’m not sure whether the kernel time reported by CodeXL includes kernel invocation time plus kernel execution time.

               

              If it does, I’ll focus on the kernel invocation time, and it may be an OpenCL runtime problem. If it doesn’t, I’m at a loss.
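
One way to separate invocation overhead from execution time without relying on the profiler is OpenCL event profiling. On a queue created with `CL_QUEUE_PROFILING_ENABLE`, `clGetEventProfilingInfo` returns four nanosecond timestamps per command: `CL_PROFILING_COMMAND_QUEUED`, `CL_PROFILING_COMMAND_SUBMIT`, `CL_PROFILING_COMMAND_START` and `CL_PROFILING_COMMAND_END`. The sketch below only shows the arithmetic on those four values; the struct and function names are hypothetical, and it makes no claim about which interval CodeXL actually reports.

```c
#include <stdint.h>

/* The four event-profiling timestamps for one kernel launch, in
 * nanoseconds. In a host program these would be filled in by
 * clGetEventProfilingInfo; here they are plain integers. */
typedef struct {
    uint64_t queued;   /* CL_PROFILING_COMMAND_QUEUED */
    uint64_t submit;   /* CL_PROFILING_COMMAND_SUBMIT */
    uint64_t start;    /* CL_PROFILING_COMMAND_START  */
    uint64_t end;      /* CL_PROFILING_COMMAND_END    */
} kernel_timestamps;

/* The same launch split into phases, in milliseconds. */
typedef struct {
    double queue_ms;   /* waiting in the host-side queue  */
    double launch_ms;  /* submitted but not yet executing */
    double exec_ms;    /* executing on the device         */
} kernel_breakdown;

kernel_breakdown split_kernel_time(kernel_timestamps t)
{
    kernel_breakdown b;
    b.queue_ms  = (double)(t.submit - t.queued) * 1e-6;
    b.launch_ms = (double)(t.start  - t.submit) * 1e-6;
    b.exec_ms   = (double)(t.end    - t.start)  * 1e-6;
    return b;
}
```

If the slow iterations show a large `exec_ms`, the extra time is being spent on the device. If only `queue_ms` or `launch_ms` grows, the runtime or driver is the more likely place to look.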

               

               

               

              The test case is isolated from a real video post-processing application that sometimes does not run smoothly. The ISV located the root cause and then wrote a test case to reproduce it. I have attached the test case.

               

               

               

              Another odd issue: on some cards (actually only on the HD 6670), the test case gives a shorter execution time for 1440x1080 frames (a 640x480 frame enlarged 2.25x) than for 1280x960 frames (the same 640x480 frame enlarged 2x).
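
For reference, the launch geometry of the two frame sizes can be worked out as follows. This sketch assumes 16x16 work-groups for both sizes (the CodeXL rows confirm that only for 1280x960) and a global size rounded up to a multiple of the work-group size, which the bounds check in the kernel suggests. The names `round_up`, `launch_geometry` and `plan_launch` are hypothetical.

```c
#include <stddef.h>

/* Round value up to the next multiple of multiple. */
size_t round_up(size_t value, size_t multiple)
{
    return (value + multiple - 1) / multiple * multiple;
}

typedef struct {
    size_t global_w, global_h;  /* padded global work size               */
    size_t groups;              /* number of work-groups launched        */
    size_t idle_items;          /* work-items rejected by the bounds check */
} launch_geometry;

/* Launch geometry for a dst_w x dst_h image and local_w x local_h work-groups. */
launch_geometry plan_launch(size_t dst_w, size_t dst_h,
                            size_t local_w, size_t local_h)
{
    launch_geometry g;
    g.global_w   = round_up(dst_w, local_w);
    g.global_h   = round_up(dst_h, local_h);
    g.groups     = (g.global_w / local_w) * (g.global_h / local_h);
    g.idle_items = g.global_w * g.global_h - dst_w * dst_h;
    return g;
}
```

Under these assumptions, 1280x960 needs 4800 work-groups with no padding, while 1440x1080 is padded to 1440x1088 and needs 6120 work-groups, 11520 of whose work-items return immediately. The larger frame therefore does about 1.27x the work, which is what makes its shorter time surprising.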

               

               

               

              Thanks for your help.

                • Re: Kernel performance variability with the same code and same data
                  himanshu.gautam

                  Well, I am running into a few issues here:

                  1. It is a Chinese project, so the comments and read-me files are hard for me to understand.

                  2. The project does not compile for me at the moment. I am trying it in VS12 Ultimate, and it gives me an "mfc100d.lib not found" error. Some of the libraries used in the project were probably built with an older VS.

                  It would be helpful if you could send a minimal repro case without such dependencies. Do I have to compile your code with VS10?