74 Replies Latest reply on May 27, 2009 7:08 AM by fandango

    Kernel function problem

    fandango

      Hi,

      I have run into a small problem. My kernel function does not work correctly and returns only a partly correct result. The function is very simple, and I believe the mistake is not mine. Could you please look at my example? What is wrong?

      kernel void func1(unsigned char src[][], unsigned char str, out unsigned char o_img<>)
      {
          // Output position
          int j = instance().x; // width
          int i = instance().y; // height
         
          int rest = j % 16;
         
          if (rest == 0)
          {
              o_img = src[i][j];
          }
          else
          {
              o_img = str;
          }
      }

       

      Input:

      3   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   3   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   3   1   1   1   1   1   1   1   1

      The kernel should simply replace each 1 with 9.

      Wrong output:

        3      9   9   9   9   9   9   9   9   9   9   9   9   9   9   3      9   9   9   9   9   9   9   9   9   9   9   9   9   9   3      9   9   9   9   9   9   9

      You can see garbage in memory. If I remove the if statement from the kernel, the output contains no garbage, but I need the first version.
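For checking the kernel's output on the host, the intended result can be computed on the CPU. The following is a minimal C++ sketch, not Brook+ code; the function name `expected_func1` and the use of `std::vector` are assumptions made for illustration, and the buffer is assumed to be row-major as in the host code posted later in the thread.

```cpp
#include <cstddef>
#include <vector>

// CPU reference for the kernel's intended behaviour: every column whose index
// is a multiple of 16 keeps the source value; all other positions get `fill`.
// The buffer is row-major: element (row i, column j) lives at j + i * width.
std::vector<unsigned char> expected_func1(const std::vector<unsigned char>& src,
                                          std::size_t width, std::size_t height,
                                          unsigned char fill)
{
    std::vector<unsigned char> out(width * height);
    for (std::size_t i = 0; i < height; ++i) {
        for (std::size_t j = 0; j < width; ++j) {
            const std::size_t idx = j + i * width;
            out[idx] = (j % 16 == 0) ? src[idx] : fill;
        }
    }
    return out;
}
```

Calling `expected_func1(src, width, height, '9')` on the input shown above would give a buffer that can be compared element by element with what the kernel wrote.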

       

       

        • Kernel function problem
          gaurav.garg

          I could not reproduce this issue on my system. Could you send me your system configuration?

            • Kernel function problem
              fandango

              Yes,

              ATI Radeon 4800 HD, driver 8.600.0.0

              Intel Core 2 6300 1.86 GHz, 1 Gb DDR2

               

                • Kernel function problem
                  fandango

                  Chunk of code just in case:


                   // Specifying the size of the 2D stream
                      unsigned int streamSize[] = {width, height};

                      // Specifying the rank of the stream
                      unsigned int rank = 2;
                     
                      brook::Stream<unsigned char> inputStream(rank, streamSize);

                      // Copying data from input buffer to input stream
                      inputStream.read(src);

                      //--------------------------------------------------------------------------
                      // Creating the output stream
                      //--------------------------------------------------------------------------    
                      streamSize[0] = width;
                      streamSize[1] = height;

                      brook::Stream<unsigned char> outputStream(rank, streamSize);

                      //--------------------------------------------------------------------------
                      // Executing kernel and copying back data
                      //--------------------------------------------------------------------------    
                      unsigned char str = '9';
                      
                      // Calling the kernel on the input and output streams
                      func1(inputStream, str, outputStream);

                      // Creating an output buffer
                      unsigned char* ref = new unsigned char[width * height];
                      memset(ref, 0, width * height);


                      // Copying data from output stream to output buffer
                      outputStream.write(ref);

                    • Kernel function problem
                      fandango

                      Any ideas?

                        • Kernel function problem
                          anlongstar

                          No idea; some mistakes are just too strange.

                            • Kernel function problem
                              fandango

                              OK. I will try reinstalling the driver or the Stream SDK.

                                • Kernel function problem
                                  gaurav.garg

                                  I am also using 8.60 driver, but I don't see the issue.

                                  What is your OS, and what are the values of width & height?

                                    • Kernel function problem
                                      fandango

                                      I have tried Windows XP 32-bit and Windows Vista 64-bit SP1, with the latest Stream SDK and driver. The result is the same.

                                      I have tried different widths and heights. Full code:

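                                      #include <cstdio>    // sprintf, fprintf
                                      #include <cstring>   // memset
                                      #include <iostream>
                                      #include <windows.h> // OutputDebugString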

                                      #include "brook/Stream.h"
                                      #include "brook/KernelInterface.h"
                                      #include "brookgenfiles/kernel.h"


                                      void print(unsigned char* arr, int width, int height)
                                      {
                                          for (int i = 0; i < height; i++)
                                          {
                                              for(int j = 0; j < width; j++)
                                              {
                                                  char cStr[256];
                                                  sprintf(cStr, "% 3c ", arr[j + i * width]);
                                                  OutputDebugString(cStr);
                                              }

                                              OutputDebugString("\n");
                                          }
                                      }

                                      int
                                      main(int argc, char* argv[])
                                      {
                                      // Specifying the width and height of the 2D buffer
                                          const unsigned int width = 49;
                                          const unsigned int height = 6;

                                          //--------------------------------------------------------------------------
                                          // Creating and initializing the input buffer
                                          //--------------------------------------------------------------------------

                                          // Creating an input buffer
                                          unsigned char* src = new unsigned char[width * height];
                                          //memset(src, 7, width * height);

                                          for (int i = 0; i < height; i++)
                                          {
                                              for(int j = 0; j < width; j++)
                                              {
                                                  if (j % 16 == 0)
                                                  {
                                                      src[j + i * width] = '3';
                                                  }
                                                  else
                                                  {
                                                      src[j + i * width] = '1';
                                                  }
                                              }

                                              OutputDebugString("\n");
                                          }

                                           print(src, width, height);

                                          // Initializing the input buffer such that
                                          // input(i,j) = i*width + j
                                      //    fillBuffer(inputBuffer, width, height);

                                          // Printing input buffer
                                          fprintf(stdout, "Input buffer:\n");

                                          //--------------------------------------------------------------------------
                                          // Creating the input stream and copying data from input buffer
                                          //--------------------------------------------------------------------------

                                          // Specifying the size of the 2D stream
                                          unsigned int streamSize[] = {width, height};

                                          // Specifying the rank of the stream
                                          unsigned int rank = 2;

                                          // Create a 2D stream of the specified size, i.e. width x height unsigned char values
                                          brook::Stream<unsigned char> inputStream(rank, streamSize);

                                          // Copying data from input buffer to input stream
                                          inputStream.read(src);

                                          //--------------------------------------------------------------------------
                                          // Creating the output stream
                                          //--------------------------------------------------------------------------   
                                          streamSize[0] = width;
                                          streamSize[1] = height;

                                          brook::Stream<unsigned char> outputStream(rank, streamSize);

                                          //--------------------------------------------------------------------------
                                          // Executing kernel and copying back data
                                          //--------------------------------------------------------------------------   
                                          unsigned char str = '9';
                                         
                                          // Calling the kernel on the input and output streams
                                          func1(inputStream, str, outputStream);

                                          // Creating an output buffer
                                          unsigned char* ref = new unsigned char[width * height];
                                          memset(ref, 0, width * height);
                                          //memset(ref, 0, width * height * sizeof(just));

                                          //print(ref, width, height);

                                          // Copying data from output stream to output buffer
                                          outputStream.write(ref);

                                          print(ref, width, height);

                                          // Check error on stream
                                          if(outputStream.error())
                                          {
                                              // Print error Log associated to stream
                                              fprintf(stdout, "%s\n", outputStream.errorLog());
                                          }

                                          fprintf(stdout, "Output buffer:\n");
                                      //    printBuffer(outputBuffer, width, 0, 0, 8, 8);

                                          //--------------------------------------------------------------------------
                                          // Checking whether the result is correct or not
                                          //--------------------------------------------------------------------------
                                         

                                          //--------------------------------------------------------------------------
                                          // Cleaning up
                                          //--------------------------------------------------------------------------
                                         
                                          delete[] src;
                                          delete[] ref;
                                         
                                          return 0;
                                      }
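The "Checking whether the result is correct or not" section of the program above is empty. A minimal sketch of such a check, in plain host-side C++, is given here. The function name `find_mismatches` is hypothetical, and it assumes a second buffer holding the expected values (however it was computed) with the same row-major layout as `ref`.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Returns the (row, column) of every position where `actual` differs from
// `expected`. Both buffers are row-major with `width` elements per row.
std::vector<std::pair<std::size_t, std::size_t>>
find_mismatches(const unsigned char* actual, const unsigned char* expected,
                std::size_t width, std::size_t height)
{
    std::vector<std::pair<std::size_t, std::size_t>> mismatches;
    for (std::size_t i = 0; i < height; ++i) {
        for (std::size_t j = 0; j < width; ++j) {
            const std::size_t idx = j + i * width;
            if (actual[idx] != expected[idx]) {
                mismatches.push_back(std::make_pair(i, j));
            }
        }
    }
    return mismatches;
}
```

Reporting coordinates rather than a pass/fail flag makes it easier to see whether the bad positions move when the stream dimensions change.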

                                        • Kernel function problem
                                          fandango

                                          I have just noticed one thing. Depending on the width and height in streamSize for the output Brook stream, I get different results; I mean the garbage appears in different locations.

                                          I suppose something is wrong in my kernel function.

                                            • Kernel function problem
                                              fandango

                                              If I set

                                                  streamSize[0] = width;
                                                  streamSize[1] = 1;

                                                  brook::Stream<unsigned char> outputStream(rank, streamSize);

                                              the result is correct. As soon as I set height > 1, the problem occurs.

                                                • Kernel function problem
                                                  fandango

                                                  Hmm. I obtained the correct result. The changes:

                                                  kernel void func1(unsigned char src[], unsigned char str, unsigned char str2, out unsigned char o_img<>)

                                                  src is now a one-dimensional array.

                                                  I also set the rank of src to 1 and the rank of dst to 2.

                                                  Please comment on this. What caused the problem? Is it my allocation approach?

                                                  unsigned char* src = new unsigned char[width * height];

                                                  Please respond.

                                                    • Kernel function problem
                                                      genaganna

                                                      Is the dst height 1?

                                                      Are you able to run samples\legacy\tests\sum? Please let me know whether the sum sample runs.

                                                        • Kernel function problem
                                                          fandango

                                                          Yes, I am able to; no problem there. Now the height and width can be anything, and my example works as I expected. I think I have misunderstood the concept, and I would ask you to explain what I have got wrong.

                                                            • Kernel function problem
                                                              genaganna

                                                              What changes did you make to your code?

                                                              I did not see any problems with the code pasted above.

                                                                • Kernel function problem
                                                                  fandango

                                                                  The changes were:

                                                                  1. I set the stream rank to 1 for the src stream instead of 2.

                                                                  unsigned int rank = 1;

                                                                  brook::Stream<unsigned char> inputStream(rank, streamSize);

                                                                   inputStream.read(src);

                                                                  2. I changed my kernel function accordingly. Note the one-dimensional src array [] instead of [][] in the previous version.

                                                                  kernel void func1(unsigned char src[], unsigned char str, out unsigned char o_img<>)
                                                                  {
                                                                      // Output position
                                                                      int2 vPos = instance().xy;
                                                                     
                                                                      int j = vPos.x; // width
                                                                      int i = vPos.y; // height
                                                                     
                                                                      int rest = j % 16;
                                                                     
                                                                      if (rest > 0)
                                                                      {
                                                                          o_img = str;
                                                                      }
                                                                      else
                                                                      {
                                                                          o_img = src[j + i * 40];
                                                                      }
                                                                  }

                                                                  That's all.

                                                                    • Kernel function problem
                                                                      fandango

                                                                      The constant 40 in the code above is the width.
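The expression `j + i * 40` is the row-major offset of element (row i, column j) in a buffer whose rows are 40 elements wide. A host-side C++ sketch of the same mapping, with the width passed in rather than hard-coded, is shown here. The names `flat_index`, `unflatten`, and `Pos2D` are hypothetical and not part of Brook+.

```cpp
#include <cstddef>

// A (row, column) position in a 2D buffer.
struct Pos2D {
    std::size_t row;
    std::size_t col;
};

// Row-major offset of (row, col) in a buffer with `width` elements per row.
inline std::size_t flat_index(std::size_t row, std::size_t col, std::size_t width)
{
    return col + row * width;
}

// Inverse mapping: recover (row, col) from a row-major offset.
inline Pos2D unflatten(std::size_t index, std::size_t width)
{
    Pos2D p;
    p.row = index / width;
    p.col = index % width;
    return p;
}
```

Passing the width as a parameter keeps the index correct when the buffer dimensions change, which a literal such as 40 does not.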

                                                                      • Kernel function problem
                                                                        genaganna

                                                                        In kernel code, the dimension of the output is important.

                                                                        You can also use src[][], but in that case both the size and the dimensions of src and dst must be the same.

                                                                         

                                                                         

                                                                         

                                                                         

                                                                          • Kernel function problem
                                                                            fandango

                                                                            I did not change the output properties, only the input.

                                                                            And your last sentence describes my first approach, where I obtained incorrect results (garbage in memory).

                                                                            So the question is still open.

                                                                              • Kernel function problem
                                                                                gaurav.garg

                                                                                With the given width & height, I could reproduce this. A quick workaround is to use a regular stream instead of a gather stream:

                                                                                kernel void func1(unsigned char src<>, unsigned char str, out unsigned char o_img<> )
                                                                                {
                                                                                    // Output position
                                                                                    int j = instance().x; // width
                                                                                    int i = instance().y; // height
                                                                                  
                                                                                    int rest = j % 16;

                                                                                    if (rest == 0)
                                                                                    {
                                                                                        o_img = src;
                                                                                    }
                                                                                    else
                                                                                    {
                                                                                        o_img = str;
                                                                                    }
                                                                                }

                                                                                  • Kernel function problem
                                                                                    gaurav.garg

                                                                                    It is now confirmed that this is a regression in Catalyst 9.4. You can try a previous version of Catalyst.

                                                                                      • Kernel function problem
                                                                                        fandango

                                                                                        So, in other words, it is a driver problem. Is that right?

                                                                                          • Kernel function problem
                                                                                            gaurav.garg

                                                                                            Yes.

                                                                                              • Kernel function problem
                                                                                                fandango

                                                                                                OK. Thank you very much for your support.

                                                                                                Today I ran into another problem: the behaviour is not what I expect.

                                                                                                Kernel:

                                                                                                kernel void motion_estimation(unsigned char src[],
                                                                                                                              unsigned char ref[],
                                                                                                                              unsigned char str,
                                                                                                                              unsigned char str2,
                                                                                                                              out unsigned char o_img<>,
                                                                                                                              out double sad<>)
                                                                                                {
                                                                                                    // Output position
                                                                                                    int2 vPos = instance().xy;
                                                                                                   
                                                                                                    int i = vPos.x; // width
                                                                                                    int j = vPos.y; // height
                                                                                                   
                                                                                                    if ((j % 4 > 0 || i % 4 > 0) || (j == 12 || i == 32))
                                                                                                    {
                                                                                                        o_img = str;
                                                                                                        sad = 1.0;
                                                                                                    }
                                                                                                    else
                                                                                                    {
                                                                                                        estimate_macroblock_4x4(src, ref, i, j, str2, o_img, sad);
                                                                                                    }
                                                                                                }

                                                                                                kernel int estimate_macroblock_4x4(unsigned char mbs[],
                                                                                                                                   unsigned char mbr[],
                                                                                                                                   int i, int j,
                                                                                                                                   unsigned char str,
                                                                                                                                   out unsigned char o_img<>,
                                                                                                                                   out double sad<>)
                                                                                                {

                                                                                                    int x, y;
                                                                                                   
                                                                                                    //sad = (double) (i + 0 + ((j + 0) * 33)) ;
                                                                                                   
                                                                                                    for (x = 0; x < 4; x++)
                                                                                                    {
                                                                                                        for (y = 0; y < 4; y++)
                                                                                                        {

                                                                                                // PROBLEM IS HERE
                                                                                                            int index = i + x + ((j + y) * 33);
                                                                                                            sad += (double)(mbs[index] - mbr[index]);
                                                                                                        }
                                                                                                   
                                                                                                    }
                                                                                                   
                                                                                                    return 0;
                                                                                                }

                                                                                                Part of the sad output is:

                                                                                                -71.000000  1.000000  1.000000  1.000000 -71.000000  1.000000  ...
                                                                                                 1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  ...
                                                                                                 1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  ...
                                                                                                 1.000000  1.000000  1.000000  1.000000  1.000000  1.000000 ...


                                                                                                -80.000000  1.000000  1.000000  1.000000 -80.000000  1.000000 ...
                                                                                                 1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  ...
                                                                                                 1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  ...
                                                                                                 1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  ...

                                                                                                -71 is the correct value. It is the sum of differences between the blocks

                                                                                                 *   1   1   1
                                                                                                 1   1   1   1
                                                                                                 1   1   1   1
                                                                                                 1   1   1   1

                                                                                                and

                                                                                                /   3   3   3
                                                                                                3   3   3   3
                                                                                                3   3   3   3
                                                                                                3   3   3   3

                                                                                                -80.0 is the difference between * and / only. I expect -71 everywhere instead of -80. It seems that the nested for loops do not work for j > 0.
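To have a value to compare the GPU output against, the same 4x4 accumulation can be written on the host. The following plain C++ sketch mirrors the indexing `i + x + ((j + y) * 33)` from the kernel; the function name `block_diff_sum_4x4` is hypothetical, and treating 33 as the row width (`stride`) of both buffers is an assumption the post does not state. Like the kernel, it computes a signed sum of differences rather than a sum of absolute differences. The sketch starts its accumulator at zero explicitly; whether that matters for the kernel in the post is not established in this thread.

```cpp
#include <cstddef>

// Signed sum of differences over the 4x4 block whose top-left corner is at
// (column i, row j), for two row-major buffers with `stride` elements per row.
double block_diff_sum_4x4(const unsigned char* a, const unsigned char* b,
                          std::size_t i, std::size_t j, std::size_t stride)
{
    double sum = 0.0;  // accumulator starts at zero
    for (std::size_t y = 0; y < 4; ++y) {
        for (std::size_t x = 0; x < 4; ++x) {
            const std::size_t index = (i + x) + (j + y) * stride;
            sum += static_cast<double>(a[index]) - static_cast<double>(b[index]);
        }
    }
    return sum;
}
```

Evaluating it at every block origin (i and j both multiples of 4) gives one reference value per block, so a block that differs from the GPU result can be identified by its coordinates.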

                                                                                                  • Kernel function problem
                                                                                                    gaurav.garg

                                                                                                    I would recommend first trying Catalyst 9.2 to see whether that resolves your problems.

                                                                                                    • Kernel function problem
                                                                                                      gaurav.garg

                                                                                                       

                                                                                                      kernel void motion_estimation(unsigned char src[],
                                                                                                                                    unsigned char ref[],
                                                                                                                                    unsigned char str,
                                                                                                                                    unsigned char str2,
                                                                                                                                    out unsigned char o_img<>,
                                                                                                                                    out double sad<>)
                                                                                                      {
                                                                                                          // Output position
                                                                                                          int2 vPos = instance().xy;

                                                                                                          int i = vPos.x; // width
                                                                                                          int j = vPos.y; // height


                                                                                                      If your input streams src and ref are 2D streams, use [][]; otherwise it's fine.

                                                                                                        • Kernel function problem
                                                                                                          fandango

                                                                                                          My input streams are 1D. The outputs are 2D.
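With 1D inputs and a 2D output, each output position (i, j) has to be flattened into a single gather index. The helper below is a hedged sketch of that mapping with a bounds check added. The name `flatIndex` and the `length` parameter are hypothetical, and the row-major layout is an assumption based on the `i + j * 33` expression in the generated code.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical helper: map column i, row j of a row-major image that is
// `width` elements wide onto an index into a 1D buffer of `length` elements.
// Throws std::out_of_range instead of silently reading past the end.
std::size_t flatIndex(int i, int j, int width, std::size_t length)
{
    if (i < 0 || j < 0 || width <= 0 || i >= width) {
        throw std::out_of_range("position outside the image row");
    }
    const std::size_t index = static_cast<std::size_t>(i)
                            + static_cast<std::size_t>(j)
                            * static_cast<std::size_t>(width);
    if (index >= length) {
        throw std::out_of_range("flattened index past the end of the 1D buffer");
    }
    return index;
}
```

Running the CPU path through a check like this makes it easy to see whether the 1D buffer really holds width * height elements. A buffer that holds only one row would fail for every j > 0.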

                                                                                                            • Kernel function problem
                                                                                                              gaurav.garg

                                                                                                              Then it's fine. Could you post your runtime code as well?

                                                                                                                • Kernel function problem
                                                                                                                  fandango

                                                                                                                  Do you mean the generated .cpp code?


                                                                                                                  ////////////////////////////////////////////
                                                                                                                  // Generated by BRCC 1.4
                                                                                                                  // BRCC Compiled on: Mar  2 2009 13:07:15
                                                                                                                  ////////////////////////////////////////////

                                                                                                                  #include "brook/brook.h"
                                                                                                                  #include "kernel_gpu.h"
                                                                                                                  #include "kernel.h"


                                                                                                                  static __BrtInt1  __estimate_macroblock_4x4_cpu_inner(const __BrtArray<__BrtUChar1  > &mbs,
                                                                                                                                                                  const __BrtArray<__BrtUChar1  > &mbr,
                                                                                                                                                                  const __BrtInt1  &i,
                                                                                                                                                                  const __BrtInt1  &j,
                                                                                                                                                                  const __BrtUChar1  &str,
                                                                                                                                                                  __BrtUChar1  &o_img,
                                                                                                                                                                  __BrtDouble1  &sad)


                                                                                                                  {

                                                                                                                    __BrtInt1  y, x;

                                                                                                                    for (y = __BrtInt1((int)0); y < __BrtInt1((int)4); y++)
                                                                                                                    {
                                                                                                                      for (x = __BrtInt1((int)0); x < __BrtInt1((int)4); x++)
                                                                                                                      {
                                                                                                                        __BrtInt1  index = i + x + (j + y) * __BrtInt1((int)33);

                                                                                                                        sad += (__BrtDouble1 ) (mbs[index] - mbr[index]);
                                                                                                                      }

                                                                                                                    }

                                                                                                                    return __BrtInt1((int)0);
                                                                                                                  }
                                                                                                                  void  __estimate_macroblock_4x4_cpu(::brt::KernelC *__k, int __brt_idxstart, int __brt_idxend, bool __brt_isreduce)
                                                                                                                  {
                                                                                                                    __BrtArray<__BrtUChar1  > *arg_mbs = (__BrtArray<__BrtUChar1  > *) __k->getVectorElement(0);


                                                                                                                    __BrtArray<__BrtUChar1  > *arg_mbr = (__BrtArray<__BrtUChar1  > *) __k->getVectorElement(1);

                                                                                                                    __BrtInt1 *arg_i = (__BrtInt1 *) __k->getVectorElement(2);

                                                                                                                    __BrtInt1 *arg_j = (__BrtInt1 *) __k->getVectorElement(3);


                                                                                                                    __BrtUChar1 *arg_str = (__BrtUChar1 *) __k->getVectorElement(4);


                                                                                                                    ::brt::StreamInterface *arg_o_img = (::brt::StreamInterface *) __k->getVectorElement(5);


                                                                                                                    ::brt::StreamInterface *arg_sad = (::brt::StreamInterface *) __k->getVectorElement(6);
                                                                                                                   
                                                                                                                      for(int __brt_idx=__brt_idxstart; __brt_idx<__brt_idxend; __brt_idx++) {
                                                                                                                    if(!(__k->isValidAddress(__brt_idx))){ continue; }


                                                                                                                      Addressable <__BrtUChar1  > __out_arg_o_img((__BrtUChar1 *) __k->FetchElem(arg_o_img, __brt_idx));

                                                                                                                      Addressable <__BrtDouble1  > __out_arg_sad((__BrtDouble1 *) __k->FetchElem(arg_sad, __brt_idx));

                                                                                                                      __estimate_macroblock_4x4_cpu_inner (

                                                                                                                                                           *arg_mbs,

                                                                                                                                                           *arg_mbr,

                                                                                                                                                           *arg_i,

                                                                                                                                                           *arg_j,

                                                                                                                                                           *arg_str,

                                                                                                                                                           __out_arg_o_img,

                                                                                                                                                           __out_arg_sad);


                                                                                                                      *reinterpret_cast<__BrtUChar1 *>(__out_arg_o_img.address) = __out_arg_o_img.castToArg(*reinterpret_cast<__BrtUChar1 *>(__out_arg_o_img.address));

                                                                                                                      *reinterpret_cast<__BrtDouble1 *>(__out_arg_sad.address) = __out_arg_sad.castToArg(*reinterpret_cast<__BrtDouble1 *>(__out_arg_sad.address));
                                                                                                                    }
                                                                                                                  }

                                                                                                                  void  __motion_estimation_cpu_inner(const __BrtArray<__BrtUChar1  > &src,
                                                                                                                                                     const __BrtArray<__BrtUChar1  > &ref,
                                                                                                                                                     const __BrtUChar1  &str,
                                                                                                                                                     const __BrtUChar1  &str2,
                                                                                                                                                     __BrtUChar1  &o_img,
                                                                                                                                                     __BrtDouble1  &sad)
                                                                                                                  {

                                                                                                                    __BrtInt2  vPos = (indexof(o_img)).swizzle2(::brt::maskX, ::brt::maskY);

                                                                                                                    __BrtInt1  i = vPos.swizzle1(::brt::maskX);

                                                                                                                    __BrtInt1  j = vPos.swizzle1(::brt::maskY);

                                                                                                                    if (j % __BrtInt1((int)4) > __BrtInt1((int)0) || i % __BrtInt1((int)4) > __BrtInt1((int)0) || (j == __BrtInt1((int)12) || i == __BrtInt1((int)32)))

                                                                                                                    {

                                                                                                                      o_img = str;

                                                                                                                      sad = __BrtDouble1((double)1.0);
                                                                                                                    }

                                                                                                                    else
                                                                                                                    {

                                                                                                                      o_img = src[i + j * __BrtInt1((int)33)];

                                                                                                                      __estimate_macroblock_4x4_cpu_inner(src, ref, i, j, str2, o_img, sad);
                                                                                                                    }

                                                                                                                  }
                                                                                                                  void  __motion_estimation_cpu(::brt::KernelC *__k, int __brt_idxstart, int __brt_idxend, bool __brt_isreduce)
                                                                                                                  {

                                                                                                                    __BrtArray<__BrtUChar1  > *arg_src = (__BrtArray<__BrtUChar1  > *) __k->getVectorElement(0);

                                                                                                                    __BrtArray<__BrtUChar1  > *arg_ref = (__BrtArray<__BrtUChar1  > *) __k->getVectorElement(1);


                                                                                                                    __BrtUChar1 *arg_str = (__BrtUChar1 *) __k->getVectorElement(2);

                                                                                                                    __BrtUChar1 *arg_str2 = (__BrtUChar1 *) __k->getVectorElement(3);

                                                                                                                    ::brt::StreamInterface *arg_o_img = (::brt::StreamInterface *) __k->getVectorElement(4);

                                                                                                                    ::brt::StreamInterface *arg_sad = (::brt::StreamInterface *) __k->getVectorElement(5);


                                                                                                                      for(int __brt_idx=__brt_idxstart; __brt_idx<__brt_idxend; __brt_idx++) {
                                                                                                                    if(!(__k->isValidAddress(__brt_idx))){ continue; }

                                                                                                                      Addressable <__BrtUChar1  > __out_arg_o_img((__BrtUChar1 *) __k->FetchElem(arg_o_img, __brt_idx));

                                                                                                                      Addressable <__BrtDouble1  > __out_arg_sad((__BrtDouble1 *) __k->FetchElem(arg_sad, __brt_idx));

                                                                                                                      __motion_estimation_cpu_inner (
                                                                                                                                                     *arg_src,
                                                                                                                                                     *arg_ref,
                                                                                                                                                     *arg_str,
                                                                                                                                                     *arg_str2,
                                                                                                                                                     __out_arg_o_img,
                                                                                                                                                     __out_arg_sad);

                                                                                                                      *reinterpret_cast<__BrtUChar1 *>(__out_arg_o_img.address) = __out_arg_o_img.castToArg(*reinterpret_cast<__BrtUChar1 *>(__out_arg_o_img.address));

                                                                                                                      *reinterpret_cast<__BrtDouble1 *>(__out_arg_sad.address) = __out_arg_sad.castToArg(*reinterpret_cast<__BrtDouble1 *>(__out_arg_sad.address));
                                                                                                                    }
                                                                                                                  }


                                                                                                                  void __motion_estimation::operator()(const ::brook::Stream< uchar >& src,  const ::brook::Stream< uchar >& ref,
                                                                                                                          const uchar  str,
                                                                                                                          const uchar  str2,
                                                                                                                          const ::brook::Stream<  uchar >& o_img,
                                                                                                                          const ::brook::Stream<  double >& sad)
                                                                                                                  {

                                                                                                                    static const void *__motion_estimation_fp[] = {

                                                                                                                       "cal", __motion_estimation_cal,
                                                                                                                       "cpu", (void *) __motion_estimation_cpu,
                                                                                                                       NULL, NULL };

                                                                                                                    ::brook::Kernel  __k(__motion_estimation_fp, brook::KERNEL_MAP);
                                                                                                                    ::brook::ArgumentInfo __argumentInfo;

                                                                                                                    __k.PushGatherStream(src);

                                                                                                                    __k.PushGatherStream(ref);


                                                                                                                    brook::Constant<uchar > constant_2(str);
                                                                                                                    __k.PushConstant(constant_2);

                                                                                                                    brook::Constant<uchar > constant_3(str2);
                                                                                                                    __k.PushConstant(constant_3);
                                                                                                                    __k.PushOutput(o_img);

                                                                                                                    __k.PushOutput(sad);

                                                                                                                    __argumentInfo.startExecDomain = _domainOffset;
                                                                                                                    __argumentInfo.domainDimension = _domainSize;


                                                                                                                    __k.run(&__argumentInfo);
                                                                                                                    DESTROYPARAM();

                                                                                                                  }

                                                                                                                  __THREAD__ __motion_estimation motion_estimation;


                                                                                                                    • Kernel function problem
                                                                                                                      gaurav.garg

                                                                                                                      I mean the code where you declare the streams, call the kernel, and call the various operators on the streams.

                                                                                                                        • Kernel function problem
                                                                                                                          fandango

                                                                                                                          #include <cstdio>    // sprintf
                                                                                                                          #include <iostream>
                                                                                                                          #include <windows.h> // OutputDebugString

                                                                                                                          #include "brook/Stream.h"
                                                                                                                          #include "brook/KernelInterface.h"
                                                                                                                          #include "brookgenfiles/kernel.h"


                                                                                                                          void print(unsigned char* arr, int width, int height)
                                                                                                                          {
                                                                                                                              for (int i = 0; i < height; i++)
                                                                                                                              {
                                                                                                                                  for(int j = 0; j < width; j++)
                                                                                                                                  {
                                                                                                                                      char cStr[256];
                                                                                                                                      sprintf(cStr, "% 3c ", arr[j + i * width]);
                                                                                                                                      OutputDebugString(cStr);
                                                                                                                                  }

                                                                                                                                  OutputDebugString("\n");
                                                                                                                              }
                                                                                                                              OutputDebugString("\n\n");
                                                                                                                          }

                                                                                                                          void printd(double* arr, int width, int height)
                                                                                                                          {
                                                                                                                              for (int i = 0; i < height; i++)
                                                                                                                              {
                                                                                                                                  for(int j = 0; j < width; j++)
                                                                                                                                  {
                                                                                                                                      char cStr[256];
                                                                                                                                      sprintf(cStr, "% 3f ", arr[j + i * width]);
                                                                                                                                      OutputDebugString(cStr);
                                                                                                                                  }

                                                                                                                                  OutputDebugString("\n");
                                                                                                                              }
                                                                                                                              OutputDebugString("\n\n");
                                                                                                                          }

                                                                                                                          int
                                                                                                                          main(int argc, char* argv[])
                                                                                                                          {
                                                                                                                          // Specifying the width and height of the 2D buffer
                                                                                                                              const unsigned int width = 33;
                                                                                                                              const unsigned int height = 13;

                                                                                                                              //--------------------------------------------------------------------------
                                                                                                                              // Creating and initializing the input buffer
                                                                                                                              //--------------------------------------------------------------------------

                                                                                                                              // Creating an input buffer
                                                                                                                              unsigned char* src = new unsigned char[width * height];
                                                                                                                              unsigned char* ref = new unsigned char[width * height];
                                                                                                                              //memset(src, 7, width * height);

                                                                                                                              for (int i = 0; i < height; i++)
                                                                                                                              {
                                                                                                                                  for(int j = 0; j < width; j++)
                                                                                                                                  {
                                                                                                                                      if (j % 4 == 0 && i % 4 == 0)
                                                                                                                                      {
                                                                                                                                          src[j + i * width] = '*';
                                                                                                                                          ref[j + i * width] = '/';
                                                                                                                                      }
                                                                                                                                      else
                                                                                                                                      {
                                                                                                                                          src[j + i * width] = '1';
                                                                                                                                          ref[j + i * width] = '3';
                                                                                                                                      }
                                                                                                                                  }
                                                                                                                              }

                                                                                                                              print(src, width, height);
                                                                                                                             
                                                                                                                              print(ref, width, height);

                                                                                                                              // specifying the size of the 2D stream
                                                                                                                              unsigned int streamSize[] = {width, height};

                                                                                                                              // specifying the rank of the stream
                                                                                                                              unsigned int rank = 1;

                                                                                                                              brook::Stream<unsigned char> srcStream(rank, streamSize);
                                                                                                                              brook::Stream<unsigned char> refStream(rank, streamSize);

                                                                                                                              // copying data from input buffer to input stream
                                                                                                                              srcStream.read(src);
                                                                                                                              refStream.read(ref);

                                                                                                                              // creating the output stream
                                                                                                                              streamSize[0] = width;
                                                                                                                              streamSize[1] = height;
                                                                                                                              rank = 2;

                                                                                                                              brook::Stream<unsigned char> outputStream(rank, streamSize);

                                                                                                                              // creating the SAD output stream
                                                                                                                              streamSize[0] = width;
                                                                                                                              streamSize[1] = height;
                                                                                                                              rank = 2;

                                                                                                                              brook::Stream<double> sad(rank, streamSize);

                                                                                                                              //--------------------------------------------------------------------------
                                                                                                                              // Executing kernel and copying back data
                                                                                                                              //--------------------------------------------------------------------------   
                                                                                                                              unsigned char str = '9';
                                                                                                                              unsigned char str2 = '+';

                                                                                                                              double ddd = src[0] - ref[0];
                                                                                                                              double sadd = 0;

                                                                                                                              for (int y = 0; y < 4; y++)
                                                                                                                              {
                                                                                                                                  for (int x = 0; x < 4; x++)
                                                                                                                                  {
                                                                                                                                      int index = 0 + x + ((0 + y) * 33);
                                                                                                                                      sadd += (double)(src[index] - ref[index]);
                                                                                                                                  }
                                                                                                                             
                                                                                                                              }

                                                                                                                              // Calling the kernel on the input and output streams
                                                                                                                              motion_estimation(srcStream, refStream, str, str2, outputStream, sad);

                                                                                                                              // Creating an output buffer
                                                                                                                              unsigned char* out = new unsigned char[width * height];
                                                                                                                              memset(out, 0, width * height);

                                                                                                                              double *das = new double[width * height];
                                                                                                                              memset(das, 0, width * height * sizeof(double));

                                                                                                                              // Copying data from output stream to output buffer
                                                                                                                              outputStream.write(out);
                                                                                                                              sad.write(das);

                                                                                                                              print(out, width, height);

                                                                                                                              printd(das, width, height);

                                                                                                                              // Check error on stream
                                                                                                                              if(outputStream.error())
                                                                                                                              {
                                                                                                                                  // Print error Log associated to stream
                                                                                                                                  fprintf(stdout, "%s\n", outputStream.errorLog());
                                                                                                                              }

                                                                                                                              fprintf(stdout, "Output buffer:\n");
                                                                                                                          //    printBuffer(outputBuffer, width, 0, 0, 8, 8);

                                                                                                                              //--------------------------------------------------------------------------
                                                                                                                              // Checking whether the result is correct or not
                                                                                                                              //--------------------------------------------------------------------------
                                                                                                                             

                                                                                                                              //--------------------------------------------------------------------------
                                                                                                                              // Cleaning up
                                                                                                                              //--------------------------------------------------------------------------
                                                                                                                             
                                                                                                                              delete[] src;
                                                                                                                              delete[] ref;
                                                                                                                              delete[] out;
                                                                                                                              delete[] das;
                                                                                                                             
                                                                                                                              return 0;
                                                                                                                          }


                                                                                                                          kernel int estimate_macroblock_4x4(unsigned char mbs[],
                                                                                                                                                             unsigned char mbr[],
                                                                                                                                                             int i, int j,
                                                                                                                                                             unsigned char str,
                                                                                                                                                             out unsigned char o_img<>,
                                                                                                                                                                 out double sad<>)
                                                                                                                          {
                                                                                                                              //o_img = str;
                                                                                                                              int x, y;
                                                                                                                             
                                                                                                                              //sad = (double) (i + 0 + ((j + 0) * 33)) ;
                                                                                                                             
                                                                                                                              for (y = 0; y < 4; y++)
                                                                                                                              {
                                                                                                                                  for (x = 0; x < 4; x++)
                                                                                                                                  {
                                                                                                                                      int index = i + x + ((j + y) * 33);
                                                                                                                                      sad += (double)(mbs[index] - mbr[index]);
                                                                                                                                  }
                                                                                                                             
                                                                                                                              }
                                                                                                                              //o_img = str;
                                                                                                                             
                                                                                                                              return 0;
                                                                                                                          }

                                                                                                                          kernel void motion_estimation(unsigned char src[],
                                                                                                                                                        unsigned char ref[],
                                                                                                                                                        unsigned char str,
                                                                                                                                                        unsigned char str2,
                                                                                                                                                        out unsigned char o_img<>,
                                                                                                                                                            out double sad<>)
                                                                                                                          {
                                                                                                                              // Output position
                                                                                                                              int2 vPos = instance().xy;
                                                                                                                             
                                                                                                                              int i = vPos.x; // width
                                                                                                                              int j = vPos.y; // height
                                                                                                                             
                                                                                                                              if ((j % 4 > 0 || i % 4 > 0) || (j == 12 || i == 32))
                                                                                                                              {
                                                                                                                                  o_img = str;
                                                                                                                                  sad = 1.0;
                                                                                                                              }
                                                                                                                              else
                                                                                                                              {
                                                                                                                                  o_img = src[i + j * 33];
                                                                                                                                  estimate_macroblock_4x4(src, ref, i, j, str2, o_img, sad);
                                                                                                                              }
                                                                                                                          }
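The host code above leaves its "Checking whether the result is correct or not" section empty. A minimal CPU reference for one 4x4 block might look like the sketch below. The name sad4x4 and its parameters are hypothetical and not part of Brook+. Note that it sums absolute differences, whereas the kernel above sums signed differences.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical CPU reference: sum of absolute differences over the 4x4 block
// whose top-left corner is (bx, by) in two row-major images of equal size.
int sad4x4(const std::vector<unsigned char>& src,
           const std::vector<unsigned char>& ref,
           int width, int bx, int by)
{
    assert(src.size() == ref.size());
    assert(bx >= 0 && by >= 0 && bx + 4 <= width);
    assert(static_cast<std::size_t>((by + 3) * width + bx + 3) < src.size());

    int sad = 0;
    for (int y = 0; y < 4; y++)
    {
        for (int x = 0; x < 4; x++)
        {
            const std::size_t index =
                static_cast<std::size_t>((bx + x) + (by + y) * width);
            // Cast to int before subtracting so the difference can be negative.
            sad += std::abs(static_cast<int>(src[index]) - static_cast<int>(ref[index]));
        }
    }
    return sad;
}
```

The value returned for each block origin could then be compared against the corresponding element of the buffer read back from the sad stream.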


                                                                                                                            • Kernel function problem
                                                                                                                              gaurav.garg

                                                                                                                                  One thing that is definitely wrong with your test case is out-of-range indexing of the 1D input streams. Your input streams are 1D with size = width, not width * height.

                                                                                                                               

                                                                                                                              // specifying the size of the 2D stream
                                                                                                                                  unsigned int streamSize[] = {width, height};

                                                                                                                              // specifying the rank of the stream
                                                                                                                              unsigned int rank = 1;

                                                                                                                              brook::Stream<unsigned char> srcStream(rank, streamSize);
                                                                                                                              brook::Stream<unsigned char> refStream(rank, streamSize);



                                                                                                                               

                                                                                                                                  I think you want to do the following:

                                                                                                                               

                                                                                                                               

                                                                                                                                  // specifying the size of the flattened 1D stream (width * height elements)
                                                                                                                                  unsigned int streamSize[] = {width * height};

                                                                                                                              // specifying the rank of the stream
                                                                                                                              unsigned int rank = 1;

                                                                                                                              brook::Stream<unsigned char> srcStream(rank, streamSize);
                                                                                                                              brook::Stream<unsigned char> refStream(rank, streamSize);
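To see why the stream length matters, here is a small plain C++ sketch (not Brook+ code) of the row-major index the kernel computes. The helper names flatIndex and inRange are hypothetical, and the dimensions 33 x 13 are an assumption inferred from the hard-coded 33 and the i == 32, j == 12 checks in the kernel.

```cpp
#include <cassert>
#include <cstddef>

// Row-major offset of element (x, y) in a flattened image that is `width` wide.
std::size_t flatIndex(std::size_t x, std::size_t y, std::size_t width)
{
    assert(x < width);
    return x + y * width;
}

// True when `index` addresses a valid element of a 1D buffer of `length` elements.
bool inRange(std::size_t index, std::size_t length)
{
    return index < length;
}
```

With width = 33, the first element of the second row has index 33, which is already outside a 1D stream of length 33 but inside one of length 33 * 13 = 429.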



                                                                                                                                • Kernel function problem
                                                                                                                                  fandango

                                                                                                                                      Yes, you are right. That helped. Thank you.

                                                                                                                                    • Kernel function problem
                                                                                                                                      fandango

                                                                                                                                          Hello again! I decided to continue this topic with another question.

                                                                                                                                          The line if (testsad <= sad[j][ i]) produces the following error:

                                                                                                                                      error C2676: binary '<=' : '__BrtDouble1' does not define this operator or a conversion to a type acceptable to the predefined operator

                                                                                                                                      kernel void motion_estimation(unsigned char src[],
                                                                                                                                                                    unsigned char ref[],
                                                                                                                                                                    int width,
                                                                                                                                                                    int height,
                                                                                                                                                                    out double sad[][])


                                                                                                                                          What is the problem here? Can I not read elements from an output array?

                                                                                                                                        • Kernel function problem
                                                                                                                                          gaurav.garg

                                                                                                                                              It seems sad is a 2D scatter stream. Shouldn't you index it with 2D indices?

                                                                                                                                            • Kernel function problem
                                                                                                                                              fandango

                                                                                                                                                  No, of course I use [] []; it is a forum problem. The second pair of brackets was removed for some unknown reason. Putting a space after '[' works around it.

                                                                                                                                                  OK. Any other ideas?

                                                                                                                                                • Kernel function problem
                                                                                                                                                  gaurav.garg

                                                                                                                                                      Could you post the data type of testsad? These are template errors from the CPU runtime, and they don't show up on all versions of gcc.

                                                                                                                                                      I would suggest disabling CPU backend code generation to resolve these issues. You can compile the .br file with the -p cal option to disable CPU codegen.

                                                                                                                                                    • Kernel function problem
                                                                                                                                                      fandango

                                                                                                                                                      the type is double.

                                                                                                                                                      i use mvc. i will try you advice

                                                                                                                                                        • Kernel function problem
                                                                                                                                                          fandango

                                                                                                                                                          Do you mean like this?

                                                                                                                                                          mkdir brookgenfiles | "$(BROOKROOT)\sdk\bin\brcc_d.exe" -p cal -o "$(ProjectDir)\brookgenfiles\$(InputName)" "$(InputPath)"

                                                                                                                                                              That helped. Thanks.

                                                                                                                                                              Can I install the latest ATI drivers? I mean, did you fix the problem described at the beginning of the thread (memory garbage)?

                                                                                                                                                          Thanks

                                                                                                                                                            • Kernel function problem
                                                                                                                                                              fandango

                                                                                                                                                                  An additional remark regarding the Brook compiler.

                                                                                                                                                                  The expression:

                                                                                                                                                                  double sad = (double)(abs(src[idx] - ref[idx])), where src and ref are unsigned char [], causes an overflow.

                                                                                                                                                                  The correct variant is:

                                                                                                                                                              double sad = (double)(abs((((int)src[idx]) - ((int)ref[idx]))))

                                                                                                                                                                  But I believe the compiler should automatically convert this to an integer operation.
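For comparison, standard C and C++ promote unsigned char operands to int before subtracting, which is presumably the behavior expected here. The plain C++ sketch below (not Brook+ code) emulates a subtraction kept in 8 bits to show the wraparound, next to the variant with explicit int casts. The function names diffNoPromotion and absDiff are hypothetical, and whether the Brook+ compiler actually evaluates the expression this way is an assumption based on the report above.

```cpp
#include <cstdlib>

// Emulates a subtraction whose result is kept in 8 bits (no promotion to int):
// a negative difference wraps around modulo 256.
unsigned char diffNoPromotion(unsigned char a, unsigned char b)
{
    return static_cast<unsigned char>(a - b);
}

// Absolute difference computed in int, as in the corrected expression.
int absDiff(unsigned char a, unsigned char b)
{
    return std::abs(static_cast<int>(a) - static_cast<int>(b));
}
```

For the test data used earlier in the thread, '1' (49) minus '3' (51) wraps to 254 in 8 bits, while the int version gives the intended absolute difference of 2.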

                                                                                                                                                                • Kernel function problem
                                                                                                                                                                  fandango

                                                                                                                                                                      An additional question.

                                                                                                                                                                      Can I pass more than one output buffer?

                                                                                                                                                                      As I understand it, the output buffer defines the domain of execution, so a kernel can use only one output stream. Is that right?

                                                                                                                                                                      I need an additional array with the same size as the output stream. Like this:

                                                                                                                                                                  kernel void motion_estimation(unsigned char src[],
                                                                                                                                                                                                unsigned char ref[],
                                                                                                                                                                                                int width,
                                                                                                                                                                                                int height,

                                                                                                                                                                                                int mv[][], // additional buffer
                                                                                                                                                                                                out double sad[][])

                                                                                                                                                                    • Kernel function problem
                                                                                                                                                                      gaurav.garg

                                                                                                                                                                      You can use multiple regular output streams, but multiple scatter streams are not supported.

                                                                                                                                                                        • Kernel function problem
                                                                                                                                                                          fandango

                                                                                                                                                                          Could you give me an example, please? Does it affect performance?

                                                                                                                                                                            • Kernel function problem
                                                                                                                                                                              gaurav.garg

                                                                                                                                                                              kernel void multiple_output(out float o0<>, out float4 o1<>) // valid; I would expect good performance, since this increases the compute intensity of the kernel compared to calling two kernels with a single output stream each

                                                                                                                                                                              kernel void multiple_scatter(out float o0[], out float4 o1[]) // not supported

                                                                                                                                                                              kernel void mix_output(out float o0[], out float4 o1<>) // supported, but the computation is done in multiple passes, so performance is similar to calling two kernels with a single output stream each

                                                                                                                                                                                • Kernel function problem
                                                                                                                                                                                  fandango

                                                                                                                                                                                  Ok. Thanks.

                                                                                                                                                                                  Are these two chunks of code equivalent?

                                                                                                                                                                                  kernel void motion_estimation(unsigned char src[],
                                                                                                                                                                                                                unsigned char ref[],
                                                                                                                                                                                                                int width,
                                                                                                                                                                                                                int height,
                                                                                                                                                                                                                out double sad[][])
                                                                                                                                                                                  {
                                                                                                                                                                                      // Output position
                                                                                                                                                                                      int2 vPos = instance().xy;
                                                                                                                                                                                     
                                                                                                                                                                                      int i = vPos.x; // width
                                                                                                                                                                                      int j = vPos.y; // height

                                                                                                                                                                                     if (i % 16 == 0 && j % 16 == 0)

                                                                                                                                                                                      sad[j][ i] = 1.0;

                                                                                                                                                                                  }


                                                                                                                                                                                  and


                                                                                                                                                                                  kernel void motion_estimation(unsigned char src[],
                                                                                                                                                                                                                unsigned char ref[],
                                                                                                                                                                                                                int width,
                                                                                                                                                                                                                int height,
                                                                                                                                                                                                                out double sad<>)
                                                                                                                                                                                  {
                                                                                                                                                                                      // Output position
                                                                                                                                                                                      int2 vPos = instance().xy;
                                                                                                                                                                                     
                                                                                                                                                                                      int i = vPos.x; // width
                                                                                                                                                                                      int j = vPos.y; // height

                                                                                                                                                                                     if (i % 16 == 0 && j % 16 == 0)

                                                                                                                                                                                      sad = 1.0;

                                                                                                                                                                                  }


                                                                                                                                                                                    • Kernel function problem
                                                                                                                                                                                      fandango

                                                                                                                                                                                      The next question.

                                                                                                                                                                                      The -p cal option helps to avoid the template compile errors, but unfortunately it gets in the way of debugging the program: the values returned in the out stream are corrupted when I compile the program with -p cal. As soon as I remove -p cal and rebuild the project, this problem goes away, but the template errors return.

                                                                                                                                                                                      What do you advise?

                                                                                                                                                                                      • Kernel function problem
                                                                                                                                                                                        gaurav.garg

                                                                                                                                                                                        Yes, the two kernels above are equivalent, and the second one should have much better performance. Scatter streams are meant for random writes; if you always write to the instance() position, it's better to use a regular output stream.

                                                                                                                                                                                        -p cal disables CPU backend codegen, so as long as you are not running your code in CPU emulation mode, everything should be fine. Make sure you have not set the environment variable BRT_RUNTIME=cpu.

                                                                                                                                                                                          • Kernel function problem
                                                                                                                                                                                            fandango

                                                                                                                                                                                            Gaurav,

                                                                                                                                                                                            I wonder, what is advantage of using cpu emulator? I understand that instructions are executed on cpu. And what...? Anyway I can not to enter kernel and debugging inside.

                                                                                                                                                                                              • Kernel function problem
                                                                                                                                                                                                gaurav.garg

                                                                                                                                                                                                The purpose of CPU backend code is for debugging only. You can debug inside kernel if you disable line generation in cpp file (use -nl option)

                                                                                                                                                                                                  • Kernel function problem
                                                                                                                                                                                                    fandango

                                                                                                                                                                                                    Gaurav,

                                                                                                                                                                                                    BRT_RUNTIME = cal

                                                                                                                                                                                                    mkdir brookgenfiles | "$(BROOKROOT)\sdk\bin\brcc_d.exe" -p cal -o "$(ProjectDir)\brookgenfiles\$(InputName)" "$(InputPath)"

                                                                                                                                                                                                    The output is still broken. Why?

                                                                                                                                                                                                      • Kernel function problem
                                                                                                                                                                                                        gaurav.garg

                                                                                                                                                                                                        That is strange. Are you sure it works without -p cal option? I mean how did you test it with template error? Could you post the test case?

                                                                                                                                                                                                          • Kernel function problem
                                                                                                                                                                                                            fandango

                                                                                                                                                                                                            I just used a simple test in this case.

                                                                                                                                                                                                             

                                                                                                                                                                                                            kernel void motion_estimation(unsigned char src[],
                                                                                                                                                                                                                                          unsigned char ref[],
                                                                                                                                                                                                                                          int width,
                                                                                                                                                                                                                                          int height,
                                                                                                                                                                                                                                          out double sad<>,
                                                                                                                                                                                                                                          out int mvx<>,
                                                                                                                                                                                                                                          out int mvy<>)
                                                                                                                                                                                                            {
                                                                                                                                                                                                                // Output position
                                                                                                                                                                                                                int2 vPos = instance().xy;
                                                                                                                                                                                                               
                                                                                                                                                                                                                int i = vPos.x; // width
                                                                                                                                                                                                                int j = vPos.y; // height
                                                                                                                                                                                                               
                                                                                                                                                                                                                int ix = i * 16;
                                                                                                                                                                                                                int jy = j * 16;

                                                                                                                                                                                                               sad = 2.0;

                                                                                                                                                                                                            }

                                                                                                                                                                                                             

                                                                                                                                                                                                            That's it. So, no templates. With -p cal option output sad contains garbage. With -p cpu everything is ok (2.0 value).

                                                                                                                                                                                                             

                                                                                                                                                                                                              • Kernel function problem
                                                                                                                                                                                                                gaurav.garg

                                                                                                                                                                                                                Something is going wrong. It seems you are running your code under CPU backend. Make sure you close your visual studo or command prompt after changing environment variable and then open it again to read the updated env variable.

                                                                                                                                                                                                                  • Kernel function problem
                                                                                                                                                                                                                    fandango

                                                                                                                                                                                                                    You are right. I tried to restart VS, but it did not help.

                                                                                                                                                                                                                    The windows restart helps.
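A process inherits its environment from whatever launched it, so a stale BRT_RUNTIME value can survive until the parent process is restarted. A small check at program start shows what the running process actually sees. This is a minimal sketch in plain C, not part of Brook+; the function names are hypothetical, and the exact string comparison against "cpu" is an assumption about how the value is spelled.

```c
#include <stdlib.h>
#include <string.h>

/* Returns 1 if the given BRT_RUNTIME value selects the CPU backend.
 * Assumption: the value is compared exactly against "cpu". */
int selects_cpu_backend(const char *value)
{
    return value != NULL && strcmp(value, "cpu") == 0;
}

/* Reads BRT_RUNTIME as seen by the current process. */
int process_uses_cpu_backend(void)
{
    return selects_cpu_backend(getenv("BRT_RUNTIME"));
}
```

Calling process_uses_cpu_backend() before the first kernel launch and printing the result would have shown right away that the program was still seeing the old setting.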

                                                                                                                                                                                                                      • Kernel function problem
                                                                                                                                                                                                                        fandango

                                                                                                                                                                                                                        if ((sad >= testsad) && (mvlength > abs(y) + abs(x)))
                                                                                                                                                                                                                        {
                                                                                                                                                                                                                                 sad = testsad;
                                                                                                                                                                                                                                 mvlength = abs(y) + abs(x);
                                                                                                                                                                                                                                 mvy = y;
                                                                                                                                                                                                                                 mvx = x;
                                                                                                                                                                                                                        }

                                                                                                                                                                                                                        ERROR--1: In Binary expression: Mismatched operands: both must have same type and same number of components
                                                                                                                                                                                                                        1> Statement: sad >= testsad && mvlength > abs(y) + abs(x) in sad >= testsad && mvlength > abs(y) + abs(x)
                                                                                                                                                                                                                        1> Expression : sad >= testsad, Type : double
                                                                                                                                                                                                                        1> Expression : mvlength > abs(y) + abs(x), Type : int

                                                                                                                                                                                                                         

                                                                                                                                                                                                                        Try to do this : if (((int)sad >= (int)testsad) && (mvlength > abs(y) + abs(x))) and this condition never is true.

                                                                                                                                                                                                                         

                                                                                                                                                                                                                         

                                                                                                                                                                                                                         

                                                                                                                                                                                                                         

                                                                                                                                                                                                                          • Kernel function problem
                                                                                                                                                                                                                            gaurav.garg

                                                                                                                                                                                                                            brcc returns the same type from conditional expressions as of operands. You can try this-

                                                                                                                                                                                                                            if ((int)(sad >= testsad) && (mvlength > abs(y) + abs(x)))

                                                                                                                                                                                                                              • Kernel function problem
                                                                                                                                                                                                                                fandango

                                                                                                                                                                                                                                No, I suppose your variant is incorrect. I have checked.

                                                                                                                                                                                                                                The correct:

                                                                                                                                                                                                                                if (((int)sad >= (int)testsad) && (mvlength > abs(y) + abs(x)))

                                                                                                                                                                                                                                I just figured it out.

                                                                                                                                                                                                                                 

                                                                                                                                                                                                                                 

                                                                                                                                                                                                                                  • Kernel function problem
                                                                                                                                                                                                                                    gaurav.garg

                                                                                                                                                                                                                                    But, it would cause a conversion of sad and testsad before checking the condition and can produce incorrect results
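To illustrate the point about converting before comparing, here is a minimal sketch in plain C (not Brook+). It assumes, purely for illustration, that sad and testsad are floats; the function names are hypothetical. If both values are already integers, as testsad is in the kernel posted later in this thread, the two forms agree.

```c
#include <stdlib.h>

/* Cast applied to the result of the comparison:
 * sad and testsad are compared with their original (float) values. */
int accept_cast_result(float sad, float testsad, int mvlength, int x, int y)
{
    return (int)(sad >= testsad) && (mvlength > abs(y) + abs(x));
}

/* Cast applied to each operand first:
 * the fractional parts are truncated before the comparison happens. */
int accept_cast_operands(float sad, float testsad, int mvlength, int x, int y)
{
    return ((int)sad >= (int)testsad) && (mvlength > abs(y) + abs(x));
}
```

With sad = 10.2f and testsad = 10.7f the first form rejects the candidate, while the second accepts it because both values truncate to 10.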

                                                                                                                                                                                                                                      • Kernel function problem
                                                                                                                                                                                                                                        fandango

                                                                                                                                                                                                                                        Hmm, I have been trying for many days to understand what is going on with my kernel code.

                                                                                                                                                                                                                                        Perhaps you can help me. Are these two code chunks equivalent? I mean in their logic.

                                                                                                                                                                                                                                        This chunk I execute on the CPU:

                                                                                                                                                                                                                                            int xleft = 0, xright = 16;
                                                                                                                                                                                                                                            int ytop = 0, ybottom = 16;

                                                                                                                                                                                                                                            int temp = 0;
                                                                                                                                                                                                                                            int mvlength = 100000000;

                                                                                                                                                                                                                                            for (int j = 0; j < height; j += 16)
                                                                                                                                                                                                                                            {
                                                                                                                                                                                                                                                for (int i = 0; i < width; i += 16)
                                                                                                                                                                                                                                                {
                                                                                                                                                                                                                                                    // set top and bottom range
                                                                                                                                                                                                                                                    ytop = - min(j, 16);
                                                                                                                                                                                                                                                    ybottom = min(height - 16 - j + 1, 16);

                                                                                                                                                                                                                                                    // set left and right range
                                                                                                                                                                                                                                                    xleft = - min(i, 16);
                                                                                                                                                                                                                                                    xright = min(width - 16 - i + 1, 16);

                                                                                                                                                                                                                                                    refsad[l] = 100000000;

                                                                                                                                                                                                                                                    for (int y = ytop; y < ybottom; y++)
                                                                                                                                                                                                                                                    {
                                                                                                                                                                                                                                                        for (int x = xleft; x < xright; x++)
                                                                                                                                                                                                                                                        {
                                                                                                                                                                                                                                                            int srcidx = i + (j * width);

                                                                                                                                                                                                                                                            int index = i + x + ((j + y) * width);

                                                                                                                                                                                                                                                            // calculate SAD
                                                                                                                                                                                                                                                            //--------------------------------
                                                                                                                                                                                                                                                            for (m = 0; m < 16; m++)
                                                                                                                                                                                                                                                            {
                                                                                                                                                                                                                                                                for (n = 0; n < 16; n++)
                                                                                                                                                                                                                                                                {
                                                                                                                                                                                                                                                                    temp += abs((src[srcidx + n] - ref[index + n]));
                                                                                                                                                                                                                                                                }

                                                                                                                                                                                                                                                                srcidx += width;
                                                                                                                                                                                                                                                                index += width;
                                                                                                                                                                                                                                                            }
                                                                                                                                                                                                                                                            //-------------------------------

                                                                                                                                                                                                                                                            if ((refsad[l] >= temp) && (mvlength > abs(x) + abs(y)))
                                                                                                                                                                                                                                                            {
                                                                                                                                                                                                                                                                refsad[l] = temp;
                                                                                                                                                                                                                                                                mvlength = abs(x) + abs(y);
                                                                                                                                                                                                                                                                refmvx[l] = x;
                                                                                                                                                                                                                                                                refmvy[l] = y;

                                                                                                                                                                                                                                                                refmvl[l] = x;
                                                                                                                                                                                                                                                            }

                                                                                                                                                                                                                                                            temp = 0;
                                                                                                                                                                                                                                                        }
                                                                                                                                                                                                                                                    }

                                                                                                                                                                                                                                                    l++;
                                                                                                                                                                                                                                                    mvlength = 100000000;
                                                                                                                                                                                                                                                }
                                                                                                                                                                                                                                            }

                                                                                                                                                                                                                                        And this one runs as the kernel:

                                                                                                                                                                                                                                                int ytop = - min(jy, 16);
                                                                                                                                                                                                                                                int ybottom = min(height - 16 - jy + 1, 16);

                                                                                                                                                                                                                                                // set left and right range
                                                                                                                                                                                                                                                int xleft = - min(ix, 16);
                                                                                                                                                                                                                                                int xright = min(width - 16 - ix + 1, 16);
                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                int x, y;
                                                                                                                                                                                                                                                int m, n;
                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                int mvlength = 100000000;
                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                sad = 100000000;

                                                                                                                                                                                                                                                for (y = ytop; y < ybottom; y++)
                                                                                                                                                                                                                                                {
                                                                                                                                                                                                                                                    for (x = xleft; x < xright; x++)
                                                                                                                                                                                                                                                    {
                                                                                                                                                                                                                                                        int testsad = 0;
                                                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                        int srcidx = ix + (jy * width);
                                                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                        int idx = ix + x + ((jy + y) * width);
                                                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                        for (m = 0; m < 16; m++)
                                                                                                                                                                                                                                                        {
                                                                                                                                                                                                                                                            for (n = 0; n < 16; n++)
                                                                                                                                                                                                                                                            {               
                                                                                                                                                                                                                                                                testsad += (abs((((int)src[srcidx + n]) - ((int)ref[idx + n]))));   
                                                                                                                                                                                                                                                            }
                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                            srcidx += width;
                                                                                                                                                                                                                                                            idx += width;
                                                                                                                                                                                                                                                        }
                                                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                        if ((sad >= testsad) && (mvlength > (abs(y) + abs(x))))
                                                                                                                                                                                                                                                        {
                                                                                                                                                                                                                                                            sad = testsad;
                                                                                                                                                                                                                                                            mvlength = (abs(y) + abs(x));
                                                                                                                                                                                                                                                            mvy = y;
                                                                                                                                                                                                                                                            mvx = x;
                                                                                                                                                                                                                                                            mvl = mvlength;
                                                                                                                                                                                                                                                        }
                                                                                                                                                                                                                                                    }
                                                                                                                                                                                                                                                }
                                                                                                                                                                                                                                            }

                                                                                                                                                                                                                                         

                                                                                                                                                                                                                                        Do you think the logic is the same? I get different results in mvx and mvy. Perhaps you can see a mistake in the kernel code, because I expect exactly the same behaviour.

                                                                                                                                                                                                                                        I think the problem is in the last if statement:

                                                                                                                                                                                                                                        if ((sad >= testsad) && (mvlength > (abs(y) + abs(x))))

                                                                                                                                                                                                                                        If you need additional code, let me know.
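One way to compare the two versions block by block is to put the per-block search into a single plain C function and run it on the CPU for each (ix, jy). The sketch below is not the original code: the names BlockResult, min_int, sad16 and search_block are hypothetical, and it assumes 8-bit images stored row by row. It follows the kernel version, which stores mvlength in mvl, whereas the CPU chunk above stores x in refmvl[l].

```c
#include <assert.h>
#include <stdlib.h>

/* Result of the search for one 16x16 block (hypothetical helper type). */
typedef struct {
    int sad;  /* last accepted sum of absolute differences */
    int mvx;  /* accepted horizontal offset */
    int mvy;  /* accepted vertical offset */
    int mvl;  /* |mvx| + |mvy| of the accepted offset */
} BlockResult;

static int min_int(int a, int b) { return a < b ? a : b; }

/* SAD between the 16x16 block of src starting at srcidx
 * and the 16x16 block of ref starting at idx. */
static int sad16(const unsigned char *src, const unsigned char *ref,
                 int srcidx, int idx, int width)
{
    int total = 0;
    for (int m = 0; m < 16; m++) {
        for (int n = 0; n < 16; n++)
            total += abs((int)src[srcidx + n] - (int)ref[idx + n]);
        srcidx += width;
        idx += width;
    }
    return total;
}

/* Search for the block whose top-left corner is (ix, jy),
 * using the same ranges and acceptance test as the posted code. */
BlockResult search_block(const unsigned char *src, const unsigned char *ref,
                         int width, int height, int ix, int jy)
{
    /* The block itself must lie inside the image. */
    assert(ix >= 0 && jy >= 0 && ix + 16 <= width && jy + 16 <= height);

    int ytop = -min_int(jy, 16);
    int ybottom = min_int(height - 16 - jy + 1, 16);
    int xleft = -min_int(ix, 16);
    int xright = min_int(width - 16 - ix + 1, 16);

    BlockResult best = { 100000000, 0, 0, 0 };
    int mvlength = 100000000;

    for (int y = ytop; y < ybottom; y++) {
        for (int x = xleft; x < xright; x++) {
            int testsad = sad16(src, ref, ix + jy * width,
                                ix + x + (jy + y) * width, width);

            /* Accept only if the SAD is not worse AND the offset is shorter. */
            if ((best.sad >= testsad) && (mvlength > abs(y) + abs(x))) {
                best.sad = testsad;
                mvlength = abs(y) + abs(x);
                best.mvy = y;
                best.mvx = x;
                best.mvl = mvlength;
            }
        }
    }
    return best;
}
```

Calling search_block for every ix and jy that is a multiple of 16 and comparing sad, mvx and mvy with the values the kernel writes for the same block should show at which block the two results first diverge.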