1 Reply Latest reply on May 25, 2011 11:21 AM by himanshu.gautam

    Speed problem with dilation

    Tasp

      Hello,

      I'm trying to implement an optimized binary dilation on the GPU.
      My setup is Ubuntu 10.04 64-bit, SDK 2.4, HD 5850.

      First of all I wanted to copy all the data for a workgroup into local memory.
      Code basically looks like this:

      __kernel __attribute__((reqd_work_group_size(16, 16, 1)))
      void dilate_test(const __global uchar* const pImgIn, __global uchar* pImgOut, const int width)
      {

      // initialization
      ...

      __local uchar buffer[bufferWidth * bufferHeight];
      for (int a = 0; a < bufferHeight; ++a)
      {
          for (int b = 0; b < bufferWidth / 16; ++b)
          {
              vstore16(vload16(b, pImgIn + idx), b, buffer + idxBuffer);
          }

          // copy the rest of row with single uchars
          ...
      }
      mem_fence(CLK_LOCAL_MEM_FENCE);
      // end of kernel for testing
      }



      I profile against OpenCV with a 5 x 5 kernel and an image of (4096+4) x (4096+4) pixels. The workgroup size is 16 x 16.

      The timings I get are:
      0.059553 OpenCL
      0.015248 OpenCV

      Note that the OpenCL kernel here only copies to local memory, while OpenCV performs the full dilation.
      The OpenCL time covers only the span from enqueueNDRangeKernel to commandQueue.finish(); no upload/download is included. I also execute the kernel 30 times to "warm up" the environment, as I have read here a few times.
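
      For scale, the timings above can be converted into an effective throughput. This is a rough sketch under assumptions not stated in the post: the timings are in seconds, the image is single-channel 8-bit (one byte per pixel, as the uchar pointers suggest), and each input byte is counted once. The function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Size in bytes of a single-channel 8-bit image (assumed: 1 byte per pixel). */
size_t image_bytes(size_t width, size_t height)
{
    return width * height;
}

/* Effective throughput in MB/s (1 MB = 1e6 bytes) for processing
 * `bytes` of input in `seconds`. */
double throughput_mb_per_s(size_t bytes, double seconds)
{
    assert(seconds > 0.0);  /* a zero or negative time is meaningless here */
    return (double)bytes / seconds / 1e6;
}
```

      With the numbers from the post, throughput_mb_per_s(image_bytes(4100, 4100), 0.059553) comes out at roughly 282 MB/s for the OpenCL copy, against roughly 1102 MB/s for the full OpenCV dilation.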

      Is my code wrong (performance-wise)? Or is my OpenCL setup somehow not working correctly?

      update

      After running several samples, I am fairly sure that the OpenCL setup is OK.

        • Speed problem with dilation
          himanshu.gautam

          OpenCL is not a good fit when there is not enough computation to do on the GPU. Global memory fetches are generally very time-consuming.

          I am not well versed in OpenCV, so I can only guess.

          My guess is that OpenCV uses different access patterns (image-like), which can be faster than the linear access you use in OpenCL. I suggest trying images in OpenCL; it might help.
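
          One more thing that may be worth checking in the posted kernel: if idx and idxBuffer do not depend on the local ID (the initialization is elided, so this is a guess), every one of the 256 work-items copies the whole tile, and the group issues 256 times more global reads than it needs. Also, mem_fence only orders the memory operations of the calling work-item; making the tile visible to the whole group takes barrier(CLK_LOCAL_MEM_FENCE). Below is a minimal sketch of a cooperative load, simulated on the host in plain C so it can be run without a device. The 20 x 20 tile (16 + 4 halo pixels for a 5 x 5 kernel) and all function names are assumptions, not taken from the original code.

```c
#include <assert.h>

enum {
    GROUP_W = 16, GROUP_H = 16,          /* work-group size from the post */
    GROUP_SIZE = GROUP_W * GROUP_H,
    HALO = 4,                            /* assumed: 5 x 5 kernel, 2 pixels per side */
    TILE_W = GROUP_W + HALO,
    TILE_H = GROUP_H + HALO,
    TILE_SIZE = TILE_W * TILE_H
};

/* Host-side simulation of one work-group filling its tile cooperatively.
 * The simulated work-item `lid` copies tile elements lid, lid + GROUP_SIZE, ...
 * so each element is read from the image exactly once per group.
 * Returns the total number of image reads issued by the group. */
int cooperative_tile_load(const unsigned char *img, int img_width,
                          int tile_x0, int tile_y0,
                          unsigned char tile[TILE_SIZE])
{
    int loads = 0;
    assert(tile_x0 >= 0 && tile_y0 >= 0);
    assert(tile_x0 + TILE_W <= img_width);   /* tile must fit inside a row */

    for (int lid = 0; lid < GROUP_SIZE; ++lid) {          /* one pass per work-item */
        for (int i = lid; i < TILE_SIZE; i += GROUP_SIZE) {
            int ty = i / TILE_W;                          /* row inside the tile */
            int tx = i % TILE_W;                          /* column inside the tile */
            tile[i] = img[(tile_y0 + ty) * img_width + (tile_x0 + tx)];
            ++loads;
        }
    }
    /* In a real kernel, barrier(CLK_LOCAL_MEM_FENCE) would follow here,
     * before any work-item reads tile entries written by another one. */
    return loads;
}

/* Image reads issued if every work-item copies the whole tile by itself. */
int redundant_tile_loads(void)
{
    return GROUP_SIZE * TILE_SIZE;
}
```

          In this simulation the cooperative version issues one read per tile element (400 per group), while the every-work-item-copies-everything pattern issues 256 * 400 = 102400. In an actual kernel, the outer loop disappears and lid becomes get_local_id(1) * 16 + get_local_id(0).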