4 Replies Latest reply on Jun 6, 2013 12:51 AM by himanshu.gautam

    OpenCL kernel memory optimizations


      Hi all.


      I'm implementing an integer Discrete Cosine Transform (iDCT) kernel in OpenCL 1.1.

      The code looks like this:


      typedef unsigned char pixel;

      __kernel __attribute__(( reqd_work_group_size(8, 8, 1) ))
      void cl_iDCT(
          __global pixel*       current_frame,
          __global const short* coefficients)
      {
          // A vector literal needs the (uint2) cast; a bare (a, b) is the comma operator
          const uint2
              position_global = (uint2)(get_global_id(0), get_global_id(1)),
              position_local  = (uint2)(get_local_id(0),  get_local_id(1));
          const unsigned int
              plain_position_global = position_global.x + position_global.y * get_global_size(0);
          // Local storage for iDCT transformation temporary coefficients
          __local int pool[8][8];
          // Copy coefficients
          pool[position_local.y][position_local.x] = coefficients[plain_position_global];
          // Make every work item's coefficient visible to the whole work group
          barrier(CLK_LOCAL_MEM_FENCE);
          /*
           * Do iDCT transformation, save result in pool[WI_Y][WI_X]
           * (WI_Y, WI_X are assumed here to be this work item's local y and x)
           */
          current_frame[plain_position_global] +=
              convert_uchar_sat(pool[position_local.y][position_local.x]);
      }


      This kernel processes a Full HD frame (1920 * 1088 pixels, to be precise), with one work item performing the transformation for one pixel.


      The input parameter is an array of iDCT coefficients of size 1920 * 1088.

      The input-output parameter is a picture of the same dimensions.
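The launch geometry implied by this description can be sketched in plain C (host-side arithmetic only, no OpenCL calls). The names `FRAME_WIDTH`, `FRAME_HEIGHT`, `GROUP_SIDE`, `groups_x`, `groups_y` and `linear_index` are hypothetical and do not appear in the kernel; only the index formula is taken from it.

```c
#include <stddef.h>

/* Frame and work-group dimensions as stated in the post. */
enum { FRAME_WIDTH = 1920, FRAME_HEIGHT = 1088, GROUP_SIDE = 8 };

/* Number of 8x8 work groups along each axis (hypothetical helpers). */
static size_t groups_x(void) { return FRAME_WIDTH / GROUP_SIDE; }
static size_t groups_y(void) { return FRAME_HEIGHT / GROUP_SIDE; }

/* Same formula as plain_position_global in the kernel:
 * x + y * get_global_size(0), with a global size of FRAME_WIDTH. */
static size_t linear_index(size_t x, size_t y)
{
    return x + y * FRAME_WIDTH;
}
```

With this layout, the 8 work items in one row of a work group touch 8 consecutive bytes of `current_frame`, and consecutive rows of the same group are 1920 bytes apart.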


      I used a plain buffer instead of clImage2D, because I need read-write access to the buffer.

      The kernel works as desired. Using the event profiler, I found that the achieved bandwidth is about 5 Gb/s, which looks poor for a Radeon 6770.
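How the bandwidth figure was derived is not stated above. A minimal sketch of one common way to compute it, assuming the timestamps come from `CL_PROFILING_COMMAND_START` and `CL_PROFILING_COMMAND_END` (both in nanoseconds) and that each pixel costs 4 bytes of global traffic (one `short` read, one pixel read, one pixel write). The function name and the 4-byte figure are assumptions, not measurements.

```c
#include <stdint.h>

/* Assumed global-memory traffic per pixel for the kernel as posted:
 * 2 bytes (short coefficient read) + 1 byte (pixel read) + 1 byte (pixel write). */
enum { BYTES_PER_PIXEL_TRAFFIC = 4 };

/* Effective bandwidth in GB/s (10^9 bytes per second), given the pixel count
 * and the kernel's start/end timestamps in nanoseconds.
 * Returns 0.0 if the interval is empty or inverted. */
static double effective_bandwidth_gbs(uint64_t pixels,
                                      uint64_t start_ns,
                                      uint64_t end_ns)
{
    if (end_ns <= start_ns)
        return 0.0;
    double bytes = (double)pixels * BYTES_PER_PIXEL_TRAFFIC;
    double nanoseconds = (double)(end_ns - start_ns);
    return bytes / nanoseconds; /* bytes per nanosecond equals GB/s */
}
```

Under these assumptions, a 1920 * 1088 frame moves 8,355,840 bytes, so a figure of 5 GB/s would correspond to a kernel time of roughly 1.67 ms.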

      The bottleneck, AFAIK, is writing single bytes, which is too small a chunk of data.

      So the question is: can the compiler coalesce the memory accesses in this particular case and write a chunk of 8 bytes in a row?
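To make the "8 bytes in a row" idea concrete, here is a minimal host-side sketch in plain C of what an explicit 8-byte update would do if one work item handled 8 horizontally adjacent pixels (in OpenCL C this would typically be expressed with `uchar8` and `vload8`/`vstore8`). The names `sat_u8` and `add_residual_row8` are hypothetical, and whether such a change improves throughput on this GPU is untested.

```c
#include <stdint.h>
#include <string.h>

/* Clamp an int to [0, 255], mirroring convert_uchar_sat in OpenCL C. */
static uint8_t sat_u8(int v)
{
    if (v < 0) return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}

/* Update 8 adjacent pixels using one 8-byte read and one 8-byte write.
 * Mirrors the posted kernel's `pixel += convert_uchar_sat(value)`,
 * including the wrap-around of the unsigned char addition. */
static void add_residual_row8(uint8_t *pixels, const int *values)
{
    uint64_t packed;
    uint8_t row[8];

    memcpy(&packed, pixels, sizeof packed);  /* single 8-byte read */
    memcpy(row, &packed, sizeof row);
    for (int i = 0; i < 8; ++i)
        row[i] = (uint8_t)(row[i] + sat_u8(values[i]));
    memcpy(&packed, row, sizeof packed);
    memcpy(pixels, &packed, sizeof packed);  /* single 8-byte write */
}
```

The design point is that each row of 8 pixels is read and written as one unit, so the width of the memory transaction no longer depends on what the compiler decides to merge.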


      System spec:

      GPU - AMD 6770

      Ubuntu 12.04 x64

      OpenCL 1.1