4 Replies Latest reply on Jun 6, 2013 12:51 AM by himanshu.gautam

    OpenCL kernel memory optimizations

    roman.arzum

      Hi all.

       

      I'm implementing an integer Discrete Cosine Transform (iDCT) kernel in OpenCL 1.1.

      The code looks like this:

       

      typedef unsigned char pixel;
      
      __kernel __attribute__(( reqd_work_group_size(8, 8, 1) ))
      void cl_iDCT(
          __global pixel*       current_frame,
          __global const short* coefficients)
      {
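          //Work-item coordinates in the frame and inside the 8x8 block
          const uint2
              position_global = (uint2)((uint)get_global_id(0), (uint)get_global_id(1)),
              position_local  = (uint2)((uint)get_local_id(0),  (uint)get_local_id(1));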
         
          const unsigned int
              plain_position_global = position_global.x + position_global.y * get_global_size(0);
            
          //Local storage for iDCT transformation temporary coefficients
          __local int pool[8][8];
         
          //Copy coefficients
          pool[position_local.y][position_local.x] = coefficients[plain_position_global];
      
          barrier(CLK_LOCAL_MEM_FENCE);
      
          /*
           *Do iDCT transformation, save result in pool[position_local.y][position_local.x]
           */
      
          barrier(CLK_LOCAL_MEM_FENCE);
         
          current_frame[plain_position_global] += convert_uchar_sat(pool[position_local.y][position_local.x]);
         
          return;
      }
      
      

       

      This kernel processes a Full HD frame (1920 * 1088 pixels, to be precise), with each work-item transforming one pixel.
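
      To make the launch geometry concrete, here is a minimal host-side sketch in plain C. The helper names (work_items_per_frame, work_groups_per_frame) and the BLOCK_EDGE macro are made up for illustration and are not part of the kernel above; the sketch only assumes the 8x8 work-group size and the one-work-item-per-pixel mapping described in the post.

```c
#include <stddef.h>

/* Work-group edge length, matching reqd_work_group_size(8, 8, 1) in the kernel. */
#define BLOCK_EDGE 8

/* Total number of work-items when one work-item handles one pixel. */
static size_t work_items_per_frame(size_t width, size_t height)
{
    return width * height;
}

/*
 * Number of 8x8 work-groups needed to cover the frame.
 * Returns 0 when the frame is not a whole number of blocks in either
 * dimension, since the kernel requires an exact 8x8 work-group size.
 */
static size_t work_groups_per_frame(size_t width, size_t height)
{
    if (width % BLOCK_EDGE != 0 || height % BLOCK_EDGE != 0)
        return 0;
    return (width / BLOCK_EDGE) * (height / BLOCK_EDGE);
}
```

      For the 1920 * 1088 frame this gives 2,088,960 work-items in 32,640 work-groups (240 across, 136 down).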

       

      The input parameter is an array of iDCT coefficients of size 1920 * 1088.

      The input-output parameter is a picture of the same dimensions.

       

      I used a plain buffer instead of a 2D image, because I need read-write access to the data.

      The kernel works as desired. Using the event profiler, I found that the achieved bandwidth is about 5 Gb/s, which looks poor for a Radeon 6770.
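
      As a rough sketch of how such a bandwidth figure can be derived from profiling data, the plain C helpers below divide the bytes moved by the elapsed time. The function names are hypothetical, and the traffic model is an assumption rather than the method actually used for the measurement: per pixel it counts one 2-byte coefficient read plus one 1-byte pixel read and one 1-byte pixel write (from the += update). The timestamps are assumed to be the nanosecond values reported for CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END.

```c
#include <stdint.h>

/*
 * Bytes moved per frame under the assumed traffic model:
 * a 2-byte coefficient read, a 1-byte pixel read and a 1-byte pixel write
 * for every pixel.
 */
static uint64_t bytes_moved_per_frame(uint64_t width, uint64_t height)
{
    return width * height * (sizeof(int16_t) + 2 * sizeof(uint8_t));
}

/*
 * Achieved bandwidth in bytes per second.
 * start_ns and end_ns are profiling timestamps in nanoseconds.
 * Returns 0.0 for an empty or inverted interval.
 */
static double achieved_bandwidth(uint64_t bytes, uint64_t start_ns, uint64_t end_ns)
{
    if (end_ns <= start_ns)
        return 0.0;
    return (double)bytes * 1e9 / (double)(end_ns - start_ns);
}
```

      Under this model a 1920 * 1088 frame moves 8,355,840 bytes per kernel launch.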

      The bottleneck, AFAIK, is writing single bytes, which is too small a chunk of data.


      So the question is: can the compiler coalesce the memory accesses in this particular case and write a chunk of 8 bytes in a row?

       

      System spec:

      GPU - AMD 6770

      Ubuntu 12.04 x64

      OpenCL 1.1