
OpenCL kernel memory optimizations

Question asked by roman.arzum on Jun 5, 2013
Latest reply on Jun 6, 2013 by himanshu.gautam

Hi all.


I'm implementing an integer Discrete Cosine Transform (iDCT) kernel in OpenCL 1.1.

The code looks like this:


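typedef unsigned char pixel;

// Assumption: WI_X / WI_Y are not defined in this post; taken to be the local coordinates
#define WI_X (position_local.x)
#define WI_Y (position_local.y)

__kernel __attribute__(( reqd_work_group_size(8, 8, 1) ))
void cl_iDCT(
    __global pixel*       current_frame,
    __global const short* coefficients)
{
    // (uint2)(...) builds a vector; a bare (a, b) would be the comma operator
    const uint2
        position_global = (uint2)(get_global_id(0), get_global_id(1)),
        position_local  = (uint2)(get_local_id(0),  get_local_id(1));
    const unsigned int
        plain_position_global = position_global.x + position_global.y * get_global_size(0);
    // Local storage for iDCT transformation temporary coefficients
    __local int pool[8][8];
    // Copy coefficients
    pool[position_local.y][position_local.x] = coefficients[plain_position_global];
    // Make the whole 8x8 block visible to every work-item before transforming
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Do iDCT transformation, save result in pool[WI_Y][WI_X] */

    current_frame[plain_position_global] += convert_uchar_sat(pool[WI_Y][WI_X]);
}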


This kernel processes a Full HD frame (1920 * 1088 pixels, to be precise), so that one work-item does the transformation for one pixel.


The input parameter is an array of iDCT coefficients of size 1920 * 1088.

The input-output parameter is a picture of the same dimensions.
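
To make the launch geometry concrete, here is a small host-side sketch in plain C (not OpenCL C). The names `FRAME_W`, `FRAME_H`, `BLOCK`, `groups_along` and `linear_index` are illustrative and do not come from the post; only the frame size, the 8x8 work-group size and the row-major index formula are taken from it.

```c
#include <assert.h>
#include <stddef.h>

/* Frame and work-group sizes quoted in the post. */
enum { FRAME_W = 1920, FRAME_H = 1088, BLOCK = 8 };

/* Number of 8-wide work-groups along one dimension.
   The global size has to be a multiple of the work-group size. */
static size_t groups_along(size_t global_size) {
    assert(global_size % BLOCK == 0);
    return global_size / BLOCK;
}

/* Row-major linear index, the same formula as plain_position_global. */
static size_t linear_index(size_t x, size_t y, size_t width) {
    assert(x < width);
    return x + y * width;
}
```

With these numbers the frame has 2,088,960 work-items in 240 x 136 = 32,640 work-groups. Two vertically adjacent pixels of one 8x8 block are 1920 elements apart in memory, so each work-group touches 8 separate runs of 8 consecutive elements.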


I used a plain buffer instead of a 2D image (image2d_t), because I need read-write access to the data.

The kernel works as desired. Using the event profiler, I found that the achieved bandwidth is about 5 Gb/s, which looks poor for a Radeon 6770.
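
The post does not say how the bandwidth figure was derived. One plausible way, sketched here in plain C, is to divide the bytes moved per frame by the kernel time from the profiling timestamps (nanoseconds, as returned by `clGetEventProfilingInfo` for `CL_PROFILING_COMMAND_START` and `CL_PROFILING_COMMAND_END`). The function name and the traffic of 4 bytes per pixel (2-byte coefficient read, 1-byte pixel read, 1-byte pixel write) are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed traffic per pixel: 2-byte coefficient read
   + 1-byte pixel read + 1-byte pixel write. */
enum { BYTES_PER_PIXEL_TRAFFIC = 4 };

/* Achieved bandwidth in GB/s (1 GB = 1e9 bytes) from two
   profiling timestamps given in nanoseconds. */
static double achieved_gb_per_s(uint64_t pixels,
                                uint64_t start_ns, uint64_t end_ns) {
    assert(end_ns > start_ns);
    double bytes = (double)pixels * BYTES_PER_PIXEL_TRAFFIC;
    double ns    = (double)(end_ns - start_ns);
    return bytes / ns;  /* bytes per nanosecond equals GB per second */
}
```

Under that assumption a 1920 * 1088 frame moves 8,355,840 bytes, so a kernel time of about 1.67 ms corresponds to 5 GB/s.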

AFAIK the bottleneck is writing single bytes, each of which is too small a chunk of data.

So the question is: can the compiler coalesce the memory accesses in this particular case and write a chunk of 8 bytes in a row?
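
Whatever the compiler does, one way to make the 8-byte write explicit would be to let a single work-item update a whole row of 8 pixels. The following is a plain C reference sketch of that per-row update, not kernel code; `sat_u8` and `update_row8` are hypothetical names. In OpenCL C the two `memcpy` calls would correspond roughly to `vload8` / `vstore8` on a `uchar8`. Whether this actually raises the bandwidth on this GPU is not established here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same clamp that convert_uchar_sat applies to an int: limit to [0, 255]. */
static uint8_t sat_u8(int v) {
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical per-row update: add 8 transform results to 8 pixels
   and write them back as one 8-byte chunk. */
static void update_row8(uint8_t *frame_row, const int results[8]) {
    uint8_t packed[8];
    assert(frame_row != NULL && results != NULL);
    memcpy(packed, frame_row, sizeof packed);   /* one 8-byte read */
    for (int i = 0; i < 8; ++i) {
        /* Mirrors `pixel += convert_uchar_sat(value)` from the kernel:
           clamp first, then the unsigned char addition wraps modulo 256. */
        packed[i] = (uint8_t)(packed[i] + sat_u8(results[i]));
    }
    memcpy(frame_row, packed, sizeof packed);   /* one 8-byte write */
}
```

The sketch keeps the arithmetic of the original kernel unchanged, so it can serve as a CPU reference when comparing a byte-wise and a row-wise kernel variant.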


System spec:

GPU - AMD 6770

Ubuntu 12.04 x64

OpenCL 1.1