Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

Journeyman III

OpenCL kernel memory optimizations

Hi all.

I'm implementing integer Discrete Cosine Transformation (iDCT) kernel in OpenCL 1.1

The code looks like this:

typedef unsigned char pixel;

__kernel __attribute__(( reqd_work_group_size(8, 8, 1) ))

void cl_iDCT(

    __global pixel*       current_frame,

    __global const short* coefficients)


    const uint2

        position_global = (get_global_id(0), get_global_id(1)),

        position_local  = (get_local_id(0),  get_local_id(1));


    const unsigned int

        plain_position_global = position_global.x + position_global.y * get_global_size(0);


    //Local storage for iDCT transformation temporary coefficients

    __local int pool[8][8];


    //Copy coefficients

    pool[position_local.y][position_local.x] = coefficients[plain_position_global];



     *Do iDCT transformation, save result in pool[WI_Y][WI_X]




    current_frame[plain_position_global] += convert_uchar_sat(pool[WI_Y][WI_X]);




This kernel process FullHD frame (1920 * 1088 pixels to be precise), so that one work item do transformation for 1 pixel.

Input parameter is the array of iDCT coefficients of size 1920 * 1088

Input-output parameter is the picture of same dimensions.

I used plain buffer instead of clImage2D, because I need read-write access to buffer.

The kernel works as desired. Using the event profiler, I discovered, that achieved bandwidth is about 5 Gb/s, which looks poor for Radeon 6770.

The bottleneck AFAIK, is the writing single bytes, which is too small chunk of data.

So the question is - can compiler coalesce memory access in this particular case & write chunk of 8 bytes in a row?

System spec:

GPU - AMD 6770

Ubuntu 12.04 x64

OpenCL 1.1

4 Replies

Compile will not do that. You will have to do it.

Consider first writing to local memory as bytes (which is still not efficient though)...and then empty it into global memory using "int" writes.

Look at the Histogram APP SDK sample to see how you can swizzle the bytes so that you can escape local memory bank conflicts.

Can vector load/store be the solution?


vector load/store are essential for 6770, as it is VLIW architecture. Make sure each work-item processes atleast 4 pixels (probably a 4X4 area or maybe bigger). You cannot use VLIW properly without vectorization.

Unfortunately I have to say, do not check DCT Sample From AMD APP SDK. It seems to be written keeping GCN architecture in mind.


Load and store as "ints" should be good on GCN. However 6770 is non-gcn. So, You should actually be looking at writing/loading as vectors, say int4