I'm implementing integer Discrete Cosine Transformation (iDCT) kernel in OpenCL 1.1
The code looks like this:
typedef unsigned char pixel;
__kernel __attribute__(( reqd_work_group_size(8, 8, 1) ))
__global pixel* current_frame,
__global const short* coefficients)
position_global = (get_global_id(0), get_global_id(1)),
position_local = (get_local_id(0), get_local_id(1));
const unsigned int
plain_position_global = position_global.x + position_global.y * get_global_size(0);
//Local storage for iDCT transformation temporary coefficients
__local int pool;
pool[position_local.y][position_local.x] = coefficients[plain_position_global];
*Do iDCT transformation, save result in pool[WI_Y][WI_X]
current_frame[plain_position_global] += convert_uchar_sat(pool[WI_Y][WI_X]);
This kernel process FullHD frame (1920 * 1088 pixels to be precise), so that one work item do transformation for 1 pixel.
Input parameter is the array of iDCT coefficients of size 1920 * 1088
Input-output parameter is the picture of same dimensions.
I used plain buffer instead of clImage2D, because I need read-write access to buffer.
The kernel works as desired. Using the event profiler, I discovered, that achieved bandwidth is about 5 Gb/s, which looks poor for Radeon 6770.
The bottleneck AFAIK, is the writing single bytes, which is too small chunk of data.
So the question is - can compiler coalesce memory access in this particular case & write chunk of 8 bytes in a row?
GPU - AMD 6770
Ubuntu 12.04 x64
Compile will not do that. You will have to do it.
Consider first writing to local memory as bytes (which is still not efficient though)...and then empty it into global memory using "int" writes.
Look at the Histogram APP SDK sample to see how you can swizzle the bytes so that you can escape local memory bank conflicts.
vector load/store are essential for 6770, as it is VLIW architecture. Make sure each work-item processes atleast 4 pixels (probably a 4X4 area or maybe bigger). You cannot use VLIW properly without vectorization.
Unfortunately I have to say, do not check DCT Sample From AMD APP SDK. It seems to be written keeping GCN architecture in mind.