I am trying to implement an optimized binary dilation on the GPU.
My setup is Ubuntu 10.04 64-bit, SDK 2.4, HD 5850.
As a first step, I want to copy all the data a workgroup needs into local memory.
The code basically looks like this:
__kernel __attribute__((reqd_work_group_size(16, 16, 1)))
void dilate_test(const __global uchar* const pImgIn, __global uchar* pImgOut, const int width)
{
    __local uchar buffer[bufferWidth * bufferHeight];
    for (int a = 0; a < bufferHeight; ++a)
    {
        for (int b = 0; b < bufferWidth / 16; ++b)
            vstore16(vload16(b, pImgIn + idx), b, buffer + idxBuffer);
        // copy the rest of row with single uchars
    }
    // end of kernel for testing
}
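To make the tile sizes concrete, here is a minimal host-side sketch of the geometry. It assumes that bufferWidth and bufferHeight are the workgroup size plus a (kernel size - 1) apron, which the snippet above does not show; `TileLayout` and `makeTileLayout` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: sizes of the local-memory tile for one workgroup.
struct TileLayout {
    int bufferWidth;         // tile width including the apron
    int bufferHeight;        // tile height including the apron
    int vecPerRow;           // vload16/vstore16 copies per tile row
    int tailPerRow;          // leftover bytes per row, copied as single uchars
    std::size_t localBytes;  // local memory used by one workgroup (uchar data)
};

// Assumption: the tile is the workgroup area plus (kernel - 1) extra
// pixels in each dimension, i.e. (kernel - 1) / 2 on every side.
TileLayout makeTileLayout(int groupW, int groupH, int kernelW, int kernelH) {
    assert(kernelW % 2 == 1 && kernelH % 2 == 1);  // odd sizes give a symmetric apron
    TileLayout t;
    t.bufferWidth  = groupW + kernelW - 1;
    t.bufferHeight = groupH + kernelH - 1;
    t.vecPerRow    = t.bufferWidth / 16;
    t.tailPerRow   = t.bufferWidth % 16;
    t.localBytes   = static_cast<std::size_t>(t.bufferWidth) * t.bufferHeight;
    return t;
}
```

Under these assumptions, `makeTileLayout(16, 16, 5, 5)` gives a 20 x 20 tile: one vload16 plus four single-uchar copies per row, and 400 bytes of local memory per workgroup.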
I profile against OpenCV with a 5 x 5 kernel and an image of (4096+4) x (4096+4) pixels. The workgroup size is 16 x 16.
The timings I get are:
But the OpenCL kernel here only copies to local memory, while OpenCV does the full dilation.
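For comparison, this is roughly the work the full dilation has to do per pixel. It is a plain CPU sketch, not OpenCV's implementation, and it assumes a full rectangular structuring element and an input that is already padded by the kernel radius on every side (which is how I read the "+4" in the image size). `dilateReference` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference binary dilation with a (2r+1) x (2r+1) rectangular element.
// in:  padded image of size (w + 2r) x (h + 2r), row-major, nonzero = set
// out: w x h image, 255 where any pixel under the window is set, else 0
std::vector<std::uint8_t> dilateReference(const std::vector<std::uint8_t>& in,
                                          int w, int h, int r) {
    const int pw = w + 2 * r;  // padded row length
    assert(in.size() == static_cast<std::size_t>(pw) * (h + 2 * r));
    std::vector<std::uint8_t> out(static_cast<std::size_t>(w) * h, 0);
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            std::uint8_t hit = 0;
            // Output (x, y) covers padded pixels [x, x + 2r] x [y, y + 2r].
            for (int dy = 0; dy <= 2 * r; ++dy)
                for (int dx = 0; dx <= 2 * r; ++dx)
                    hit |= in[static_cast<std::size_t>(y + dy) * pw + (x + dx)];
            out[static_cast<std::size_t>(y) * w + x] = hit ? 255 : 0;
        }
    }
    return out;
}
```

With r = 2 this is the 5 x 5 case: 25 reads per output pixel, compared with the single pass over the tile that the copy-only kernel performs.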
The OpenCL time covers only the span from enqueueNDRangeKernel to commandQueue.finish(); no upload or download is included. I also execute the kernel 30 times beforehand to "warm up" the environment, as I have read here a few times.
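The measurement scheme described above can be sketched like this. It is a generic wall-clock harness, not the exact code I use: `averageMilliseconds` and `effectiveGBps` are hypothetical helpers, and the callable passed in is assumed to wrap the enqueueNDRangeKernel call followed by commandQueue.finish().

```cpp
#include <cassert>
#include <chrono>

// Runs `run` warmupRuns times untimed, then timedRuns times timed,
// and returns the average wall-clock time per timed run in milliseconds.
template <typename Fn>
double averageMilliseconds(Fn&& run, int warmupRuns, int timedRuns) {
    assert(timedRuns > 0);
    for (int i = 0; i < warmupRuns; ++i) run();  // warm-up, not measured
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < timedRuns; ++i) run();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count() / timedRuns;
}

// Effective bandwidth in GB/s when `bytes` are moved in `ms` milliseconds.
double effectiveGBps(double bytes, double ms) {
    return bytes / (ms * 1e-3) / 1e9;
}
```

Converting the measured time into bandwidth makes it easier to judge a copy-only kernel: the (4096+4) x (4096+4) uchar input alone is 16,810,000 bytes, so `effectiveGBps(16810000.0, ms)` gives a lower bound on the global-memory read rate that can be compared against what the card should be able to deliver.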
Is my code wrong (performance-wise)? Or is my OpenCL setup somehow not working correctly?
After running several samples, I am fairly sure that the OpenCL setup itself is OK.