I am trying to implement an optimized binary dilation on the GPU.
My setup is Ubuntu 10.04 64-bit, SDK 2.4, HD 5850.
First of all I wanted to copy all the data for a workgroup into local memory.
The code basically looks like this:
__kernel __attribute__((reqd_work_group_size(16, 16, 1)))
void dilate_test(const __global uchar* const pImgIn, __global uchar* pImgOut, const int width)
{
    __local uchar buffer[bufferWidth * bufferHeight];
    for (int a = 0; a < bufferHeight; ++a)
    {
        for (int b = 0; b < bufferWidth / 16; ++b)
            vstore16(vload16(b, pImgIn + idx), b, buffer + idxBuffer);
        // copy the rest of row with single uchars
    }
    // end of kernel for testing
}
The timings I get are:
But the OpenCL kernel here only copies to local memory, while OpenCV does the full dilation.
The OpenCL time covers only enqueueNDRangeKernel through commandQueue.finish(); upload and download are not included. I also execute the kernel 30 times to "warm up" the environment, as I have read here a few times.
Is my code wrong (performance-wise)? Or is the OpenCL setup somehow not working correctly?
Having run several samples, I am fairly sure that the OpenCL setup is OK.
OpenCL is not a very suitable option when there is not enough computation to be done on the GPU. Global memory fetches are generally very time-consuming, and a kernel that does little besides moving data is dominated by them.
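If every work-item in the group runs the whole copy loop, the same bytes are fetched from global memory many times over. One common way around that is to let each work-item load only its share of the tile and then synchronize. Below is a minimal sketch of that idea, not a tuned implementation. It assumes a 3x3 structuring element (so a 1-pixel halo), a row-major image with one uchar per pixel, and an extra height argument; the kernel name and the LSIZE, HALO and TILE macros are made up for illustration.

```opencl
// Assumptions (not from the original post): 3x3 structuring element,
// row-major image with one uchar per pixel, global size rounded up to a
// multiple of 16 in both dimensions.
#define LSIZE 16
#define HALO  1
#define TILE  (LSIZE + 2 * HALO)

__kernel __attribute__((reqd_work_group_size(LSIZE, LSIZE, 1)))
void dilate_tile_sketch(const __global uchar* pImgIn,
                        __global uchar* pImgOut,
                        const int width,
                        const int height)
{
    __local uchar tile[TILE * TILE];

    const int lx  = get_local_id(0);
    const int ly  = get_local_id(1);
    const int lid = ly * LSIZE + lx;   // 0..255 within the work-group

    // Top-left corner of the tile (halo included) in image coordinates.
    const int x0 = (int)get_group_id(0) * LSIZE - HALO;
    const int y0 = (int)get_group_id(1) * LSIZE - HALO;

    // The 256 work-items share the 18 * 18 = 324 loads: one or two each.
    for (int i = lid; i < TILE * TILE; i += LSIZE * LSIZE)
    {
        // Clamp to the image border; replicating edge pixels does not
        // change the result of a dilation.
        const int x = min(max(x0 + i % TILE, 0), width - 1);
        const int y = min(max(y0 + i / TILE, 0), height - 1);
        tile[i] = pImgIn[y * width + x];
    }

    // Make the whole tile visible to every work-item before reading it.
    barrier(CLK_LOCAL_MEM_FENCE);

    // 3x3 dilation: the output is set if any neighbour is set.
    uchar result = 0;
    for (int dy = 0; dy <= 2 * HALO; ++dy)
        for (int dx = 0; dx <= 2 * HALO; ++dx)
            result = max(result, tile[(ly + dy) * TILE + (lx + dx)]);

    const int gx = (int)get_global_id(0);
    const int gy = (int)get_global_id(1);
    if (gx < width && gy < height)
        pImgOut[gy * width + gx] = result;
}
```

This version uses scalar loads for simplicity. Whether a vload16 variant is faster on a given device is something to measure rather than assume.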
I am not well versed in OpenCV, so I can only guess.
My guess is that OpenCV uses a different access pattern (image-like), which can be faster than the linear access you use in OpenCL. I suggest you try using images in OpenCL; it might help.
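To give an idea of what the image route could look like, here is a rough sketch of a dilation kernel that reads through a sampler instead of a linear buffer. It assumes a 3x3 structuring element and that both images were created on the host with a single-channel, unsigned 8-bit format (whether such a format is supported depends on the device). The kernel and variable names are made up, and I have not measured whether it is actually faster on an HD 5850.

```opencl
// Assumption: imgIn and imgOut use a single-channel unsigned 8-bit format,
// so the pixel value arrives in the .x component.
__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE |
                           CLK_FILTER_NEAREST;

__kernel void dilate_image_sketch(__read_only image2d_t imgIn,
                                  __write_only image2d_t imgOut)
{
    const int2 pos = (int2)((int)get_global_id(0), (int)get_global_id(1));

    // Skip work-items that fall outside the image when the global size
    // has been rounded up.
    if (pos.x >= get_image_width(imgIn) || pos.y >= get_image_height(imgIn))
        return;

    // 3x3 dilation: take the maximum over the neighbourhood. The sampler
    // clamps out-of-range coordinates to the nearest edge pixel.
    uint result = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            result = max(result,
                         read_imageui(imgIn, smp, pos + (int2)(dx, dy)).x);

    write_imageui(imgOut, pos, (uint4)(result, 0, 0, 0));
}
```

With images, the border handling comes from the sampler, so the kernel needs no explicit clamping or local-memory staging.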