Hello,
I am trying to do an optimized binary dilation on the GPU.
My setup is Ubuntu 10.04 64-bit, SDK 2.4, HD 5850.
As a first step, I want to copy all the data a workgroup needs into local memory.
The code basically looks like this:
__kernel __attribute__((reqd_work_group_size(16, 16, 1)))
void dilate_test(const __global uchar* const pImgIn, __global uchar* pImgOut, const int width)
{
    // initialization ...
    __local uchar buffer[bufferWidth * bufferHeight];
    for (int a = 0; a < bufferHeight; ++a) {
        for (int b = 0; b < bufferWidth / 16; ++b) {
            vstore16(vload16(b, pImgIn + idx), b, buffer + idxBuffer);
        }
        // copy the rest of row with single uchars ...
    }
    mem_fence(CLK_LOCAL_MEM_FENCE);
    // end of kernel for testing
}
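To make the buffer dimensions concrete, here is a small host-side sketch in plain C of how the local tile size could be derived. It assumes that bufferWidth and bufferHeight are the workgroup extent plus a halo of kernel_size / 2 pixels on each side; the snippet above does not show how they are defined, so this is a guess. The helper names (halo_radius, tile_extent, and so on) are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Halo width on each side of the tile for an odd, square structuring element.
   Assumption: the tile is padded by kernel_size / 2 pixels per side. */
static int halo_radius(int kernel_size)
{
    assert(kernel_size > 0 && kernel_size % 2 == 1);
    return kernel_size / 2;
}

/* Tile extent along one axis: workgroup extent plus the halo on both sides. */
static int tile_extent(int group_extent, int kernel_size)
{
    return group_extent + 2 * halo_radius(kernel_size);
}

/* Local memory needed for one tile of uchar pixels, in bytes. */
static size_t tile_bytes(int group_w, int group_h, int kernel_size)
{
    return (size_t)tile_extent(group_w, kernel_size) *
           (size_t)tile_extent(group_h, kernel_size);
}

/* Number of full 16-byte vector copies that fit in one tile row. */
static int vec16_per_row(int row_bytes)
{
    return row_bytes / 16;
}

/* Bytes left over in a row that have to be copied one uchar at a time. */
static int tail_bytes_per_row(int row_bytes)
{
    return row_bytes % 16;
}
```

Under these assumptions, a 16 x 16 workgroup with a 5 x 5 kernel gives a 20 x 20 tile (400 bytes), so each row takes one 16-wide vector copy plus 4 single uchars, which would match the "rest of row" step in the kernel.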
I profile against OpenCV with a kernel size of 5 x 5 and an image of (4096+4) x (4096+4) pixels. The workgroup size is 16 x 16.
The timings I get are:
0.059553 OpenCL
0.015248 OpenCV
But the OpenCL kernel here only copies to local memory, while OpenCV does the full dilation.
The OpenCL time covers only the span from enqueueNDRangeKernel to commandQueue.finish(); no upload/download is included. I also execute the kernel 30 times beforehand to "warm up" the environment, as I have read here a few times.
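To put the number in perspective, a rough sketch in plain C of the effective throughput it implies. It assumes the timings are in seconds and cover a single kernel launch over the whole 4100 x 4100 uchar image; neither is stated explicitly above. The function name is made up for illustration.

```c
/* Effective throughput in MB/s (1 MB = 1e6 bytes) for one pass over a
   width x height image with one byte per pixel.
   Assumption: `seconds` is the time of a single kernel launch. */
static double effective_mb_per_s(long width, long height, double seconds)
{
    return (double)width * (double)height / seconds / 1e6;
}
```

For example, effective_mb_per_s(4100, 4100, 0.059553) comes out at roughly 280 MB/s of unique input data, which counts each pixel once and ignores the halo pixels that neighbouring workgroups read again.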
Is my code wrong (performance-wise)? Or is the OpenCL setup somehow not working correctly?
Update:
After running several samples, I am fairly sure that the OpenCL setup is OK.