I've written a simple program that pipelines 3 kernels (bitsplicing -> dilation -> erosion). The test image was Lena (not that it should matter). I use a threshold of 160, setting the bits to 0 or 1 before dilation/erosion.
In both the dilation and erosion algorithms I get a fully black/white image on the CPU if I set all four neighbours (up, down, left, right). If I set only 3 of them (e.g. up, left, right) I get a semi-correct result. It seems fairly clear that the algorithm somehow overwrites values, as it would if it ran in a for loop working in place on a single image, but I have 2 images (input, output).
Is this a compiler error or am I missing something?
Later edit:
The GPU (HD5470) worked without a problem. The CPU is a Core i3. OpenCL SDK 2.2, Linux Ubuntu 10.10 64-bit.
I've added the kernel code to make clear how I do dilation (erosion uses & instead of |), though it should make no difference since the GPU output is OK while the CPU output is not.
__kernel void dilate(__global int* input, __global int* output, int width, int height)
{
    uint x = get_global_id(0);
    uint y = get_global_id(1);
    // frame width
    int frameW = LEVEL;
    // frame array
    int frameA = LEVEL * LEVEL;
    int frameLeft = x;
    int frameBottom = y;
    int frameTop = ((x + frameW) < (width)) ? (x + frameW) : (width);
    int frameRight = ((y + frameW) < (height)) ? (y + frameW) : (height);
    // get center value (1/0)
    int center = input[(frameLeft + 1) + width * (frameBottom + 1)] / 255;
    // dilate
    output[(frameLeft + 1) + width * (frameBottom + 1)] = center * 255; // center (1,1)
    output[(frameLeft + 0) + width * (frameBottom + 1)] = ((input[(frameLeft + 0) + width * (frameBottom + 1)] / 255) | center) * 255; // up (0,1)
    output[(frameLeft + 1) + width * (frameBottom + 2)] = ((input[(frameLeft + 1) + width * (frameBottom + 2)] / 255) | center) * 255; // down (2,1)
    output[(frameLeft + 1) + width * (frameBottom + 0)] = ((input[(frameLeft + 1) + width * (frameBottom + 0)] / 255) | center) * 255; // left (1,0)
    output[(frameLeft + 2) + width * (frameBottom + 1)] = ((input[(frameLeft + 2) + width * (frameBottom + 1)] / 255) | center) * 255; // right (1,2)
}
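A side note on the kernel above: each work-item writes to its centre pixel and to the four neighbours, so several work-items can write different values to the same output pixel, and the final value may depend on execution order. One possible way around this is a gather formulation, where each output pixel is written exactly once from its own neighbourhood. The following is only a plain C reference sketch of that idea, not the original poster's code; the function name dilate_gather is made up for illustration, and it assumes pixels are stored as 0 or 255 as in the kernel.

```c
#include <stddef.h>

/* Hypothetical reference implementation (name and layout are assumptions).
 * Gather-style dilation with a cross-shaped, 4-neighbour structuring element.
 * Pixels are expected to be 0 or 255. Each output pixel is written once,
 * so no two iterations (or work-items, in a kernel port) touch the same
 * output location. Neighbours outside the image are simply skipped. */
static void dilate_gather(const int *input, int *output, int width, int height)
{
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int v = input[x + width * y];                    /* centre */
            if (x > 0)          v |= input[(x - 1) + width * y];   /* left  */
            if (x < width - 1)  v |= input[(x + 1) + width * y];   /* right */
            if (y > 0)          v |= input[x + width * (y - 1)];   /* up    */
            if (y < height - 1) v |= input[x + width * (y + 1)];   /* down  */
            output[x + width * y] = v;
        }
    }
}
```

In a kernel port the two loops would presumably be replaced by get_global_id(0) and get_global_id(1), with the same bounds checks. Erosion would use & instead of |, with some chosen policy for pixels on the image border.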
I managed to find the problem. I pipelined by passing the same memory address (output) to the kernel (big oops). The CPU failed because it was reading and writing the same memory, while the GPU executed correctly. What's more interesting is that the GPU managed to read correctly from the output memory, which was of type WRITE_ONLY.
So, my bad, but interestingly enough the GPU still produced the intended result.
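To illustrate why sharing one buffer between input and output can flood the image, here is a small plain C model. It is only a sketch under assumptions: it processes pixels sequentially in raster order, which is not necessarily how a CPU OpenCL device schedules work-items, and all function names are hypothetical. With distinct buffers a single white pixel grows by one step; with aliased buffers each pixel reads neighbours that were already dilated, so the white region keeps spreading within the same pass.

```c
#include <stddef.h>

/* One 4-neighbour dilation pass in raster order. src and dst are allowed
 * to point to the same memory, which is exactly the mistake being modelled.
 * Pixels are 0 or 255. */
static void dilate_pass(const int *src, int *dst, int width, int height)
{
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int v = src[x + width * y];
            if (x > 0)          v |= src[(x - 1) + width * y];
            if (x < width - 1)  v |= src[(x + 1) + width * y];
            if (y > 0)          v |= src[x + width * (y - 1)];
            if (y < height - 1) v |= src[x + width * (y + 1)];
            dst[x + width * y] = v;
        }
    }
}

/* Counts pixels equal to 255. */
static int count_white(const int *img, int n)
{
    int count = 0;
    for (int i = 0; i < n; ++i)
        if (img[i] == 255)
            ++count;
    return count;
}

/* Correct usage: distinct input and output buffers. */
static int white_after_separate(const int *img, int *scratch, int width, int height)
{
    dilate_pass(img, scratch, width, height);
    return count_white(scratch, width * height);
}

/* Faulty usage: the same buffer is passed as both input and output. */
static int white_after_aliased(int *img, int width, int height)
{
    dilate_pass(img, img, width, height);
    return count_white(img, width * height);
}
```

For a 4x4 image with only the top-left pixel set, the separate-buffer version leaves 3 white pixels, while the aliased version turns all 16 white in a single pass.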
lupescu_grigore,
Thanks for sharing this issue.
Reading in a kernel from a buffer created with the CL_MEM_WRITE_ONLY flag should raise an error. The issue has been reported to the developers.
If I use only CL_MEM_READ_WRITE instead of READ_ONLY + WRITE_ONLY, how does that affect the app? Is there a performance difference?
I checked by using a clEnqueueWriteBuffer command to write 4 KB of data, repeated for 25 iterations.
The mean times were:
0.114 ms for CL_MEM_READ_WRITE
0.122 ms for CL_MEM_WRITE_ONLY
So it seems the flags don't affect performance much (a difference of roughly 7% in this test).