I've implemented an application that pipelines the results of 4 kernels as such:
input -> Kernel1 -> Kernel2 -> Kernel3 -> Kernel4-> results
I also set completion callbacks on kernels 2 and 4; each callback invokes a different function.
Each kernel has a unique kernel function: Kernel1 -> addOne; Kernel2 -> multByTwo; Kernel3 -> power; Kernel4 -> addFour.
If this pipeline is executed on a GPU, it works fine. However, on a CPU it crashes with a segmentation fault in __OpenCL_addOne_stub() (specifically at the instruction movss %xmm0,(%eax,%edx,4)). The ONLY modification I make is changing CL_DEVICE_TYPE_GPU to CL_DEVICE_TYPE_CPU.
Every API function that this application calls returns CL_SUCCESS, in both GPU and CPU execution. I've gone over the whole code three times and cannot find an error, which seems natural to me, since the same code works fine on the GPU.
Is this a bug?
Thanks for your replies.
Make sure you do not have a buffer overrun in your code. A quick way to check is to over-allocate the memory; if the error goes away, that is the problem. If it is, the reason it would not occur on the GPU is that GPU memory accesses are bounded, and writes outside of bounds are dropped.
I simplified the scenario to facilitate debugging. I now perform only one write, before Kernel1, to a single buffer holding one float (4 bytes). I've verified the offset passed to clEnqueueWriteBuffer as well as the size of the write (which is 4).
After that I over-allocated the buffer by 4000 bytes. Yet the error remains.
While it is possible to find the problem with just a binary, it is extremely difficult; providing the source (if possible), or at least a debug build with symbols, makes it vastly easier to determine what is going wrong.
It seems that it is not a bug, since I implemented a simplified version of my original application and it works fine on both CPU and GPU. Do you have any other ideas that could explain this behaviour?