I've implemented an application that chains four kernels into a pipeline, as follows:
input -> Kernel1 -> Kernel2 -> Kernel3 -> Kernel4 -> results
I also registered completion callbacks for Kernel2 and Kernel4; each callback calls a different function.
Each kernel has a unique kernel function: Kernel1 -> addOne; Kernel2 -> multByTwo; Kernel3 -> power; Kernel4 -> addFour.
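To make the expected results concrete, here is a minimal host-side sketch in plain C of what the pipeline should compute per element. It is only an illustration under assumptions: the arithmetic is inferred from the kernel names (in particular, power is assumed to square its input), and the helper names add_one, mult_by_two, add_four and run_pipeline_reference are hypothetical, not taken from the actual application.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side stand-ins for the four kernel bodies, applied to one element.
   The exact arithmetic is an assumption inferred from the kernel names. */
static float add_one(float x)     { return x + 1.0f; }   /* Kernel1: addOne    */
static float mult_by_two(float x) { return x * 2.0f; }   /* Kernel2: multByTwo */
static float power(float x)       { return x * x; }      /* Kernel3: power (assumed to square) */
static float add_four(float x)    { return x + 4.0f; }   /* Kernel4: addFour   */

/* Applies the whole pipeline in place to n floats, in kernel order. */
static void run_pipeline_reference(float *data, size_t n)
{
    assert(data != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        data[i] = add_four(power(mult_by_two(add_one(data[i]))));
    }
}
```

Running the same input through this reference and through the device pipeline gives a simple way to compare the GPU output element by element.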
If this pipeline is executed on a GPU, it works fine. However, if it is executed on a CPU, it crashes with a segmentation fault in __OpenCL_addOne_stub() (specifically at the instruction movss %xmm0,(%eax,%edx,4)). The ONLY modification I make is changing CL_DEVICE_TYPE_GPU to CL_DEVICE_TYPE_CPU.
Every API function that this application calls returns CL_SUCCESS, on both the GPU and the CPU. I've gone over the whole code 3 times and I cannot find an error, which seems natural to me, since the same code works fine on the GPU.
Is this a bug in the CPU implementation, or am I missing something?
Thanks for your replies.