I am currently working with AMD OpenCL SDK v2.2. I have written a 2D FFT in OpenCL. The code gave correct results on both the CPU and the GPU when the work-group size was 64. I then increased the work-group size to 256. The modified code still gives correct results on the CPU but incorrect results on the GPU. Also, when I add a printf statement to the kernel, the code gives correct results on the GPU. If the code works on the CPU, shouldn't it be expected to work on the GPU as well?
What GPU do you have? If you have a 4xxx series card, you can only use a local size of 64, because there is an issue with barriers on larger work-groups.
My case is the reverse: the code gives correct results on the GPU but not on the CPU. Maybe it is for the reason you gave ("The CPU runs all threads in a work-group sequentially. The GPU runs them in parallel. Most likely you have a race condition in your code with memory writes between threads"). Is that correct?
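To illustrate the quoted explanation, here is a toy model in plain C (not OpenCL, and not how any particular runtime actually schedules work-items). It assumes a hypothetical kernel body in which each work-item writes its own slot of a shared array and then reads its left neighbour's slot. The names WG_SIZE, run_without_barrier and run_with_barrier are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define WG_SIZE 8 /* hypothetical work-group size for this toy model */

/* No barrier: each simulated work-item runs start to finish, one after
 * another, in the order given. A read may see a slot that its owner has
 * not written yet, so the result depends on the execution order. */
void run_without_barrier(const int *in, int *out, const size_t *order)
{
    int local[WG_SIZE] = {0}; /* stands in for __local memory */
    for (size_t k = 0; k < WG_SIZE; ++k) {
        size_t i = order[k];
        assert(i < WG_SIZE);
        local[i] = in[i];                             /* write own slot */
        out[i] = local[(i + WG_SIZE - 1) % WG_SIZE];  /* read left neighbour */
    }
}

/* With a barrier: every write completes before any read starts, so the
 * result no longer depends on the execution order. */
void run_with_barrier(const int *in, int *out)
{
    int local[WG_SIZE];
    for (size_t i = 0; i < WG_SIZE; ++i)
        local[i] = in[i];
    /* In a kernel, barrier(CLK_LOCAL_MEM_FENCE) would sit here. */
    for (size_t i = 0; i < WG_SIZE; ++i)
        out[i] = local[(i + WG_SIZE - 1) % WG_SIZE];
}
```

In this model, running the items in ascending order gets every output right except the first, while running them in descending order gets only the first one right. A race like this can therefore look fine on one device and fail on another.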
I found the problem. The bug was that I was not passing a large enough local memory size to the kernel. This raises another question: how come the CPU gives correct results even when we pass an insufficient local memory size?
You are just writing outside the bounds of the array. It is the same as in C when you write beyond the array size: it can work, return gibberish, or crash.
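For reference, the size of a __local kernel argument is given in bytes through clSetKernelArg with a NULL value pointer, so it has to grow with the work-group size. Below is a minimal sketch of that calculation. It assumes each work-item stages a fixed number of complex values stored as two floats, which is a guess, since the actual FFT kernel's layout is not shown in this thread. The helper name local_bytes_needed is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes of __local memory one work-group needs,
 * ASSUMING every work-item stages `points_per_item` complex values of
 * two floats each. Adjust to the kernel's real layout. */
size_t local_bytes_needed(size_t work_group_size, size_t points_per_item)
{
    /* A zero-sized __local argument is not valid, so reject it early. */
    assert(work_group_size > 0 && points_per_item > 0);
    return work_group_size * points_per_item * 2 * sizeof(float);
}

/* On the host, the size is passed with a NULL value pointer, e.g.
 *   clSetKernelArg(kernel, arg_index, local_bytes_needed(256, 1), NULL);
 * where `kernel` and `arg_index` are placeholders for your own values. */
```

Under that assumption, going from 64 to 256 work-items quadruples the required size. If the host still passed the size computed for 64, the kernel would index well past the end of the allocation, which is the out-of-bounds write described above.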