I am currently working with AMD OpenCL SDK v2.2 and have written a 2D FFT in OpenCL. The code gave correct results on both the CPU and the GPU when the work-group size was 64. After I increased the work-group size to 256, the modified code still gives correct results on the CPU but gives incorrect results on the GPU. Also, when I add a printf statement to the kernel, the code gives correct results on the GPU. If the code works on the CPU, shouldn't it be expected to work on the GPU as well?
My case is the reverse: the code gives correct results on the GPU but not on the CPU. Could it be for the reason you gave ("The CPU runs all threads in a work-group sequentially. The GPU runs them in parallel. Most likely you have a race condition in your code with memory writes between threads")? Is that correct?
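To make the quoted explanation concrete, here is a rough sketch in plain C (not OpenCL) that models the two execution orders. It is only an illustration under assumptions: the kernel body, GROUP_SIZE, and all function names are hypothetical and are not taken from anyone's FFT code. Each simulated work-item writes one slot of a shared buffer and then reads its neighbour's slot, with no barrier in between.

```c
#include <assert.h>

#define GROUP_SIZE 8  /* hypothetical work-group size, kept small for clarity */

/* Hypothetical kernel body, with no barrier between the two steps:
 *   step 1: lds[id] = id + 1;
 *   step 2: out[id] = lds[(id + 1) % GROUP_SIZE];
 */

/* Model A: each work-item runs to completion before the next one starts
 * (the "sequential" behaviour described in the quote). */
void run_sequential(int *out) {
    int lds[GROUP_SIZE] = {0};
    for (int id = 0; id < GROUP_SIZE; ++id) {
        lds[id] = id + 1;
        out[id] = lds[(id + 1) % GROUP_SIZE]; /* neighbour may not be written yet */
    }
}

/* Model B: all work-items finish step 1 before any of them starts step 2
 * (an idealised lock-step execution). */
void run_lockstep(int *out) {
    int lds[GROUP_SIZE] = {0};
    for (int id = 0; id < GROUP_SIZE; ++id)
        lds[id] = id + 1;
    for (int id = 0; id < GROUP_SIZE; ++id)
        out[id] = lds[(id + 1) % GROUP_SIZE];
}

/* Count the positions where the two result arrays differ. */
int count_mismatches(const int *a, const int *b, int n) {
    assert(n > 0);
    int diff = 0;
    for (int i = 0; i < n; ++i)
        if (a[i] != b[i])
            ++diff;
    return diff;
}
```

Calling run_sequential and run_lockstep on two arrays of GROUP_SIZE ints and comparing them with count_mismatches shows that the two orders disagree for this kernel body. In a real OpenCL kernel, a barrier(CLK_LOCAL_MEM_FENCE) between the write and the read is what makes the result independent of the execution order.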
I found the problem. The bug was that I was not passing a large enough local memory size to the kernel. This gives rise to another question: how can the CPU give correct results even when we pass an insufficient local memory size?
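For anyone hitting the same bug, here is a minimal sketch of how the local memory size might be derived from the work-group size instead of being hard-coded. It assumes, purely for illustration, that the kernel stores a fixed number of complex floats per work-item in a __local buffer. The helper name local_bytes_needed and the elems_per_item parameter are hypothetical; the actual requirement depends on how the FFT kernel is written.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of __local memory needed, assuming each work-item owns
 * `elems_per_item` complex floats (two floats per element).
 * This layout is an assumption for illustration only. */
size_t local_bytes_needed(size_t work_group_size, size_t elems_per_item) {
    assert(work_group_size > 0 && elems_per_item > 0);
    return work_group_size * elems_per_item * 2 * sizeof(float);
}

/* On the host, a __local kernel argument is sized by passing the byte
 * count together with a NULL value pointer, for example:
 *
 *   clSetKernelArg(kernel, arg_index, local_bytes_needed(256, 1), NULL);
 *
 * Here `kernel` and `arg_index` stand for whatever the host code uses.
 */
```

Under this assumption, going from a work-group size of 64 to 256 quadruples the required size, so a byte count that was correct for 64 becomes too small once the work-group size is raised.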