Archives Discussions

ellenshobana · ‎10-15-2010

I am currently working with AMD OpenCL SDK v2.2 and have written a 2D FFT in OpenCL. The code gave correct results on both the CPU and the GPU when the work-group size was 64. After I increased the work-group size to 256, the modified code still gives correct results on the CPU but incorrect results on the GPU. Also, when I add a printf statement to the kernel, the code gives correct results on the GPU. If the code works on the CPU, shouldn't it be expected to work on the GPU as well?

nou · ‎10-15-2010

What GPU do you have? If it is a 4xxx-series card, you can only use a local size of 64, because there is an issue with barriers on larger work-groups.

MicahVillmow · ‎10-15-2010

ellenshobana,
This is not a correct assumption at all. The CPU runs all threads in a work-group sequentially. The GPU runs them in parallel. Most likely you have a race condition in your code with memory writes between threads. When you run with printf, the current implementation runs a smaller group at a time, which reduces the likelihood of a race condition.
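To make the point about sequential versus parallel execution concrete, here is a minimal plain-C simulation (not OpenCL, and not the FFT from this thread). It assumes a hypothetical kernel in which work-item i reads the local-memory slot written by work-item i-1 with no barrier in between. The names `GROUP_SIZE`, `kernel_read`, `kernel_write`, `run_sequential` and `run_lockstep` are invented for illustration, and the lockstep schedule is only one simplified model of how a GPU might interleave the work-items.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_SIZE 8  /* hypothetical small work-group, for illustration only */

/* Hypothetical kernel body split into its read half and its write half,
   so that a scheduler can interleave the halves of different work-items.
   Work-item i reads the slot that work-item i-1 writes (no barrier). */
static int kernel_read(const int *local, size_t i) {
    return i == 0 ? 0 : local[i - 1];
}

static void kernel_write(int *local, const int *in, size_t i, int left) {
    local[i] = left + in[i];
}

/* CPU-style schedule: each work-item runs to completion before the next starts. */
void run_sequential(const int *in, int *local) {
    assert(in != NULL && local != NULL);
    for (size_t i = 0; i < GROUP_SIZE; ++i) {
        local[i] = 0;
    }
    for (size_t i = 0; i < GROUP_SIZE; ++i) {
        int left = kernel_read(local, i);
        kernel_write(local, in, i, left);
    }
}

/* Lockstep schedule: every work-item performs its read before any of them writes. */
void run_lockstep(const int *in, int *local) {
    int left[GROUP_SIZE];
    assert(in != NULL && local != NULL);
    for (size_t i = 0; i < GROUP_SIZE; ++i) {
        local[i] = 0;
    }
    for (size_t i = 0; i < GROUP_SIZE; ++i) {
        left[i] = kernel_read(local, i);
    }
    for (size_t i = 0; i < GROUP_SIZE; ++i) {
        kernel_write(local, in, i, left[i]);
    }
}
```

With an input of all ones, the sequential schedule builds a running sum because each work-item sees its predecessor's finished result, while the lockstep schedule leaves every slot at 1 because all reads happen before any write. Code that silently depends on the first behaviour can look correct on the CPU and still be wrong on the GPU.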

rolandman99 · ‎10-18-2010

Hi Micah,

My case is the reverse: the code gives correct results on the GPU but not on the CPU. Could that be for the reason you gave ("The CPU runs all threads in a work-group sequentially. The GPU runs them in parallel. Most likely you have a race condition in your code with memory writes between threads")? Is that correct?

ellenshobana · ‎10-21-2010

I found the problem. The bug was that I was not passing a large enough local memory size to the kernel. This gives rise to another question: how come the CPU gives correct results even when we pass an insufficient local memory size?

nou · ‎10-21-2010

You are just writing outside the bounds of the array. It is the same as in C if you write beyond the array size: it can work, return gibberish, or crash.
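One way this kind of bug can arise is a local-memory size that was computed for a work-group of 64 and not updated when the group grew to 256. The sketch below is plain C and rests on assumptions: the thread does not show the actual kernel, so the layout (each work-item owning a fixed number of complex values stored as two floats, like `float2`) and the helper names `local_bytes_needed` and `fits_in_local` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes of __local memory needed when each work-item
   owns `elems_per_item` complex values, each stored as two floats. */
size_t local_bytes_needed(size_t work_group_size, size_t elems_per_item) {
    assert(work_group_size > 0 && elems_per_item > 0);
    return work_group_size * elems_per_item * 2 * sizeof(float);
}

/* Returns 1 if a buffer of `allocated_bytes` is large enough for the group, else 0. */
int fits_in_local(size_t allocated_bytes, size_t work_group_size,
                  size_t elems_per_item) {
    return local_bytes_needed(work_group_size, elems_per_item) <= allocated_bytes;
}
```

On the host side, the resulting byte count would be what gets passed for a `__local` pointer argument, as in `clSetKernelArg(kernel, arg_index, bytes, NULL)`, where the argument value must be NULL and only the size is used. Assuming a 4-byte float, a group of 64 with one complex value per work-item needs 512 bytes, and a group of 256 needs 2048 bytes, so a size left over from the smaller group is four times too small.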

MicahVillmow · ‎10-18-2010

This is possible: if you are relying on parallel execution for correctness, scalar execution can produce different results.
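The reverse case can be sketched the same way. The plain-C simulation below assumes a hypothetical kernel in which work-item i stores its input to local memory and then immediately loads its right neighbour's slot, with no barrier between the two steps. `GROUP_SIZE`, `rotate_sequential` and `rotate_phased` are invented names, and the local array is zeroed only to keep the simulation deterministic (real local memory would hold undefined values).

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_SIZE 4  /* hypothetical small work-group, for illustration only */

/* Sequential schedule with no barrier: work-item i stores in[i], then at once
   loads the slot of its right neighbour, which usually has not run yet. */
void rotate_sequential(const int *in, int *out) {
    int local[GROUP_SIZE] = {0};  /* zeroed only to make the simulation deterministic */
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < GROUP_SIZE; ++i) {
        local[i] = in[i];
        out[i] = local[(i + 1) % GROUP_SIZE];
    }
}

/* Phased schedule: all stores complete before any load. This is the ordering
   that lockstep hardware may happen to provide, and that a barrier guarantees. */
void rotate_phased(const int *in, int *out) {
    int local[GROUP_SIZE] = {0};
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < GROUP_SIZE; ++i) {
        local[i] = in[i];
    }
    /* A real kernel would need barrier(CLK_LOCAL_MEM_FENCE) at this point. */
    for (size_t i = 0; i < GROUP_SIZE; ++i) {
        out[i] = local[(i + 1) % GROUP_SIZE];
    }
}
```

In the sequential schedule only the last work-item sees a neighbour value that has already been stored, so the output is mostly stale data. In the phased schedule every work-item sees its neighbour's value. A kernel without the barrier may therefore appear correct on hardware that happens to run the work-items in lockstep and fail when they run one after another.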

MicahVillmow · ‎10-21-2010

ellenshobana,
The fact that it works on the CPU is pure coincidence. The GPU architecture is designed to drop out-of-bounds memory writes and to return zero on out-of-bounds reads. So if you don't allocate enough memory, the results will be incorrect. This is a side effect of having non-uniform memory address spaces.
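The behaviour described above can be modelled in plain C as a rough sketch. This is a simulation of the description in this reply, not of any documented hardware interface, and the type `checked_local` and the functions `checked_store`, `checked_load` and `sum_after_round_trip` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a local-memory allocation with bounds checking. */
typedef struct {
    float *data;
    size_t count;  /* number of floats actually allocated */
} checked_local;

/* Out-of-bounds stores are silently dropped. */
void checked_store(checked_local *mem, size_t index, float value) {
    assert(mem != NULL);
    if (index < mem->count) {
        mem->data[index] = value;
    }
}

/* Out-of-bounds loads return zero. */
float checked_load(const checked_local *mem, size_t index) {
    assert(mem != NULL);
    return index < mem->count ? mem->data[index] : 0.0f;
}

/* Each of `work_items` work-items stores 1.0 in its own slot; the slots are
   then read back and summed. The sum equals `work_items` only if every
   slot was really allocated. */
float sum_after_round_trip(checked_local *mem, size_t work_items) {
    float sum = 0.0f;
    for (size_t i = 0; i < work_items; ++i) {
        checked_store(mem, i, 1.0f);
    }
    for (size_t i = 0; i < work_items; ++i) {
        sum += checked_load(mem, i);
    }
    return sum;
}
```

With a buffer sized for 64 work-items, a group of 64 sums to 64, but a group of 256 also sums to 64, because the other 192 stores are dropped and their loads return zero. On a CPU the same out-of-bounds accesses would typically land in whatever ordinary memory follows the buffer, which is why the results there can look correct by accident.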

OpenCL code gives incorrect results on GPU but correct results on CPU