I have a kernel with two small local arrays
float a; float b;
I get the same results on CPU and GPU. Then I change the code to
__local float a; __local float b;
Suddenly I get very different results on GPU.
Why is this?
By default, variables are in the __private address space, which is per work-item. You then changed them to __local, which is shared by the whole work-group.
So the work-items are most likely overwriting each other's values in your kernel. A CPU runs work-items more serially than a GPU, so the bug in your code does not show up there.
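To make the overwrite concrete, here is a rough sketch in plain C rather than OpenCL, so it runs without any OpenCL runtime. It only simulates the effect: the work-group size of 4, the function names, and the arithmetic are all made up for illustration, and the "lockstep" loop is a simplified model of a GPU. On real hardware the __local version is a data race, so the actual result is not guaranteed to be any particular value.

```c
#include <stddef.h>

#define WORK_ITEMS 4  /* hypothetical work-group size, for illustration only */

/* __private-like: every work-item owns its own copy of `a`. */
void run_private(const float *in, float *out)
{
    float a[WORK_ITEMS];                      /* one slot per work-item */
    for (size_t i = 0; i < WORK_ITEMS; ++i)   /* step 1: each item writes its own a */
        a[i] = in[i] * 2.0f;
    for (size_t i = 0; i < WORK_ITEMS; ++i)   /* step 2: each item reads its own a */
        out[i] = a[i] + 1.0f;
}

/* __local-like, items modelled in lockstep (simplified GPU):
 * a single `a` is shared, and all items finish step 1 before step 2. */
void run_local_lockstep(const float *in, float *out)
{
    float a = 0.0f;                           /* one copy for the whole group */
    for (size_t i = 0; i < WORK_ITEMS; ++i)   /* step 1: every item overwrites a */
        a = in[i] * 2.0f;
    for (size_t i = 0; i < WORK_ITEMS; ++i)   /* step 2: all items see the last write */
        out[i] = a + 1.0f;
}

/* __local-like, items run one after another (simplified CPU):
 * each item writes and reads `a` before the next item starts,
 * so the sharing goes unnoticed. */
void run_local_serial(const float *in, float *out)
{
    float a = 0.0f;                           /* still one shared copy */
    for (size_t i = 0; i < WORK_ITEMS; ++i) {
        a = in[i] * 2.0f;                     /* step 1 for item i */
        out[i] = a + 1.0f;                    /* step 2 for item i, before anyone else writes */
    }
}
```

With the input {1, 2, 3, 4}, run_private and run_local_serial both produce {3, 5, 7, 9}, while run_local_lockstep produces {9, 9, 9, 9}. That is the same pattern as in the question: the results match on the CPU and differ on the GPU once the variables become __local. If the data really is meant to be per work-item, keeping it __private is the simplest fix.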
Shame on me!