Yeah, you are right. That means I need another method to perform the reduction at a global level. This is the reason that kernel is not working. Thanks a lot.
Is it possible to perform any other global reduction with more than one workgroup? It seems that using only one workgroup is the key to my problem.
So the fact that the Tesla machine returned the right values was just a coincidence. I will rewrite the code.
Thanks a lot again. Now I know what I shouldn't do in the future.
Mhh, what do you mean by "spawning only one workgroup"? Do you mean I should spawn a single workgroup the size of the whole array (so that there is only one)? If I do that, my device runs out of memory.
It looks like I need a completely different approach, or I let the CPU do the final reduction.
No... What I meant was this:
I wrote some host code that allocates memory, builds the program, sets the arguments, and calls your kernel.
In this host code, if I spawn only 1 workgroup, your code works fine.
The code snippet that you complained about works fine on a 7970 card here.
However, if I spawn multiple workgroups (i.e. increase the image size), the code does not return correct output.
This is because of the bug in the code that assumes global synchronization. OpenCL's barrier() only synchronizes work-items within a single workgroup; there is no barrier across workgroups within one kernel launch.
So, as you rightly said, you can let the CPU do the final reduction,
(or) spawn one more kernel with only 1 workgroup to do the final reduction alone.