Kernel with local memory usage gives different results on some hardware

To speed up processing of a few large arrays, I used shared/local memory to split the arrays into smaller blocks and to increase the kernel's execution domain.

It works on my dev host (C-60 Loveland) and also gives correct results on an HD6950 GPU, but some testers report wrong computations on some GPUs.


So far tested:

C-60 Loveland with OpenCL 1.2 AMD-APP (1268.1) driver (Windows) - correct results

HD6950 with OpenCL 1.2 AMD-APP (1348.5) driver (Windows) - correct results

HD7970/Tahiti with Catalyst 14.9 (Windows) - invalid results

Tahiti LE with Catalyst 14.12 / OpenCL 1.2 AMD-APP (1642.5) driver (Linux) - correct results

Hawaii Pro with Catalyst 14.9 / OpenCL 1.2 AMD-APP (1526.3) driver (Linux) - invalid results


It is not clear whether this is a driver-version issue, a card-architecture issue, or a problem in the kernel's code itself.


Here is the kernel in question:

It has debug output enabled, and different cards produce quite different output.


What is wrong here?


P.S. The kernel's local domain is always {x,1,z}, hence get_local_id(1) is not used inside the kernel. Also, the kernel produced correct results on HD7970 with work-group (local domain) sizes of (1,1,64) and (4,1,1) (the latter means no array splitting at all), but generated wrong results with (1,1,128).

So far I have not found any allowed work-group configuration that fails on C-60.
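
One pattern in the HD7970 numbers from the P.S.: the only failing local size, (1,1,128), is also the only one with more than 64 work-items. Assuming a wavefront size of 64 on that card (an assumption here, not a value queried from the driver), it would be the only case where one work-group spans more than one wavefront. A small Python sketch of that arithmetic follows; `wavefronts_per_group` and `WAVEFRONT_SIZE` are names made up for this illustration only.

```python
# Assumption: 64 work-items per wavefront on GCN cards such as HD7970.
# This value is not taken from the driver; adjust it for other hardware.
WAVEFRONT_SIZE = 64


def wavefronts_per_group(local_size, wavefront_size=WAVEFRONT_SIZE):
    """Return how many wavefronts one work-group of the given local size needs."""
    x, y, z = local_size
    work_items = x * y * z
    # Integer ceiling division: a partially filled wavefront still counts as one.
    return (work_items + wavefront_size - 1) // wavefront_size


# Local sizes and HD7970 results as listed in the P.S.
observed = {
    (1, 1, 64): "correct",
    (4, 1, 1): "correct",
    (1, 1, 128): "wrong",
}

for local_size, result in observed.items():
    count = wavefronts_per_group(local_size)
    print(f"local size {local_size}: {count} wavefront(s), result {result}")
```

If that is the relevant difference, the first thing to check would be any place where a work-item reads local memory written by another work-item without a barrier(CLK_LOCAL_MEM_FENCE) in between. This is only a guess and does not explain the driver-version differences by itself.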