
Kernel with local memory usage gives different results on some hardware

Question asked by Raistmer on Mar 25, 2015
Latest reply on Apr 27, 2015 by dipak

Trying to speed up the processing of a few large arrays, I used shared/local memory to split the arrays into smaller blocks and to increase the kernel's execution domain.

It works on my dev host (C-60 Loveland) and also gives correct results on an HD6950 GPU, but some testers report wrong computations on other GPUs.

 

So far tested:

C-60 Loveland with OpenCL 1.2 AMD-APP (1268.1) driver (Windows) - correct results

HD6950 with OpenCL 1.2 AMD-APP (1348.5) driver (Windows) - correct results

HD7970/Tahiti with Catalyst 14.9 (Windows) - invalid results

Tahiti LE with Catalyst 14.12 / OpenCL 1.2 AMD-APP (1642.5) driver (Linux) - correct results

Hawaii Pro with Catalyst 14.9 / OpenCL 1.2 AMD-APP (1526.3) driver (Linux) - invalid results

 

It is not clear whether this is a driver-version issue, a card-architecture issue, or a problem in the kernel code itself.

 

Here is the kernel in question: http://pastebin.com/c9sX8Xwj

It has debug output enabled, and different cards produce quite different output.

 

What is wrong here?

 

P.S. The kernel's local domain is always {x,1,z}, hence local id(1) is never used inside the kernel. Also, on the HD7970 the kernel produced correct results with work-group/local domains of (1,1,64) and (4,1,1) (the latter means no array splitting at all), but generated wrong results with (1,1,128).

So far I have not found any allowed WG config that fails on the C-60.
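
One pattern worth checking in the HD7970 numbers: the two local sizes that pass both fit into a single wavefront, while the failing one does not. The sketch below is only a host-side helper for that arithmetic, not part of the kernel. The function name `wavefronts_per_group` is made up for illustration, and the wavefront width of 64 is an assumption that holds for GCN parts such as Tahiti and Hawaii (on a real device, `CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE` from `clGetKernelWorkGroupInfo` typically reports it).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: number of wavefronts needed to hold one work-group
 * with local size {x, y, z}. wavefront_size is an assumption supplied by
 * the caller (64 on GCN hardware), not something read from the device here.
 */
static size_t wavefronts_per_group(size_t x, size_t y, size_t z,
                                   size_t wavefront_size)
{
    assert(wavefront_size > 0);

    size_t work_items = x * y * z;

    /* Round up: a partially filled wavefront still occupies a whole one. */
    return (work_items + wavefront_size - 1) / wavefront_size;
}
```

With a width of 64, `wavefronts_per_group(1, 1, 64, 64)` and `wavefronts_per_group(4, 1, 1, 64)` both give 1, while `wavefronts_per_group(1, 1, 128, 64)` gives 2. That does not prove anything by itself, but if the kernel reads local memory written by another work-item without a `barrier(CLK_LOCAL_MEM_FENCE)` in between, it could work by luck inside one wavefront and break as soon as a work-group spans two.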
