I am using ACML 5.3.1, 64-bit, on a NUMA machine and I get slightly different results from dgemm() depending on how many NUMA nodes I allow the program to run on.
When multiplying two 1280x1280 matrices, the result is accurate, and each node produces exactly the same result when the program is restricted to that node alone. But when the program runs on three nodes, for example, the results differ very slightly, starting at about the fifth significant digit. Should there be any difference at all?
The machine is an HP ProLiant DL385 G7 with four NUMA nodes, each with six cores. I use numactl to control which nodes the program runs on. The OS is Ubuntu 12.05.
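
To make "about the fifth significant digit" more concrete, here is a minimal sketch of how two result matrices could be compared element by element. It is plain Python rather than my actual ACML program, the function names `max_relative_difference` and `agreeing_digits` are made up for illustration, and the 2x2 values are invented stand-ins for two dgemm() outputs, not my real data.

```python
import math


def max_relative_difference(a, b):
    """Largest element-wise relative difference between two matrices.

    Both matrices are lists of rows and are assumed to have the same shape.
    """
    worst = 0.0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            scale = max(abs(x), abs(y))
            if scale == 0.0:
                continue  # both entries are zero, so they agree exactly
            worst = max(worst, abs(x - y) / scale)
    return worst


def agreeing_digits(rel_diff):
    """Rough number of matching significant decimal digits."""
    if rel_diff == 0.0:
        return math.inf  # bit-for-bit identical results
    return -math.log10(rel_diff)


# Invented values standing in for a one-node run and a three-node run.
single_node = [[1.0, 2.0], [3.0, 4.0]]
three_nodes = [[1.0, 2.0], [3.0, 4.00004]]

rel = max_relative_difference(single_node, three_nodes)
print("max relative difference:", rel)
print("approx. agreeing digits:", agreeing_digits(rel))
```

A relative measure is used here so that large and small matrix entries are judged on the same scale. Runs that agree exactly give a relative difference of zero.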