I guess you are calling the OpenMP version of the library, and specifying OMP_NUM_THREADS?
How many threads are you using?
And are you using numactl to determine placement of threads? If you have fewer than 6 threads, does it make a difference if the threads are all on one NUMA node versus being split across more than one NUMA node?
I wouldn't expect any of these to affect the results in a particular way.
Can you share a code example describing how you call dgemm?
Thanks for the response, Chip.
I am using the ACML library extracted from acml-5-3-1-gfortran-64bit.tar. I don't specify OMP_NUM_THREADS because I want to use as many threads as possible. Running on one node does give me six threads, according to the Linux System Monitor, but it's hard to read the graph when using more than one node.
I use numactl to control whole nodes rather than particular numbers of threads on particular nodes. I didn't even know you could control particular threads on particular nodes, and it's not something I need to do in real life, but if it would help troubleshoot this problem I can try it.
I do have a small C program that demonstrates the problem. It works like this:
1. Read a 1280x1280 matrix from a file. The matrix is orthonormal.
2. Use dgemm() to multiply the matrix by its transpose. The result should be a 1280x1280 identity matrix.
3. Subtract 1.0 from the diagonal of the result and print the sum of the squares of the elements.
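A minimal sketch of steps 2 and 3 follows. It is not the actual test program: the function names are hypothetical, column-major storage is assumed, and a plain triple loop stands in for dgemm() so that the sketch compiles without ACML. In the real program the loop nest would be a single dgemm() call with the second operand transposed.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for dgemm(): C = A * A^T for an n x n column-major matrix A.
   Hypothetical helper; the real program would call the library here. */
static void multiply_by_transpose(int n, const double *a, double *c)
{
    assert(n > 0);
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            /* C(i,j) = sum over k of A(i,k) * A(j,k) */
            for (int k = 0; k < n; k++)
                sum += a[i + (size_t)k * n] * a[j + (size_t)k * n];
            c[i + (size_t)j * n] = sum;
        }
    }
}

/* Step 3: subtract 1.0 on the diagonal and sum the squares of all elements. */
static double identity_residual(int n, const double *c)
{
    double total = 0.0;
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < n; i++) {
            double e = c[i + (size_t)j * n] - (i == j ? 1.0 : 0.0);
            total += e * e;
        }
    }
    return total;
}
```

With the matrix loaded into `a` and an output buffer `c` of the same size, calling `multiply_by_transpose(n, a, c)` and printing `identity_residual(n, c)` with `%.4e` would give a number in the same form as the sums quoted in this post.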
The printed sum of the squares comes out at 2.2914e-26 when run on one node, which is near enough to zero. But when run on three nodes the sum is 2.2911e-26. The discrepancy is there for different matrix sizes, although the exact numbers vary of course.
I am happy to supply the code, which is only about 50 lines, and the data file, which is about 13 MB, but I'm not sure how to do that on this site. Anyone who wants it can email me: alecdunn at ieee.org.