2 Replies Latest reply on Oct 29, 2013 2:20 AM by alec

    Inconsistent results from dgemm() with NUMA

    alec

      I am using ACML 5.3.1, 64-bit, on a NUMA machine and I get slightly different results from dgemm() depending on how many NUMA nodes I allow the program to run on.

       

      When I multiply two 1280x1280 matrices, the result is accurate, and each node produces exactly the same result when the program runs on that node alone. But when it runs on three nodes, for example, the results differ very slightly, in about the fifth significant place. Should there be a difference at all?

       

      The machine is an HP ProLiant DL385 G7 with four NUMA nodes, each with six processors. I use numactl to control the nodes the program runs on. The OS is Ubuntu 12.05.

        • Re: Inconsistent results from dgemm() with NUMA
          chipf

          I guess you are calling the OpenMP version of the library, and specifying OMP_NUM_THREADS?

          How many threads are you using?

          And are you using numactl to determine placement of threads? If you have fewer than 6 threads, does it make a difference whether the threads are all on one node or split across more than one NUMA node?

          I wouldn't expect any of these to affect the results in a particular way.

           

          Can you share a code example describing how you call dgemm?

            • Re: Inconsistent results from dgemm() with NUMA
              alec

              Thanks for the response, Chip.

               

              I am using the ACML library extracted from acml-5-3-1-gfortran-64bit.tar. I don't specify OMP_NUM_THREADS because I want to use as many threads as possible. Running on one node does give me six threads, according to the Linux System Monitor, but it's hard to read the graph when using more than one node.

               

              I use numactl to control whole nodes rather than particular numbers of threads on particular nodes. I didn't even know you could control particular threads on particular nodes, and it's not something I need to do in real life, but if it would help troubleshoot this problem I can try it.

               

              I do have a small C program that demonstrates the problem. It works like this:

               

              1. Read a 1280x1280 matrix from a file. The matrix is orthonormal.

              2. Use dgemm() to multiply the matrix by its transpose. The result should be a 1280x1280 identity matrix.

              3. Subtract 1.0 from the diagonal of the result and print the sum of the squares of the elements.
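
A minimal sketch of the three steps above, in Python rather than C and not the original program. Everything named here is an assumption made for illustration: the matrix is built in memory as a small Householder reflection instead of being read from the 13MB data file, and a plain triple loop stands in for dgemm().

```python
import math

def householder(n):
    """Build an n x n orthonormal matrix H = I - 2 v v^T / (v^T v)."""
    v = [math.sin(i + 1.0) for i in range(n)]   # arbitrary nonzero vector
    vv = sum(x * x for x in v)
    return [[(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / vv
             for j in range(n)] for i in range(n)]

def matmul_transpose(a):
    """Return a * a^T with a plain triple loop (a stand-in for dgemm)."""
    n = len(a)
    return [[sum(a[i][k] * a[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]

def identity_residual(c):
    """Subtract 1.0 from the diagonal, then sum the squares of all elements."""
    total = 0.0
    for i, row in enumerate(c):
        for j, x in enumerate(row):
            d = x - 1.0 if i == j else x
            total += d * d
    return total

q = householder(64)                       # step 1 (generated, not read from a file)
product = matmul_transpose(q)             # step 2
residual = identity_residual(product)     # step 3
print(f"{residual:.4e}")
```

The printed value should be tiny but is generally not exactly zero, because every element of the product carries a little rounding error.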

               

              The printed sum of the squares comes out at 2.2914e-26 when run on one node, which is near enough to zero. But when run on three nodes the sum is 2.2911e-26. The discrepancy is there for different matrix sizes, although the exact numbers vary of course.
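
One possible explanation, offered as an assumption and not as a statement about ACML internals: floating-point addition is not associative, so if a threaded dgemm() splits its accumulations differently depending on how many threads it gets, the rounded results can differ in the last bits. A sum of squares of about 2.29e-26 over 1280x1280 elements works out to an RMS error of roughly 1e-16 per element, so both runs sit at the level of double-precision rounding. The Python sketch below uses a hypothetical helper, chunked_sum(), to show that the same numbers added in differently sized groups can give different totals.

```python
import math

def chunked_sum(values, parts):
    """Sum `values` as `parts` contiguous partial sums, then add the partials.

    This loosely mimics a reduction split across workers. It is an
    illustration only, not how ACML is known to partition dgemm.
    """
    n = len(values)
    partials = []
    for p in range(parts):
        lo, hi = p * n // parts, (p + 1) * n // parts
        s = 0.0
        for x in values[lo:hi]:
            s += x
        partials.append(s)
    total = 0.0
    for s in partials:
        total += s
    return total

# Three values are enough to see the effect.
tiny = [0.1, 0.2, 0.3]
one_part = chunked_sum(tiny, 1)    # (0.1 + 0.2) + 0.3
two_parts = chunked_sum(tiny, 2)   # 0.1 + (0.2 + 0.3)
print(repr(one_part), repr(two_parts))

# A longer sum, the length of one row-by-column product for a 1280x1280 matrix.
products = [math.sin(k) * math.cos(3 * k) for k in range(1280)]
for parts in (1, 3, 6, 18):
    print(parts, repr(chunked_sum(products, parts)))
print("fsum", repr(math.fsum(products)))
```

If this is what is happening, neither result is more correct than the other, and fixing the thread count with OMP_NUM_THREADS may make runs repeatable, although that is something to verify on the machine in question.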

               

              I am happy to supply the code, which is only about 50 lines, and the data file, which is about 13MB, but I'm not sure how to do that on this site. Anyone who wants it can email me: alecdunn at ieee.org.