0 Replies Latest reply on Dec 8, 2017 3:44 PM by drnil

    Benchmarking float64 matrix multiplication performance

    drnil

      My primary interest in GPUs is for "scientific computing", or more precisely, float64 general matrix multiplications, also known as DGEMM. This is the speed-determining factor in my applications: if DGEMM runs N times faster, my programs will also run N times faster. Double precision is a necessity here; things will just not work in single precision.

       

      I found that published FLOPS numbers do not reflect the speed of my matrix multiplications very well. Curious about what performance to expect, and how performance has developed over the last five years, I benchmarked SGEMM/DGEMM on a few configurations; the results are shown in the table below. I measured the elapsed time for multiplying two 2400x2400 matrices of uniformly distributed random numbers between 0 and 10 ("DGEMM2400" for float64, "SGEMM2400" for float32). For this benchmark, I used code from the MatrixTranspose_standalone package provided by dipak, together with the MatrixMultiply source code in AMD APP SDK 3.0. The code was originally written for float32 (single) precision, but I modified it to handle float64 (double) precision. You can find the complete source code attached to this message (MatrixMultiplyDouble_standalone.zip). The AMD code can also run on the CPU, but that code path is not optimized and does not use multiple cores, so for comparison I also wrote a corresponding DGEMM test in Python/Numpy (matrixmultiply.py, attached). The elapsed times were as follows:

      Hardware                            From year   float32        float64        Software
                                                      (SGEMM2400)    (DGEMM2400)
      Radeon HD 7870                      2012        0.10s          0.37s          AMD Open CL 1.2, Ubuntu 14.04
      Intel i5-3570K @ 3.6GHz (1 core)    2012        2.9s           3.8s           AMD C++, Ubuntu 14.04
      Intel i5-3570K @ 3.6GHz (4 cores)   2012        0.14s          0.28s          Numpy, Ubuntu 16.04
      Intel i7-5600U @ 2.6GHz (2 cores)   2015        0.19s          0.40s          Numpy, Windows 7
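
      To compare these timings with published FLOPS figures, the elapsed times can be converted to an effective rate. The sketch below assumes the conventional count of 2*n^3 floating-point operations for an n x n by n x n product; the function name gemm_gflops is only illustrative and is not part of the attached code.

```python
def gemm_gflops(n, seconds):
    """Effective GFLOP/s for multiplying two n x n matrices.

    Assumes the conventional operation count of 2*n**3
    (n**3 multiplications plus roughly n**3 additions).
    """
    return 2.0 * n**3 / seconds / 1e9


# float64 (DGEMM2400) elapsed times in seconds, copied from the table.
dgemm_seconds = {
    "Radeon HD 7870": 0.37,
    "Intel i5-3570K (1 core)": 3.8,
    "Intel i5-3570K (4 cores)": 0.28,
    "Intel i7-5600U (2 cores)": 0.40,
}

for hardware, seconds in dgemm_seconds.items():
    print(f"{hardware}: {gemm_gflops(2400, seconds):.1f} GFLOP/s")
```

      The timings are given to two significant digits only, so the resulting rates are rough estimates.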

       

      As far as I can see, the AMD C++ code is a solid, straightforward single-core implementation, whereas the Numpy in the Anaconda3 distribution is highly optimized and uses the Intel MKL libraries, which take advantage of special instructions, the cache structure, etc., and efficiently distribute work over multiple cores. This explains the differences between the corresponding benchmark runs.
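
      For readers who want to try the Numpy side without downloading the attachment, a minimal sketch of such a timing test could look like the following. This is not the attached matrixmultiply.py; the function name time_gemm, the repeat count, and the choice to report the best of several runs are my own assumptions.

```python
import time

import numpy as np


def time_gemm(n=2400, dtype=np.float64, repeats=3, seed=0):
    """Time one n x n matrix product; return (best elapsed seconds, product)."""
    rng = np.random.default_rng(seed)
    # Uniformly distributed random numbers between 0 and 10, as in the post.
    a = rng.uniform(0.0, 10.0, size=(n, n)).astype(dtype)
    b = rng.uniform(0.0, 10.0, size=(n, n)).astype(dtype)

    best = float("inf")
    c = None
    for _ in range(repeats):
        start = time.perf_counter()
        c = a @ b  # dispatched to the BLAS library Numpy was built against
        best = min(best, time.perf_counter() - start)
    return best, c


if __name__ == "__main__":
    for dtype in (np.float32, np.float64):
        elapsed, _ = time_gemm(dtype=dtype)
        print(f"{np.dtype(dtype).name}: {elapsed:.2f}s")
```

      Which BLAS library Numpy uses (MKL, OpenBLAS, or another) and how many threads it starts depend on the installation, so results from different Numpy builds are not directly comparable.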

       

      I invite others to inspect and run these benchmarks on other hardware and to post the results in this thread. In particular, I would be very interested in measurements for the more recent GPU families.