3 Replies Latest reply on Apr 24, 2009 2:36 PM by chipf

    ACML 4.1.0 vs MKL 10.q opt2356 E5530 DGEMV()

    brockp

      I am seeing some interesting behavior on the new Intel chips vs. Barcelonas, comparing MKL and ACML for the DGEMV kernel.

       

      Problem size is 1010, OpenMP with 8 cores.

      dgemv()

      CPU       MFlop/s   BLAS Lib
      opt2356       838   ACML 4.1
      E5530        4435   ACML 4.1
      opt2356       858   MKL
      E5530        3743   MKL

       

      The strange thing is that the DGEMM() and DDOT() kernels run at about the same speeds on both systems, with both BLAS libraries.  ACML has issues with dgemm() on the Intel and MKL has issues with dgemm() on the AMD; no surprise there.

       

      I expected the triple-channel memory bandwidth of the Intel to show a 50% improvement in ddot() and similar kernels, but I am not seeing it.

       

      I do like the improved DGEMV() performance of the new Intel platform, and I wish I had tested it on a Shanghai.  I also like how ACML is getting the same DGEMV() performance bumps as MKL.  Portability is nice, I must say.

       

      Any comments would be appreciated.

       
        • ACML 4.1.0 vs MKL 10.q opt2356 E5530 DGEMV()
          chipf

          One thing to keep in mind with the ACML OpenMP version: ACML defaults to 1 thread, depending on which compiler is used.  It's best to explicitly set OMP_NUM_THREADS to the right value (8 in your case).

          MKL defaults to as many threads as cores.

          The current release of ACML has known performance issues with the GEMM kernels running on Intel hardware.  Stay tuned for the next release....

          Finally, DGEMV is largely a streaming problem, and is dependent on system memory bandwidth.  You should find that ACML and MKL performance are close (don't forget number of threads), and track closely to number of memory channels and speed of the DRAM.

            • ACML 4.1.0 vs MKL 10.q opt2356 E5530 DGEMV()
              brockp

              Yeah, I am aware of the default behavior.  Both were compiled with OpenMP (well, not the Intel build, since MKL can thread without the compiler enabling it), and both were run with OMP_NUM_THREADS=8.

               

              It would be nice if ACML were a good default across both major x86-64 vendors, even though it is an AMD product; nice and portable, then.  This is in relation to your comment about DGEMM().

               

              My concern is more about the DGEMV() kernel.  The E5530 should only have about 50% more memory bandwidth than the 2356, yet I'm seeing a 4x improvement on the Intel with both libraries.  That's a big gain.

               

              I should point out that I did a quick test of a C dgemv kernel I wrote, compiled with PGI/7.2 (with a bunch of optimizations turned on); the performance on both systems was only around 800 MFlop/s.  So something both BLAS libraries were doing was the biggest help in hitting 4000 MFlop/s.

               

              Thanks for the input, Chip.

                • ACML 4.1.0 vs MKL 10.q opt2356 E5530 DGEMV()
                  chipf

                  The Opteron model you list has a 512K L2 cache and a 2M L3.  The Intel model has only 256K of L2, but 8M of L3.

                  Consider that DGEMV reads the NxN matrix (assuming square problems) once, with no reuse.  Where the matrix data resides determines how fast the problem can be computed, because the multiply and add can be done much faster than the matrix read operation.  (Unless the data is in L1 cache, in which case DGEMV can start to reach the limits of FPU performance.)

                  The 1010x1010 problem you are using fits in the larger L3 of the Intel processor, but the Opteron has to get it mostly from memory.  If you run a problem of 2K or 4K, you should get results that correlate more with memory bandwidth, since those problem sizes will come from memory on both processors.

                  Plotting DGEMV performance for a range of problem sizes is a useful way to probe the cache hierarchy and the effective bandwidth available on a platform.  The plot shows clear breaks where a problem no longer fits in the lower levels of cache.

                  Since you are asking about 8 threads, you must have at least a 2 socket machine for both.  NUMA effects can come into play, which can affect the AMD platform more.