One thing to keep in mind with the OpenMP version of ACML: depending on which compiler it was built with, ACML may default to a single thread. It's best to explicitly set OMP_NUM_THREADS to the right value (8 in your case).
MKL defaults to as many threads as cores.
The current release of ACML has known performance issues with the GEMM kernels running on Intel hardware. Stay tuned for the next release.
Finally, DGEMV is largely a streaming problem and is limited by system memory bandwidth. You should find that ACML and MKL performance are close (don't forget to match the number of threads) and track closely with the number of memory channels and the speed of the DRAM.
Yeah, I am aware of the default behavior. Both were compiled with OpenMP (well, not the Intel build, since MKL can thread without the compiler enabling it), and both runs had OMP_NUM_THREADS=8 set.
It would be nice if ACML were a good default across both major x86-64 vendors even though it is an AMD product; that would make it nicely portable. This is in relation to your comment about DGEMM().
My concern is more about the behavior of the DGEMV() kernel. The E5530 should have only about 50% more memory bandwidth than the 2356, so a 4x improvement on the Intel system for both libraries is a big gain.
I should point out that I did a quick test of a C dgemv kernel I wrote, compiled with PGI 7.2 and a bunch of optimizations turned on; on both systems it only reached around 800 MFlop/s. So whatever both BLAS libraries are doing internally is what gets them to 4000 MFlop/s.
Thanks for the input, Chip.
The Opteron model you list has a 512 KB L2 cache and a 2 MB L3. The Intel model has only 256 KB of L2, but 8 MB of L3.
Consider that DGEMV reads the NxN matrix (assuming square problems) once, with no reuse. Where the matrix data resides determines how fast the problem can be computed, because the multiply and add can be done much faster than the matrix can be read. (Unless the data is in L1 cache, in which case DGEMV can start to reach the limit of FPU performance.)
The 1010x1010 problem you are using fits in the Intel processor's larger L3, but the Opteron has to fetch it mostly from memory. If you run a 2K or 4K problem, you should get results that correlate more with memory bandwidth, since those problem sizes will come from memory on both processors.
Plotting DGEMV performance for a range of problem sizes is a useful way to probe the cache hierarchy and effective bandwidth available from a platform. The plot provides clear breaks where a problem no longer fits in lower levels of cache.
Since you are asking about 8 threads, you must have at least a 2-socket machine in both cases. NUMA effects can come into play, and they tend to affect the AMD platform more.