Hello all,

I have a big performance problem with BLAS in the newest ACML library on Intel processors. I have an Intel Core i7-2620M CPU. The theoretical peak double-precision FP performance with AVX on a single CPU core is 27 GFLOPs (the CPU has a 3.4 GHz clock in Turbo mode). I tested the DGEMM function (dense matrix-matrix multiplication). With the Intel MKL library I achieve 43 GFLOPs running on 2 cores, which is a decent 80% of the two-core peak. With ACML I only get 9 GFLOPs on 1 core and 17 GFLOPs on 2 cores. I ran the tests in MATLAB. Here are the detailed results (3 runs for each of the 4 matrix sizes tested):

% Intel(R) Math Kernel Library Version 10.3.11 Product Build 20120606 for Intel(R) 64 architecture applications

% dim 1000, dgemm [gflops]: 31.4 37.3 37.0

% dim 2000, dgemm [gflops]: 41.6 40.9 41.8

% dim 3000, dgemm [gflops]: 42.8 42.9 43.0

% dim 4000, dgemm [gflops]: 43.3 42.9 42.9

% AMD Core Math Library(TM) Version 5.3.1.182

% dim 1000, dgemm [gflops]: 9.9 16.3 15.0

% dim 2000, dgemm [gflops]: 16.3 15.8 16.8

% dim 3000, dgemm [gflops]: 17.1 17.0 16.9

% dim 4000, dgemm [gflops]: 15.9 16.1 15.4
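
For anyone who wants to check the peak and efficiency figures quoted above, here is a minimal sketch of the arithmetic in Python. It assumes 8 double-precision FLOPs per cycle per core (one 4-wide AVX add plus one 4-wide AVX multiply per cycle on this CPU generation) and the usual 2*n^3 operation count for DGEMM. The function names are made up for illustration and are not part of MKL, ACML, or MATLAB.

```python
# Assumption: one 4-wide AVX add and one 4-wide AVX multiply per cycle,
# i.e. 8 double-precision FLOPs per cycle per core.
FLOPS_PER_CYCLE = 8
TURBO_GHZ = 3.4  # Turbo clock quoted in the post


def peak_gflops(cores, ghz=TURBO_GHZ, flops_per_cycle=FLOPS_PER_CYCLE):
    """Theoretical peak in GFLOPs for the given number of cores."""
    return cores * ghz * flops_per_cycle


def dgemm_gflops(n, seconds):
    """GFLOPs achieved by an n-by-n DGEMM, counting 2*n^3 operations."""
    return 2.0 * n ** 3 / seconds / 1e9


def efficiency(measured_gflops, cores):
    """Fraction of the theoretical peak that was actually reached."""
    return measured_gflops / peak_gflops(cores)


if __name__ == "__main__":
    print(f"peak, 1 core:  {peak_gflops(1):.1f} GFLOPs")
    print(f"peak, 2 cores: {peak_gflops(2):.1f} GFLOPs")
    # Best measured values reported in the post, both on 2 cores.
    print(f"MKL  efficiency: {efficiency(43.0, 2):.0%}")
    print(f"ACML efficiency: {efficiency(17.0, 2):.0%}")
```

With the numbers from the post this gives about 27.2 GFLOPs per core and 54.4 GFLOPs for two cores, so MKL reaches roughly 79% of peak and ACML roughly 31%. The Turbo clock with both cores active may be lower than 3.4 GHz, in which case the real peak is somewhat lower and both efficiencies somewhat higher.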

Does anyone know the reason for this poor performance? I have seen this (old) thread posted earlier: http://devgurus.amd.com/thread/104976. Some people from AMD claimed that efforts were being made to make sure ACML runs efficiently on other platforms. Is this still the case?

Thanks!

Hi Marcink~

I do not expect such poor performance; this will have to be debugged on our side.

How are you swapping the BLAS library underneath MATLAB? Is there a script associated with your timings?

If you have the time and are interested, could you try your tests with ACML 4.4.0?

http://developer.amd.com/tools-and-sdks/cpu-development/amd-core-math-library-acml/acml-archive-downloads/

Kent