We just got a new cluster with AMD 6276 CPUs with 16 cores.
We link against the ACML library 5.1.0 with -L/sopt/acml5.1.0/ifort64_fma4_mp_int64/lib -lacml_mp
and use g++ to compile our test program, which multiplies real matrices in double precision using dgemm.
However, on 16 cores we only get 60 GFLOPS, which is about 4 times lower than the theoretical GFLOPS of the AMD 6276.
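
For reference, a GFLOPS figure like this is usually derived from the matrix size and the elapsed wall-clock time, counting 2*n^3 floating-point operations for an n x n multiply. The sketch below is illustrative only and is not the actual test program: the function names are made up, and naive_dgemm is a slow stand-in for the library dgemm call so that the snippet has no ACML dependency. It times with a wall clock on purpose, since a CPU-time clock such as clock() adds up time across all threads and would understate the rate of a multithreaded library.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// FLOP count for C = A*B with square n x n matrices:
// n^3 multiplications plus n^3 additions.
double dgemm_flops(std::size_t n) {
    double d = static_cast<double>(n);
    return 2.0 * d * d * d;
}

// Converts a FLOP count and elapsed wall-clock seconds into GFLOPS.
double to_gflops(double flops, double seconds) {
    return flops / seconds / 1e9;
}

// Runs any callable once and returns the elapsed wall-clock seconds.
template <typename F>
double time_seconds(F&& work) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Naive column-major multiply c = a * b (hypothetical stand-in for dgemm).
void naive_dgemm(std::size_t n,
                 const std::vector<double>& a,
                 const std::vector<double>& b,
                 std::vector<double>& c) {
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < n; ++i) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i + k * n] * b[k + j * n];
            }
            c[i + j * n] = sum;
        }
    }
}
```

In a real benchmark the lambda passed to time_seconds would wrap the library call, and the result would be to_gflops(dgemm_flops(n), seconds) for a matrix large enough that the run takes at least a few seconds.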
The AMD 6276 is supposed to support FMA4 instructions, and from what I found on the web it should deliver 8 double-precision FLOPs per clock (DP/clock).
Theoretically we should get 16 cores * 2.3 GHz * 8 DP/clock * 0.85 efficiency ~ 250 GFLOPS.
It looks as if our CPU only delivers 2 DP/clock, similar to the AMD 6275 CPU.
But I ran cat /proc/cpuinfo and it does say AMD 6276.
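
Besides the model name, the flags line of /proc/cpuinfo on Linux lists the instruction-set extensions the kernel detected, so it can also be checked for fma4. The following is a minimal sketch with made-up helper names. It only confirms that the CPU advertises the instruction set, not that the library actually uses it.

```cpp
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>

// Returns true if `flag` appears as a whole word in a space-separated flags line.
bool has_cpu_flag(const std::string& flags_line, const std::string& flag) {
    std::istringstream in(flags_line);
    std::string token;
    while (in >> token) {
        if (token == flag) return true;
    }
    return false;
}

// Scans a cpuinfo-style file for the first "flags" line and looks for `flag`.
// Returns false if the file cannot be read or has no flags line.
bool cpuinfo_has_flag(const std::string& flag,
                      const std::string& path = "/proc/cpuinfo") {
    std::ifstream file(path);
    std::string line;
    while (std::getline(file, line)) {
        if (line.compare(0, 5, "flags") == 0) {
            std::size_t colon = line.find(':');
            if (colon == std::string::npos) continue;
            return has_cpu_flag(line.substr(colon + 1), flag);
        }
    }
    return false;
}
```

Calling cpuinfo_has_flag("fma4") on the compute node itself is the intended use. Matching whole words matters here because a plain substring search for "fma" would also match "fma4".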
When I test on an Intel Sandy Bridge CPU with the Intel MKL library, I get 8 DP/clock. For an AMD Magny-Cours at 2.1 GHz with the ACML library, I get 4 DP/clock.
Both are as expected. But for the AMD 6276, the result is 4 times lower than the expected value.
So I am wondering whether we are compiling the program correctly.
I am using the ifort-compiled ACML library with g++ to compile our program. Should I use the gfortran-compiled ACML library, or something else?
Thanks in advance for any comments.