We just got a new cluster with 16-core AMD Opteron 6276 CPUs.

We link against the ACML library 5.1.0 with -L/sopt/acml5.1.0/ifort64_fma4_mp_int64/lib -lacml_mp

We use g++ to compile our test program, which multiplies real double-precision matrices using dgemm.
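For reference, a minimal sketch of how such a test turns a timed matrix multiplication into a GFLOPS figure. It counts 2*m*n*k floating-point operations per product. The function naive_dgemm is a hypothetical stand-in so the sketch compiles without any BLAS library; it is not the ACML dgemm and will run far slower. All function names here are made up for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Operation count for C = A*B with A (m x k) and B (k x n):
// one multiply and one add per inner-loop step.
double gemm_flops(std::size_t m, std::size_t n, std::size_t k) {
    return 2.0 * static_cast<double>(m) * static_cast<double>(n) *
           static_cast<double>(k);
}

// Convert an operation count and elapsed wall time into GFLOPS.
double gflops(double flops, double seconds) {
    assert(seconds > 0.0);
    return flops / seconds / 1e9;
}

// Hypothetical stand-in for the library call: naive column-major C = A*B.
void naive_dgemm(std::size_t m, std::size_t n, std::size_t k,
                 const std::vector<double>& a, const std::vector<double>& b,
                 std::vector<double>& c) {
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < m; ++i) {
            double sum = 0.0;
            for (std::size_t p = 0; p < k; ++p) {
                sum += a[i + p * m] * b[p + j * k];
            }
            c[i + j * m] = sum;
        }
    }
}

// Time one product of two n x n matrices and return the achieved GFLOPS.
double time_square_multiply(std::size_t n) {
    std::vector<double> a(n * n, 1.0), b(n * n, 2.0), c(n * n, 0.0);
    auto start = std::chrono::steady_clock::now();
    naive_dgemm(n, n, n, a, b, c);
    auto stop = std::chrono::steady_clock::now();
    // Every entry of C should be 2*n; checking one also keeps the result in use.
    assert(c[0] == 2.0 * static_cast<double>(n));
    double seconds = std::chrono::duration<double>(stop - start).count();
    return gflops(gemm_flops(n, n, n), seconds);
}
```

In a real test the call to naive_dgemm would be replaced by the library dgemm, and the matrices should be large enough that the run takes at least a few seconds.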

However, on 16 cores we only get 60 GFLOPS, which is about 4 times lower than the theoretical GFLOPS of the AMD 6276.

The AMD 6276 is supposed to have the FMA4 instruction, and according to some information I found on the web it should do 8 double-precision operations per clock (8 DP/clock).

Theoretically it should reach 16 cores * 2.3 GHz (frequency) * 8 DP/clock * 0.85 efficiency ~ 250 GFLOPS.
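The estimate above is just a product of four factors, which can be sketched as a small helper. The function name and the argument checks are my own, and the 0.85 efficiency is the assumed fraction of peak from the line above, not a measured value.

```cpp
#include <cassert>

// Estimated sustained GFLOPS =
//   units * clock (GHz) * DP FLOPs per clock per unit * efficiency.
double estimated_gflops(int units, double clock_ghz, double dp_per_clock,
                        double efficiency) {
    assert(units > 0 && clock_ghz > 0.0 && dp_per_clock > 0.0);
    assert(efficiency > 0.0 && efficiency <= 1.0);
    return units * clock_ghz * dp_per_clock * efficiency;
}
```

Calling estimated_gflops(16, 2.3, 8.0, 0.85) gives about 250.24, which is where the ~250 GFLOPS figure comes from.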

It looks as if our CPU only does 2 DP/clock, similar to the AMD 6275 CPU.
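The 2 DP/clock figure can be recovered by running the estimate backwards from the measured number. A minimal sketch, with a hypothetical function name and the same assumed 0.85 efficiency:

```cpp
#include <cassert>

// Effective DP FLOPs per clock per core implied by a measured GFLOPS figure.
// Pass efficiency = 1.0 to get the raw value without any efficiency correction.
double dp_per_clock_per_core(double measured_gflops, int cores,
                             double clock_ghz, double efficiency = 1.0) {
    assert(cores > 0 && clock_ghz > 0.0);
    assert(efficiency > 0.0 && efficiency <= 1.0);
    return measured_gflops / (cores * clock_ghz * efficiency);
}
```

With 60 GFLOPS on 16 cores at 2.3 GHz this gives about 1.63 raw, or about 1.92 after dividing by the 0.85 efficiency, i.e. roughly 2 DP/clock per core.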

But I ran cat /proc/cpuinfo and it does say AMD 6276 CPU.

When I test on an Intel Sandy Bridge CPU with the Intel MKL library, I get 8 DP/clock. For an AMD Magny-Cours at 2.1 GHz with the ACML library, I get 4 DP/clock.

These are as expected. But for the AMD 6276, the result is 4 times smaller than the expected value.

So I am wondering whether we are compiling the program correctly.

I am using the ifort-compiled ACML library with g++ to compile our program. Should I use the gfortran-compiled ACML library, or something else?

Thanks in advance for any comments.

I think I have more or less figured out why. The AMD 6276 CPU has 16 integer units, but only 8 floating-point units. Thus

8 (FPU units) * 2.3 GHz * 8 DP/clock * 0.85 efficiency ~ 125 GFLOPS.

So the test result is OK. I was just deceived by the 16-core count: for floating-point calculation, only the 8 floating-point units take part in the dgemm test.
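A small sketch of the revised estimate, counting floating-point units instead of integer cores. The struct and function names are hypothetical, and the value of 2 integer cores per FP unit is the assumption taken from the reasoning above.

```cpp
#include <cassert>

// Minimal description of a CPU where several integer cores share one FP unit.
struct CpuSpec {
    int integer_cores;    // cores reported by the OS
    int cores_per_fpu;    // integer cores sharing one FP unit (assumed 2 here)
    double clock_ghz;     // clock frequency in GHz
    double dp_per_clock;  // DP FLOPs per clock per FP unit
};

// Number of FP units that actually take part in a floating-point workload.
int fp_units(const CpuSpec& cpu) {
    assert(cpu.cores_per_fpu > 0);
    assert(cpu.integer_cores % cpu.cores_per_fpu == 0);
    return cpu.integer_cores / cpu.cores_per_fpu;
}

// Estimated sustained GFLOPS based on FP units rather than integer cores.
double estimated_gflops(const CpuSpec& cpu, double efficiency) {
    assert(efficiency > 0.0 && efficiency <= 1.0);
    return fp_units(cpu) * cpu.clock_ghz * cpu.dp_per_clock * efficiency;
}
```

For CpuSpec{16, 2, 2.3, 8.0} and an efficiency of 0.85, fp_units returns 8 and the estimate comes out at about 125.12 GFLOPS.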