5 Replies Latest reply on Feb 18, 2013 2:16 PM by chandra

    AMD 6276 with ACML library has lower GFLOPS?


      We just got a new cluster with 16-core AMD Opteron 6276 CPUs.

      We link against ACML 5.1.0 with -L/sopt/acml5.1.0/ifort64_fma4_mp_int64/lib -lacml_mp

      and use g++ to compile our test program, which multiplies real double-precision matrices with dgemm.
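      The test program itself isn't posted; for reference, the same measurement can be sketched in Python with NumPy (whose matmul dispatches to whatever BLAS dgemm NumPy was built against, not necessarily ACML):

```python
import time
import numpy as np

def dgemm_gflops(n, repeats=3):
    """Time an n x n double-precision matrix product and report GFLOPS.

    A square dgemm performs roughly 2*n**3 floating-point operations.
    """
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        a @ b  # dispatched to the BLAS dgemm NumPy is linked against
        best = min(best, time.perf_counter() - t0)
    return 2.0 * n**3 / best / 1e9

print(f"{dgemm_gflops(500):.1f} GFLOPS")
```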


      However, with 16 cores we only get 60 GFLOPS, which is 4 times slower than the theoretical peak of the AMD 6276.

      The AMD 6276 is supposed to support the FMA4 instructions, and according to some web references it should sustain 8 double-precision FLOPs per clock (DP/clock).

      Theoretically it should therefore reach 16 cores * 2.3 GHz (frequency) * 8 DP/clock * 0.85 efficiency ≈ 250 GFLOPS.
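      Spelling out that arithmetic (the 0.85 is an assumed efficiency factor):

```python
cores = 16
ghz = 2.3          # clock frequency in GHz
dp_per_clock = 8   # double-precision FLOPs per clock with FMA4
efficiency = 0.85  # assumed fraction of peak achievable by dgemm

peak = cores * ghz * dp_per_clock * efficiency
print(f"{peak:.1f} GFLOPS")  # ~250 GFLOPS
```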

      It looks as if our CPU only achieves 2 DP/clock, similar to the AMD 6275 CPU.

      But when I cat /proc/cpuinfo it does report an AMD 6276 CPU.


      When I test an Intel Sandy Bridge CPU with the Intel MKL library, I get 8 DP/clock. For a 2.1 GHz AMD Magny-Cours with the ACML library, I get 4 DP/clock.

      Those are as expected, but the AMD 6276 result is 4 times lower than the expected value.

      So I am wondering whether we are compiling the program correctly.

      I am linking the ifort-compiled ACML library with g++. Should I use the gfortran-compiled ACML library, or something else?

      Thanks in advance for any comments.

        • Re: AMD 6276 with ACML library has lower GFLOPS?

          I notice you are using the shared-object library.  You have most likely run into an interaction between the Linux address space layout randomization (ASLR) feature and the Bulldozer architecture.


          One possible workaround is to use the static library instead of the shared-object library.  There are other workarounds that involve turning off ASLR on your Linux machine, as described here: http://gcc.gnu.org/wiki/Randomization
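          For reference, on most Linux distributions ASLR can also be switched off with sysctl (a sketch; requires root, and the exact mechanism may vary by distribution):

```shell
# Disable address space layout randomization for the running system
# (0 = no randomization)
sysctl -w kernel.randomize_va_space=0

# To make the setting persistent across reboots:
echo 'kernel.randomize_va_space = 0' >> /etc/sysctl.conf
```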

            • Re: AMD 6276 with ACML library has lower GFLOPS?

              Hi chip:


              Thanks so much for the reply.


              Now I link the static library with

              -L/sopt/acml5.1.0/ifort64_fma4_mp_int64/lib -static -lacml_mp -lifcore

              and compile with icpc -openmp.


              I just realized that -lifcore and -openmp are needed to resolve many undefined references.

              I also turned off the randomization with:

              setarch i386 -RL bash


              With that everything compiles, but the new issue is a segmentation fault when I run the dgemm test, so it looks like something is still not right. Would you mind telling me whether I am linking the static library correctly?

                • Re: AMD 6276 with ACML library has lower GFLOPS?

                  I haven't used the setarch command; instead I just modify /etc/sysctl.conf.

                  But I think the command you want would be

                  setarch x86_64 -R bash


                  The i386 option will set the bash shell to 32-bit mode. Definitely not what you want!


                  You can find examples of static linking by building the examples in the example directory.  Some of them use static linking, and you can adapt the same syntax to build your application.

                    • Re: AMD 6276 with ACML library has lower GFLOPS?

                      Hi Chip:


                      Thanks for your information.

                      I went to the examples directory and ran

                      "make OMP_NUM_THREADS=16"

                      Once the dgemm.f90 test compiled and ran, I changed the matrix dimension in the dgemm.f90 file and reran with 1, 2, 4, 8 and 16 threads.


                      # Results with 1 threads

                                         500   12114.2214078450

                                        1000   14491.3563624075

                                        1500   14505.0838320353

                                        2000   15243.0592346968

                                        2500   14764.5782145112

                                        3000   15277.2349384843

                                        3500   15074.6470556887

                                        4000   15010.9390629373

                                        4500   15255.9165268833

                                        5000   15303.5362981342

                                        5500   15270.7656188921


                      # Results with 2 threads

                                         500   19355.8991631980

                                        1000   27127.8230196018

                                        1500   27993.1120853415

                                        2000   29082.0422934724

                                        2500   28655.2092246622

                                        3000   29018.1272350259

                                        3500   29128.1074229737

                                        4000   29633.0058304796

                                        4500   29206.1158587998

                                        5000   29916.8293547636

                                        5500   30037.2119857454


                      # Results with 4 threads

                                         500   23848.8107504334

                                        1000   50735.4174867835

                                        1500   54055.9587284757

                                        2000   56834.8582877690

                                        2500   55963.1206867029

                                        3000   57305.4206142899

                                        3500   57305.1666835484

                                        4000   58959.6169595254

                                        4500   57472.1922343080

                                        5000   57897.3853707974

                                        5500   59689.6031785276


                      # Results with 8 threads

                                         500   17722.8881814069

                                        1000   48142.3480650363

                                        1500   57848.7371050773

                                        2000   63192.1670086192

                                        2500   62777.6171876108

                                        3000   64182.2241931659

                                        3500   65109.5786730007

                                        4000   67984.3496712820

                                        4500   64377.4637299524

                                        5000   64382.9262905409

                                        5500   65798.0990399046


                      # Results with 16 threads

                                         500   7644.30919249279

                                        1000   62210.6253321853

                                        1500   89909.4396160411

                                        2000   105961.347671255

                                        2500   107723.920214721

                                        3000   109808.363663827

                                        3500   111816.793815249

                                        4000   118117.480471053

                                        4500   113015.192942133

                                        5000   115785.588841259

                                        5500   121489.889557355

                      I think the first column is the matrix size and the second column the GFLOPS.


                      As you can see, a single thread gets 15 GFLOPS ≈ 2.3 GHz * 8 FLOPs/clock * 0.8 efficiency,

                      so it does achieve the expected 8 FLOPs per clock. 2 threads give 30 GFLOPS and 4 threads give 60 GFLOPS.

                      But unfortunately 8 threads also give only ~60 GFLOPS instead of 120, and 16 threads show the

                      same issue: ~120 GFLOPS instead of 240. BTW, I already use

                      setarch x86_64 -R bash

                      to disable the address space randomization.
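                      Reading the best (n = 5500) rows above as GFLOPS, the parallel efficiency against an 18.4 GFLOPS ideal per core (2.3 GHz x 8 FLOPs/clock) drops sharply past 4 threads; this is just arithmetic on the posted numbers:

```python
# Best measured GFLOPS at n = 5500 for each thread count (from the tables above)
measured = {1: 15.3, 2: 30.0, 4: 59.7, 8: 65.8, 16: 121.5}
per_core_peak = 2.3 * 8  # GHz * DP FLOPs/clock = 18.4 GFLOPS ideal per core

for threads, gflops in measured.items():
    eff = gflops / (threads * per_core_peak)
    print(f"{threads:2d} threads: {eff:.0%} of ideal")
```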


                      So it seems that the AMD 6276 CPU can't work efficiently with more than 4 threads.

                      Do you also have an AMD 6276 CPU, and could you run the dgemm test?

                      I just want to make sure that nothing is wrong with our CPU.


                      Thanks if you could run the test.

                        • Re: AMD 6276 with ACML library has lower GFLOPS?

                          I think I have figured out why. The AMD 6276 has 16 integer cores but only 8 floating-point units (one FPU shared by each pair of cores in a Bulldozer module). Thus

                          8 (FPU units) * 2.3 GHz * 8 DP/clock * 0.85 efficiency ≈ 125 GFLOPS,

                          which matches the ~120 GFLOPS measured. So the test result is OK. I was just misled by the "16 cores": for floating-point calculation only the 8 floating-point units participate in the dgemm test.
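                          Redoing the peak calculation with 8 shared FPUs instead of 16 cores reproduces the measured ceiling:

```python
fpu_units = 8      # one shared FPU per Bulldozer module (pair of integer cores)
ghz = 2.3
dp_per_clock = 8   # with FMA4
efficiency = 0.85  # assumed achievable fraction of peak

peak = fpu_units * ghz * dp_per_clock * efficiency
print(f"{peak:.1f} GFLOPS")  # close to the ~120 GFLOPS measured with 16 threads
```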