0 Replies Latest reply on Jul 11, 2013 2:15 AM by capslockwizard

    CafeMol: Poor Performance When Compiled with Open64

    capslockwizard

      I would like to run CafeMol on a machine with two AMD Opteron 6128 processors (16 cores in total). CafeMol supports MPI + OpenMP. When I compile it with Open64, the program scales poorly: it uses at most ~4 cores. If I compile it with gcc 4.8.1, it maxes out all 16 cores.

       

      This is what I have done:

      I installed x86_open64-4.5.2.1-1.rhel5_sles10.x86_64 and compiled Open MPI 1.6.5 with the following configure flags:

      CC=opencc CXX=openCC F77=openf90 FC=openf90 CCFLAGS='-march=auto -O3 -OPT:Ofast -fno-math-errno -ffast-math' CXXFLAGS='-march=auto -O3 -OPT:Ofast -fno-math-errno -ffast-math' F77FLAGS='-march=auto -O3 -OPT:Ofast -fno-math-errno -ffast-math' FCFLAGS='-march=auto -O3 -OPT:Ofast -fno-math-errno -ffast-math'

       

      I compiled CafeMol by setting the following in the Makefile in the src folder:

      FC = mpif90
      FC_UTIL = mpif90
      CPP = -cpp -DTIME -DMPI_PAR -DMPI_PAR2 -DMPI_PAR3 -DMPI_REP
      OPT = -mp -Ofast -fp-accuracy=relaxed -march=auto
      NC =
      LIB =

      Note: I commented out the flush(lunout) call in a2rst.F90 in the src folder.

       

      I then ran an example provided with CafeMol by invoking the following:

      time ./cafemol ./example/sh3/sh3.inp

       

      These are the timings (in seconds) reported by CafeMol and by time:

      -----------------------------------
      -----------------------------------
           force               356.432709
           _force(ope)         339.294219
           _force(mpi)          17.138491
           _force(local)        27.717727
           _force(go)            3.104825
           _force(pnl)           3.926098
           _force(ele)           0.357340
           _force(hp)            0.357872
           mpc                   0.000000
           _mpc(grid)            0.000000
           _mpc(rotate)          0.000000
           _mpc(velo)            0.000000
           neighbor              4.914133
           _neighbor(ope)        4.914133
           _neighbor(mpi)        0.000000
           _neighbor(pnl)        4.877032
           _neighbor(ele)        0.003128
           _neighbor(solv)       0.002947
           _neighbor(hp)         0.003028
           _neighbor(tail)       0.003073
           update                6.992024
           copyxyz               0.354448
           energy                3.294883
           _energy(ope)          3.129858
           _energy(mpi)          0.165025
           rmsd                  0.113921
           random               40.455686
           random(mpi)           0.000000
           replica               0.328312
           _replica(mpi)         0.000000
           stepadjust            0.000000
           _stepadj(mpi)         0.000000
           output                3.245324
           radiusg               0.013489
           muca                  0.332232
           implig                0.305379
      -----------------------------------
                     total     411.868406
      -----------------------------------
                       ope     406.405065
                       mpi      17.303515
                 main_loop     423.708580
      -----------------------------------
      -----------------------------------

       

      real    7m5.049s
      user    19m11.605s
      sys     4m38.510s

       

      ===========================================================================

       

      For comparison, I compiled Open MPI and CafeMol with gcc using the following configuration:

      Open MPI: -O2 -march=native -pipe -fomit-frame-pointer

      CafeMol:

      FC = mpif90
      FC_UTIL = mpif90
      CPP = -cpp -DTIME -DMPI_PAR -DMPI_PAR2 -DMPI_PAR3 -DMPI_REP
      OPT = -O2 -march=native -pipe -fomit-frame-pointer -fno-range-check -ffree-line-length-none -fopenmp
      NC =
      LIB =

       

      I get the following timings:

      -----------------------------------
      -----------------------------------
           force                26.121012
           _force(ope)          14.960998
           _force(mpi)          11.160015
           _force(local)         5.419119
           _force(go)            1.000391
           _force(pnl)           1.493911
           _force(ele)           0.127048
           _force(hp)            0.127328
           mpc                   0.000000
           _mpc(grid)            0.000000
           _mpc(rotate)          0.000000
           _mpc(velo)            0.000000
           neighbor              1.418003
           _neighbor(ope)        1.418003
           _neighbor(mpi)        0.000000
           _neighbor(pnl)        1.255963
           _neighbor(ele)        0.001346
           _neighbor(solv)       0.001254
           _neighbor(hp)         0.001182
           _neighbor(tail)       0.001187
           update                3.259652
           copyxyz               0.127916
           energy                0.266630
           _energy(ope)          0.233834
           _energy(mpi)          0.032796
           rmsd                  0.070642
           random               21.628498
           random(mpi)           0.000000
           replica               0.118458
           _replica(mpi)         0.000000
           stepadjust            0.000000
           _stepadj(mpi)         0.000000
           output                2.184139
           radiusg               0.004041
           muca                  0.122058
           implig                0.121037
      -----------------------------------
                     total      54.024084
      -----------------------------------
                       ope      46.081507
                       mpi      11.192811
                 main_loop      57.274318
      -----------------------------------
      -----------------------------------

       

      real    0m58.732s
      user    14m17.072s
      sys     0m1.727s
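
      As a rough cross-check of the core counts quoted at the top of this post, the time figures can be converted into an average number of busy cores, (user + sys) / real. The Python sketch below is illustrative only; the helper names to_seconds and busy_cores are hypothetical and not part of CafeMol, Open MPI, or Open64.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a shell `time` value such as '7m5.049s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time value: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def busy_cores(real: str, user: str, system: str) -> float:
    """Average number of cores kept busy: (user + sys) / real."""
    return (to_seconds(user) + to_seconds(system)) / to_seconds(real)


# (real, user, sys) copied from the two `time` outputs in this post.
runs = {
    "open64": ("7m5.049s", "19m11.605s", "4m38.510s"),
    "gcc": ("0m58.732s", "14m17.072s", "0m1.727s"),
}

for name, (real, user, system) in runs.items():
    print(f"{name}: {busy_cores(real, user, system):.1f} cores busy on average")
```

      With the figures above this gives roughly 3.4 busy cores for the Open64 build and 14.6 for the gcc build. The Open64 run also spends 4m38s in sys time, against under 2 seconds for gcc.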

       

      ===========================================================================
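
      To see where the extra time goes, the per-section timings from the two tables can be divided row by row. This is a minimal sketch using a subset of the rows; the slowdown helper is hypothetical, and the values are copied from the two timing tables in this post.

```python
# Timings copied from the CafeMol tables in this post (subset of rows).
open64 = {
    "force": 356.432709,
    "_force(ope)": 339.294219,
    "_force(mpi)": 17.138491,
    "neighbor": 4.914133,
    "update": 6.992024,
    "energy": 3.294883,
    "random": 40.455686,
    "output": 3.245324,
    "total": 411.868406,
}
gcc = {
    "force": 26.121012,
    "_force(ope)": 14.960998,
    "_force(mpi)": 11.160015,
    "neighbor": 1.418003,
    "update": 3.259652,
    "energy": 0.266630,
    "random": 21.628498,
    "output": 2.184139,
    "total": 54.024084,
}


def slowdown(slow: dict, fast: dict) -> dict:
    """Ratio slow/fast for every section present in both tables."""
    return {
        name: slow[name] / fast[name]
        for name in slow
        if name in fast and fast[name] > 0
    }


ratios = slowdown(open64, gcc)
# Print the sections with the largest slowdown first.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<14}{ratio:6.1f}x")
```

      The largest ratio is in _force(ope), about 22.7x, while the total is about 7.6x slower with the Open64 build.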

      I also tried compiling CafeMol with Open64 using less aggressive optimization flags (-O2 -march=auto), but it made no difference.

       

      Am I doing something wrong, or is there a bug in Open64's OpenMP implementation?

       

      Thanks