
Cafemol: Poor Performance When Compiled with Open64

Question asked by capslockwizard on Jul 11, 2013

I would like to run CafeMol on a machine with two AMD Opteron 6128 processors. CafeMol supports MPI + OpenMP. When I compile it with Open64, the program scales poorly: it uses about 4 cores at most. When I compile it with gcc 4.8.1, it can max out all 16 cores.

 

This is what I have done:

I installed x86_open64-4.5.2.1-1.rhel5_sles10.x86_64 and compiled Open MPI 1.6.5 with the following configure flags:

CC=opencc CXX=openCC F77=openf90 FC=openf90 CCFLAGS='-march=auto -O3 -OPT:Ofast -fno-math-errno -ffast-math' CXXFLAGS='-march=auto -O3 -OPT:Ofast -fno-math-errno -ffast-math' F77FLAGS='-march=auto -O3 -OPT:Ofast -fno-math-errno -ffast-math' FCFLAGS='-march=auto -O3 -OPT:Ofast -fno-math-errno -ffast-math'

 

I compiled CafeMol with the following settings in the Makefile in the src folder:

FC = mpif90

FC_UTIL = mpif90

CPP = -cpp -DTIME -DMPI_PAR -DMPI_PAR2 -DMPI_PAR3 -DMPI_REP

OPT = -mp -Ofast -fp-accuracy=relaxed -march=auto

NC =

LIB =

Note: I commented out the flush(lunout) call in a2rst.F90 in the src folder.

 

I then ran an example provided with CafeMol by invoking the following:

time ./cafemol ./example/sh3/sh3.inp

 

These are the timings reported by CafeMol and by time:

-----------------------------------

-----------------------------------

     force               356.432709

     _force(ope)         339.294219

     _force(mpi)          17.138491

     _force(local)        27.717727

     _force(go)            3.104825

     _force(pnl)           3.926098

     _force(ele)           0.357340

     _force(hp)            0.357872

     mpc                   0.000000

     _mpc(grid)            0.000000

     _mpc(rotate)          0.000000

     _mpc(velo)            0.000000

     neighbor              4.914133

     _neighbor(ope)        4.914133

     _neighbor(mpi)        0.000000

     _neighbor(pnl)        4.877032

     _neighbor(ele)        0.003128

     _neighbor(solv)       0.002947

     _neighbor(hp)         0.003028

     _neighbor(tail)       0.003073

     update                6.992024

     copyxyz               0.354448

     energy                3.294883

     _energy(ope)          3.129858

     _energy(mpi)          0.165025

     rmsd                  0.113921

     random               40.455686

     random(mpi)           0.000000

     replica               0.328312

     _replica(mpi)         0.000000

     stepadjust            0.000000

     _stepadj(mpi)         0.000000

     output                3.245324

     radiusg               0.013489

     muca                  0.332232

     implig                0.305379

-----------------------------------

               total     411.868406

-----------------------------------

                 ope     406.405065

                 mpi      17.303515

           main_loop     423.708580

-----------------------------------

-----------------------------------

 

real    7m5.049s

user    19m11.605s

sys     4m38.510s
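
The `time` figures above give a rough check on the "~4 cores" estimate: dividing the total CPU time (user + sys) by the wall-clock time (real) gives the average number of cores kept busy. The sketch below is only an illustration. The helper names `parse_time`, `busy_cores` and `sys_fraction` are made up for this example and are not part of CafeMol or Open64. CPU time also counts threads that spin while waiting, so the result is an upper bound on useful parallelism.

```python
import re


def parse_time(value):
    """Convert a bash `time` value such as '7m5.049s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value.strip())
    if match is None:
        raise ValueError(f"unrecognised time value: {value!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def busy_cores(real, user, sys):
    """Average number of busy cores: (user + sys) / real."""
    return (parse_time(user) + parse_time(sys)) / parse_time(real)


def sys_fraction(user, sys):
    """Share of the total CPU time that was spent in the kernel."""
    user_s, sys_s = parse_time(user), parse_time(sys)
    return sys_s / (user_s + sys_s)


# Figures copied from the Open64 run above.
open64 = {"real": "7m5.049s", "user": "19m11.605s", "sys": "4m38.510s"}

print(f"busy cores:   {busy_cores(**open64):.2f}")
print(f"kernel share: {sys_fraction(open64['user'], open64['sys']):.1%}")
```

For the Open64 run this works out to roughly 3.4 busy cores, with about a fifth of the CPU time spent in the kernel.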

 

===========================================================================

 

Using gcc to compile Open MPI and CafeMol with the following configuration:

open-mpi: -O2 -march=native -pipe -fomit-frame-pointer

Cafemol:

FC = mpif90

FC_UTIL = mpif90

CPP = -cpp -DTIME -DMPI_PAR -DMPI_PAR2 -DMPI_PAR3 -DMPI_REP

OPT = -O2 -march=native -pipe -fomit-frame-pointer -fno-range-check -ffree-line-length-none -fopenmp

NC =

LIB =

 

I get the following timings:

-----------------------------------

-----------------------------------

     force                26.121012

     _force(ope)          14.960998

     _force(mpi)          11.160015

     _force(local)         5.419119

     _force(go)            1.000391

     _force(pnl)           1.493911

     _force(ele)           0.127048

     _force(hp)            0.127328

     mpc                   0.000000

     _mpc(grid)            0.000000

     _mpc(rotate)          0.000000

     _mpc(velo)            0.000000

     neighbor              1.418003

     _neighbor(ope)        1.418003

     _neighbor(mpi)        0.000000

     _neighbor(pnl)        1.255963

     _neighbor(ele)        0.001346

     _neighbor(solv)       0.001254

     _neighbor(hp)         0.001182

     _neighbor(tail)       0.001187

     update                3.259652

     copyxyz               0.127916

     energy                0.266630

     _energy(ope)          0.233834

     _energy(mpi)          0.032796

     rmsd                  0.070642

     random               21.628498

     random(mpi)           0.000000

     replica               0.118458

     _replica(mpi)         0.000000

     stepadjust            0.000000

     _stepadj(mpi)         0.000000

     output                2.184139

     radiusg               0.004041

     muca                  0.122058

     implig                0.121037

-----------------------------------

               total      54.024084

-----------------------------------

                 ope      46.081507

                 mpi      11.192811

           main_loop      57.274318

-----------------------------------

-----------------------------------

 

real    0m58.732s

user    14m17.072s

sys     0m1.727s
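
To see where the two builds differ, the per-section timers from both listings can be compared directly. The sketch below copies a subset of the values from the two tables above and prints the Open64/gcc ratio for each timer. The dictionary and function names are chosen for this illustration only.

```python
# Timer values in seconds, copied from the two CafeMol listings above.
OPEN64 = {
    "force": 356.432709,
    "_force(ope)": 339.294219,
    "_force(mpi)": 17.138491,
    "_force(local)": 27.717727,
    "neighbor": 4.914133,
    "update": 6.992024,
    "energy": 3.294883,
    "random": 40.455686,
    "output": 3.245324,
    "total": 411.868406,
}
GCC = {
    "force": 26.121012,
    "_force(ope)": 14.960998,
    "_force(mpi)": 11.160015,
    "_force(local)": 5.419119,
    "neighbor": 1.418003,
    "update": 3.259652,
    "energy": 0.266630,
    "random": 21.628498,
    "output": 2.184139,
    "total": 54.024084,
}


def slowdown(slow, fast):
    """Return {timer: slow / fast} for timers that are non-zero in `fast`."""
    return {
        name: slow[name] / fast[name]
        for name in slow
        if name in fast and fast[name] > 0.0
    }


ratios = slowdown(OPEN64, GCC)

# Print the timers with the largest Open64/gcc ratio first.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<15}{OPEN64[name]:>12.3f}{GCC[name]:>12.3f}{ratio:>8.1f}x")
```

With these numbers, `_force(ope)` shows the largest ratio (about 23x), while `_force(mpi)` and `random` stay within a factor of two. This suggests the slowdown is concentrated in the `_force(ope)` timer rather than in the MPI timers.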

 

===========================================================================

I also tried compiling CafeMol with less aggressive optimization flags (-O2 -march=auto), but it made no difference.

 

Am I doing something wrong, or is there a bug in Open64's OpenMP implementation?

 

Thanks
