12 Replies Latest reply on Mar 5, 2010 5:39 PM by jkong

    open64 compiler and specOMP

    jkong
      Performance benchmarking with open64 and acml_mv

      I am comparing the latest open64 and the ACML math library against Sun Studio 12 update 1 (ss12u1). The results show that open64-generated executables are 10-25% slower than those from ss12u1. Only on fma3d_m does open64 give a better result.

      The options I use are:

      FOPTIMIZE = -Ofast -mp -HP -mso -LNO:prefetch=3 -LNO:simd=1 -LNO:vintr=1 -march=barcelona
      COPTIMIZE = -Ofast -mp -HP -mso -LNO:prefetch=3 -LNO:simd=1 -LNO:vintr=1 -march=barcelona
      EXTRA_LDFLAGS =
      EXTRA_LIBS= -I/opt/acml4.3.0/open64_64/include -L/opt/acml4.3.0/open64_64/lib -lacml_mv

      Has anyone run specOMP with open64? I searched the SPEC results site and could not find any results using open64.

      Thanks,

      Jun

        • open64 compiler and specOMP
          santosh-zan

           

          Thank you for letting us know.  We are in the process of measuring the SPEC OMP numbers and will get back to you soon.

          In the meantime, can you share the SPEC config file you used for Sun Studio in your study?



           

          regards

          santosh

            • open64 compiler and specOMP
              jkong

              Thank you for the reply. 

              Attached please find my configuration file for using ss12u1. Please ignore the peak section because I am still working on that.

              Thanks again,

              Jun

              [jkong@view ~]$ cat 3leaf.cfg
              # Invocation command line:
              # runspec -c 3leaf.cfg --noreportable medium
              ############################################################################
              ############################################################################
              #
              # VENDOR = 3Leaf
              action = validate
              tune = base
              ext = ss12u1
              input = ref
              env_vars = 1
              reportable = 1
              output_format = asc,config,raw
              teeout = yes
              teerunout = yes
              check_md5 = 1
              #mean_anyway = 1

              ###### Compiler used #################
              default=default:
              CC=/opt/sun/sunstudio12.1/prod/bin/cc
              FC=/opt/sun/sunstudio12.1/prod/bin/f90

              ######## Portability Flags and Environment variables ##################
              318.galgel_m=default=default=default:
              FPORTABILITY = -e -fixed

              default=default=default=default:
              notes41000= Portablility flags:
              notes41002= 318.galgel_m : -e -fixed
              notes41004=
              notes41005= Extra art allowed flags:
              notes41006= 330.art_m : -DINTS_PER_CACHELINE=16 -DDBLS_PER_CACHELINE=8
              notes41012=
              notes41013= Base and Peak User Environment:
              notes41014= export OMP_NUM_THREADS=16
              notes41016= export SUNW_MP_PROCBIND=TRUE
              notes41017= export SUNW_MP_THR_IDLE=SPIN
              notes41018= export OMP_NESTED=FALSE
              notes41019= export OMP_WAIT_POLICY=active
              notes41020= export OMP_STACKSIZE=10M
              notes41021= export OMP_DYNAMIC=TRUE
              notes41022= ulimit -s unlimited
              notes41031=
              notes41036= Default BIOS settings used.
              notes41037=

              #################### SPEC OMPM2001 Portability flags #################
              330.art_m=default=default=default:
              EXTRA_CFLAGS = -DINTS_PER_CACHELINE=16 -DDBLS_PER_CACHELINE=8

              #################### Baseline Optimization Flags ######################
              medium=base=default=default:
              #FOPTIMIZE = -fast -xarch=generic -xautopar -xopenmp -xipo=2 -xvector=lib -m64 -aligncommon=16 -fns=no
              FOPTIMIZE = -fast -xarch=generic -xautopar -xopenmp -xipo=2 -m64
              COPTIMIZE = -fast -fns=no -xalias_level=layout -xdepend=no -m64 -xopenmp
              EXTRA_LDFLAGS =
              EXTRA_LIBS=
              ONESTEP=yes

              default=default=default=default:
              notes121 = Compiler Invocation:
              notes122 = C : cc
              notes123 = F90 : f90
              notes124 = F77 : f90
              notes125 =
              notes126 = Base tuning:
              notes127 = Fortran : -fast -xarch=generic -xautopar -xopenmp -xipo=2 -xvector=lib
              notes128 = -m64 -aligncommon=16 -fns=no
              notes129 = C : -fast -xopenmp -xautopar -xipo=2 -xvector=lib -xprefetch -xprefetch_level=3 -m64
              notes130 = ONESTEP=yes

              ######################### Peak Flags #############################
              medium=peak=default=default:
              ONESTEP = yes
              notes300_0 =
              notes300_1 = Peak tuning:
              notes300_2 = ONESTEP=yes for all peak tests.
              notes300_3 =

              310.wupwise_m=peak=default=default:
              #ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
              #ENV_OMP_NUM_THREADS=8
              OPTIMIZE = -fast -xarch=generic -xautopar -xopenmp -xipo=2 -xprefetch -xprefetch_level=3 -m64
              notes310_1 = 310.wupwise_m : -fast -xarch=generic -xautopar -xopenmp -xipo=2
              notes310_2 = -xvector=lib -xprefetch -xprefetch_level=3 -m64
              notes310_3 = ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
              notes310_4 = ENV_OMP_NUM_THREADS=8

              312.swim_m=peak=default=default:
              srcalt=ompl.32
              OPTIMIZE = -fast -xmodel=medium -Qoption ube -fsimple=3 -xipo=2 -m64 -xvector=simd -xopenmp
              ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
              ENV_OMP_NUM_THREADS=8
              notes312_1 = 312.swim_m : -fast -xmodel=medium -Qoption ube -fsimple=3 -xipo=2 -m64
              notes312_2 = -xvector=simd -xopenmp
              notes312_3 = srcalt = ompl.32
              notes312_4 = ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
              notes312_5 = ENV_OMP_NUM_THREADS=8

              314.mgrid_m=peak=default=default:
              OPTIMIZE = -fast -xopenmp -xautopar -xipo=2 -xvector=lib -xprefetch -xprefetch_level=3 -m64
              ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
              ENV_OMP_NUM_THREADS=8
              notes314_1 = 314.mgrid_m : -fast -xopenmp -xautopar -xipo=2 -xvector=lib -xprefetch -xprefetch_level=3 -m64
              notes314_2 = ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
              notes314_3 = ENV_OMP_NUM_THREADS=8

              316.applu_m=peak=default=default:
              OPTIMIZE = -fast -xopenmp -xautopar -xipo=2 -xvector=lib -xprefetch -xprefetch_level=3 -m64 -xmodel=medium
              srcalt=ompl
              ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
              ENV_OMP_NUM_THREADS=8
              notes316_1 = 316.applu_m : -fast -xopenmp -xautopar -xipo=2 -xvector=lib -xprefetch -xprefetch_level=3
              notes316_2 = -m64 -xmodel=medium
              notes316_3 = srcalt = ompl
              notes316_4 = ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
              notes316_5 = ENV_OMP_NUM_THREADS=8

              318.galgel_m=peak=default=default:
              OPTIMIZE = -O3 -xpagesize=2M -xipo=2 -xvector=simd -m64 -xopenmp
              ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
              ENV_OMP_NUM_THREADS=8
              fdo_pre0 = rm -rf ./feedback.profile
              fdo_post1 = if [ ! -d ./feedback.profile ]; then exit 1; fi
              PASS1_FFLAGS = -xprofile=collect:./feedback
              PASS2_FFLAGS = -xprofile=use:./feedback
              PASS1_LDFLAGS = -xprofile=collect:./feedback
              PASS2_LDFLAGS = -xprofile=use:./feedback
              EXTRA_LIBS = -xlic_lib=sunperf
              RM_SOURCES = lapak.f90
              notes318_1 = 318.galgel_m : -O3 -xpagesize=2M -xipo=2 -xvector=simd -m64 -xopenmp
              notes318_2 = -xlic_lib=sunperf +FDO
              notes318_3 = RM_SOURCES=lapak.f90
              notes318_4 = ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
              notes318_5 = ENV_OMP_NUM_THREADS=8

              320.equake_m=peak=default=default:
              OPTIMIZE = -fast -fns=no -xalias_level=layout -xdepend=no -m64 -xopenmp
              ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
              ENV_OMP_NUM_THREADS=8
              srcalt=ompl.32
              notes320_1 = 320.equake_m : -fast -fns=no -xalias_level=layout -xdepend=no -m64 -xopenmp
              notes320_2 = srcalt = ompl.32
              notes320_3 = ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
              notes320_4 = ENV_OMP_NUM_THREADS=8

              324.apsi_m=peak=default=default:
              OPTIMIZE = -fast -xipo=2 -m64 -xprefetch_level=3 -xvector -xopenmp
              ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
              ENV_OMP_NUM_THREADS=8
              srcalt=ompl.32
              notes324_1 = 324.apsi_m : -fast -xipo=2 -m64 -xprefetch_level=3 -xvector -xopenmp
              notes324_2 = srcalt = ompl.32
              notes324_3 = ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
              notes324_4 = ENV_OMP_NUM_THREADS=8

              326.gafort_m=peak=default=default:
              OPTIMIZE = -fast -fstore -aligncommon=8 -xpagesize=2M -m64 -xipo=2 -xvector=simd -xopenmp
              notes326_1 = 326.gafort_m : -fast -fstore -aligncommon=8 -xpagesize=2M -m64
              notes326_2 = -xipo=2 -xvector=simd -xopenmp

              328.fma3d_m=peak=default=default:
              FOPTIMIZE = -fast -xipo=2 -m64 -xvector=simd -xopenmp
              srcalt=ompl.32
              notes328_1 = 328.fma3d_m : -fast -xipo=2 -m64 -xvector=simd -xopenmp
              notes328_2 = srcalt = ompl.32

              330.art_m=peak=default=default:
              OPTIMIZE = -fast -xipo=2 -m64 -xopenmp
              ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
              ENV_OMP_NUM_THREADS=8
              notes330_1 = 330.art_m : -fast -xipo=2 -m64 -xopenmp
              notes330_2 = ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
              notes330_3 = ENV_OMP_NUM_THREADS=8

              332.ammp_m=peak=default=default:
              OPTIMIZE = -fast -xautopar -xipo=2 -xvector=simd -m64 -Wu,-fsimple=3 -xopenmp -xpagesize=2M
              notes332_1 = 332.ammp_m : -fast -xautopar -xipo=2 -xvector=simd -m64 -Wu,-fsimple=3 -xopenmp
              notes332_2 = -xpagesize=2M

              #
              # machine configuration
              #
              hw_vendor =Supermicro
              hw_model =
              hw_cpu = AMD
              hw_cpu_mhz =
              hw_fpu =
              hw_ncpu =
              hw_ncpuorder=
              hw_pcache =
              hw_scache =
              hw_tcache =
              hw_ocache =
              hw_memory =
              hw_disk =
              hw_avail =
              hw_other =
              sw_os =
              sw_compiler = Sun Studio Compiler
              sw_Kernel_Extensions = None
              sw_file = xfs
              sw_state = Multi-User
              sw_avail = Jun-2009
              sw_parallel = OpenMP and Automatic parallel
              license_num =
              tester_name = 3Leaf
              test_date = Dec-2009
              test_site = iSanta Clara
              company_name= 3Leaf Systems
              machine_name=
              prepared_by =

                • open64 compiler and specOMP
                  santosh-zan

                   

                   

                  Hello jkong,

                  Can you please share the following details:

                   

                  Machine configuration (/proc/cpuinfo and /proc/meminfo)

                  OS and its version used

                  The OMP_NUM_THREADS variable is not set in your config file for the base run. Are you setting it some other way?

                  Actual results, or at least the overall geomean scores, if they can be shared for comparison purposes.

                  Open64 Compiler version

                   

                  regards,

                  Santosh





                    • open64 compiler and specOMP
                      jkong

                      Please see the attached script for my test run. OMP_NUM_THREADS is set in that script because I would like to see how performance scales as my machine grows from 4 to 31 cores.

                      In this particular run, I used 7 CPUs from a two socket machine.

                      My actual result is at the bottom.

                       

                      Other info:

                      OS:

                      Since the OS was reinstalled, I don't remember exactly which version I had installed. It was probably SLES10 SP2.

                      I will re-run the benchmark on SLES11 and compare the results again.

                      open64 Version:

                      x86_open64-4.2.3-1.x86_64.rpm

                      ACML: latest

                       

                      Last cpuinfo in /proc/cpuinfo:

                      processor    : 6
                      vendor_id    : AuthenticAMD
                      cpu family    : 16
                      model        : 4
                      model name    : Quad-Core AMD Opteron(tm) Processor 8382
                      stepping    : 2
                      cpu MHz        : 2613.388
                      cache size    : 512 KB
                      physical id    : 1
                      siblings    : 4
                      core id        : 3
                      cpu cores    : 4
                      apicid        : 7
                      initial apicid    : 7
                      fpu        : yes
                      fpu_exception    : yes
                      cpuid level    : 5
                      wp        : yes
                      flags        : fpu vme de pse tsc msr pae cx8 apic sep mtrr pge cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow rep_good nopl pni monitor cx16 popcnt lahf_lm abm sse4a 3dnowprefetch ibs sse5
                      bogomips    : 5229.16
                      TLB size    : 1024 4K pages
                      clflush size    : 64
                      cache_alignment    : 64
                      address sizes    : 48 bits physical, 48 bits virtual
                      power management:

                       

                      # cat /proc/meminfo


                      MemTotal:     48173652 kB
                      MemFree:      45738308 kB
                      Buffers:          1052 kB
                      Cached:         713364 kB
                      SwapCached:          0 kB
                      Active:        1880124 kB
                      Inactive:       384952 kB
                      SwapTotal:     1044216 kB
                      SwapFree:      1044216 kB
                      Dirty:              52 kB
                      Writeback:           0 kB
                      AnonPages:     1550840 kB
                      Mapped:          11376 kB
                      Slab:            30144 kB
                      SReclaimable:    17416 kB
                      SUnreclaim:      12728 kB
                      PageTables:       5224 kB
                      NFS_Unstable:        0 kB
                      Bounce:              0 kB
                      WritebackTmp:        0 kB
                      CommitLimit:  25131040 kB
                      Committed_AS:  1685160 kB
                      VmallocTotal: 34359738367 kB
                      VmallocUsed:    125924 kB
                      VmallocChunk: 34359612415 kB
                      HugePages_Total:     0
                      HugePages_Free:      0
                      HugePages_Rsvd:      0
                      HugePages_Surp:      0
                      Hugepagesize:     2048 kB
                      DirectMap4k:      8060 kB
                      DirectMap2M:  49205248 kB

                      These are the numbers I got:

                                        ss12u1    open64

                      wupwise_m         252       288
                      swim_m            419       450
                      mgrid_m           404       492
                      applu_m           407       402
                      galgel_m          271
                      equake_m          115       124
                      apsi_m            158       242
                      gafort_m          535
                      fma3d_m           418       358
                      art_m             134       138
                      ammp_m            609       668


                      #!/bin/sh
                      . shrc

                      CORES="7"
                      BENCHMARK=medium
                      CONFIG="3leaf.cfg"
                      OPTIONS="--noreportable --iterations=1 --config=$CONFIG --size=ref"
                      CPUTOTAL=$(grep "^processor" /proc/cpuinfo | tail -1 | awk '{print $NF}')

                      export SUNW_MP_PROCBIND=TRUE
                      export SUNW_MP_THR_IDLE=SPIN
                      export OMP_WAIT_POLICY=active
                      export OMP_STACKSIZE=10M
                      export OMP_NESTED=FALSE
                      export OMP_DYNAMIC=TRUE
                      #export STACKSIZE=16384
                      #export OMP_NUM_THREADS=16

                      for core in $CORES
                      do
                          export OMP_NUM_THREADS=${core}
                          export SUNW_MP_PROCBIND="$(( CPUTOTAL - core + 1 ))-${CPUTOTAL}"
                          runspec ${OPTIONS} ${BENCHMARK}
                      done
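The binding range that the run script computes for SUNW_MP_PROCBIND can be checked in isolation. The following is a minimal sketch, assuming CPUTOTAL holds the highest processor index from /proc/cpuinfo (as the grep/tail/awk pipeline suggests); the function name procbind_range is hypothetical and not part of the original script.

```shell
#!/bin/sh
# Hypothetical helper (not in the original script): build the "first-last"
# CPU range that the loop assigns to SUNW_MP_PROCBIND.
#   $1 = highest CPU index (CPUTOTAL in the script)
#   $2 = number of OpenMP threads to bind (core in the script)
procbind_range() {
    last=$1
    threads=$2
    first=$(( last - threads + 1 ))
    # Refuse to build a range that would start below CPU 0.
    if [ "$first" -lt 0 ]; then
        echo "procbind_range: $threads threads do not fit in CPUs 0-$last" >&2
        return 1
    fi
    echo "${first}-${last}"
}

procbind_range 7 7    # prints 1-7
procbind_range 7 4    # prints 4-7
```

The threads are packed onto the highest-numbered CPUs, so CPU 0 is left free whenever fewer threads than CPUs are requested.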

                        • open64 compiler and specOMP
                          jkong

                          On SLES 11, here are the results

                          open64 (4.2.3):

                             310.wupwise_m      6000       288     20861*
                             312.swim_m         6000       443     13539*
                             314.mgrid_m        7300       481     15188*
                             316.applu_m        4000       400      9999*
                             318.galgel_m                              X
                             320.equake_m       2600       124     20898*
                             324.apsi_m         3400       243     13975*
                             326.gafort_m                              X
                             328.fma3d_m        4600       354     12996*
                             330.art_m          6400       138     46420*
                             332.ammp_m         7000       665     10522*
                             Est. SPECompMbase2001                    --
                             Est. SPECompMpeak2001                              

                          ss12u1:

                             310.wupwise_m      6000       252     23831*
                             312.swim_m         6000       414     14506*
                             314.mgrid_m        7300       392     18646*
                             316.applu_m        4000       407      9833*
                             318.galgel_m       5100       270     18869*
                             320.equake_m       2600       114     22773*
                             324.apsi_m         3400       161     21156*
                             326.gafort_m       8700       537     16201*
                             328.fma3d_m        4600       420     10953*
                             330.art_m          6400       133     48227*
                             332.ammp_m         7000       601     11642*
                             Est. SPECompMbase2001                    --
                             Est. SPECompMpeak2001 

                          One difference is that Sun's compiler explicitly allows direct cpu binding. Not sure if that helps or not.                                  
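The per-benchmark gap between the two SLES 11 tables above can be quantified with a small helper. This is a sketch only: it assumes the middle column of each table is the runtime in seconds, and the function name pct_slower is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: percent by which runtime $2 exceeds runtime $1.
# A positive result means the second run is slower, a negative one faster.
pct_slower() {
    awk -v base="$1" -v other="$2" \
        'BEGIN { printf "%.1f\n", (other - base) / base * 100 }'
}

# Arguments are (ss12u1 runtime, open64 runtime) from the tables above.
pct_slower 252 288    # 310.wupwise_m, prints 14.3
pct_slower 161 243    # 324.apsi_m,    prints 50.9
pct_slower 420 354    # 328.fma3d_m,   prints -15.7
```

Running it over every row shows which benchmarks account for most of the difference.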

                            • open64 compiler and specOMP
                              santosh-zan

                              Hello Jkong,

                              You can set the CPU binding for the open64 compiler with the "O64_OMP_AFFINITY_MAP" environment variable. For more information, please see page 124 of the "Using the x86 Open64 Compiler Suite" document at the following URL.

                              http://developer.amd.com/cpu/open64/assets/x86_open64_user_guide.pdf

                              Can you please update us on your findings after using the CPU bindings?

                               

                               

                                • open64 compiler and specOMP
                                  santosh-zan

                                  The galgel failure may be caused by the slave thread stack size not being set properly. Can you please set it as follows:

                                  export OMP_SLAVE_STACK_SIZE=22M

                                    • open64 compiler and specOMP
                                      jkong

                                      Following your suggestion, I set both variables. One problem I had was that

                                      export OMP_SLAVE_STACK_SIZE=22M

                                      was too big. I changed the value to 2M so that wupwise could start. However, galgel still failed to run. Here are the results before the galgel failure:

                                       Success 310.wupwise_m ratio=20785.91, runtime=288.657137
                                       Success 312.swim_m ratio=13342.47, runtime=449.691920
                                       Success 314.mgrid_m ratio=14812.13, runtime=492.839398

                                      It doesn't seem that any improvement was made.

                                        • open64 compiler and specOMP
                                          santosh-zan

                                           

                                          1. Issues with galgel and gafort

                                          A. OMP_SLAVE_STACK_SIZE needs to be set to at least 10 MB; otherwise the run failed for me. Can you please try setting it to 10 MB or more and check?

                                          B. Please also set the stack ulimit to unlimited, i.e. ulimit -s unlimited.

                                          C. If you continue to face issues, please send us the contents of the error files, OMPM2001/326.gafort_m/run/00000003/gafort.err and galgel.err.

                                           

                                          2. Performance issue:

                                          A. In our initial measurement, Open64 seems to be ~4% better in the overall geomean, but it lags by 10-25% in some individual benchmarks, as you have pointed out.

                                          B. If you get galgel and gafort to work, then please check the overall geomean and let us know whether you see Open64 gaining overall by ~4%.

                                          C. I will be filing a bug report and will let you know once we have resolved this.

                                           

                                          Thanks

                                          Santosh
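The overall geomean comparison requested above can be reproduced from the per-benchmark ratios. The sketch below assumes the overall score is the geometric mean of those ratios, as the word "geomean" in this thread suggests; the function name geomean is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: geometric mean of the positive numbers given as
# arguments, computed as exp(mean(log(x))) and rounded to an integer.
geomean() {
    printf '%s\n' "$@" | awk '
        $1 > 0 { sum += log($1); n++ }
        END    { if (n > 0) printf "%.0f\n", exp(sum / n) }'
}

# ss12u1 ratios from the SLES 11 run posted earlier in this thread.
geomean 23831 14506 18646 9833 18869 22773 21156 16201 10953 48227 11642
```

The two compilers are only comparable when the same set of benchmarks is included on both sides, so a run with failed benchmarks should be compared over the benchmarks that completed for both.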



                                            • open64 compiler and specOMP
                                              jkong

                                              May I have your specOMP configuration, or your compiler options? After changing OMP_SLAVE_STACK_SIZE to 22 MB, both gafort and galgel ran through. However, I got a ~6% slower result. I also tried the -OPT:early_mp option and did not see much difference. The average ratio is 18452 from Open64 with

                                              Thank you,

                                              Jun

                                                • open64 compiler and specOMP
                                                  santosh-zan

                                                   

                                                  Hi Jun,

                                                  For the open64 compiler I used the same flags that you posted here, listed below. For Sun, we used the configuration file you provided. We also did the thread binding. Our experiments were on 2 sockets with 4 cores per socket, and OMP threads were set to 8 on a SLES 10 SP2 machine.

                                                  FOPTIMIZE = -Ofast -mp -HP -mso -LNO:prefetch=3 -LNO:simd=1 -LNO:vintr=1 -march=barcelona

                                                  COPTIMIZE = -Ofast -mp -HP -mso -LNO:prefetch=3 -LNO:simd=1 -LNO:vintr=1 -march=barcelona

                                                  EXTRA_LIBS= -I/home/szanjurn/acml/open64_64_int64/include -L/home/szanjurn/acml/open64_64_int64/lib -lacml_mv

                                                  I used both numactl and the OMP environment variable for thread binding, in the following way.

                                                  export O64_OMP_AFFINITY_MAP="7 6 5 3 4 2 1 0"

                                                  bind0=numactl --physcpubind=0,1,2,3,4,5,6,7 -l wupwise_base.opencc swim_base.opencc mgrid_base.opencc applu_base.opencc galgel_base.opencc equake_base.opencc apsi_base.opencc gafort_base.opencc fma3d_base.opencc art_base.opencc ammp_base.opencc

                                                  Open64 does well for galgel and gafort, do you see similar trends? Can you share the runtime and scores obtained for these two benchmarks? I am assuming the runtime and scores for the rest of the benchmarks are the same as what you had shared before.

                                                  We can exchange the config file offline. You will find a private message waiting for you on the left side of the Forum web page, under the "Private Messages" label, when you log in to the forum.

                                                  santosh


                                                    • open64 compiler and specOMP
                                                      jkong

                                                      Using your configuration, I ran SPEC OMP again and indeed, as you said, open64 is slightly better than ss12u1 overall. In particular, open64 has a clear edge on galgel, fma3d and ammp.

                                                      Comparing the runs between huge pages enabled and disabled: when I run the benchmark with fewer threads and bind them to the same socket, the results are almost the same. When I use 8 sockets, the test with huge pages enabled is 0.24% slower overall, which may be within the fluctuation margin. I am not sure whether the fact that huge pages are not NUMA sensitive plays any role here.

                                                      Huge pages are enabled as follows: -HP:heap=2m,bdt=2m
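When comparing runs with and without huge pages, it may help to record how many huge pages the kernel has reserved at the time of each run. The sketch below only reads the HugePages_Total field that appears in the /proc/meminfo output posted earlier; whether the -HP option depends on that pool is an assumption not confirmed in this thread, and the function name hugepages_total is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the HugePages_Total value from a
# meminfo-style file (defaults to /proc/meminfo).
hugepages_total() {
    awk '/^HugePages_Total:/ { print $2 }' "${1:-/proc/meminfo}"
}

# Report the current value when /proc/meminfo is readable.
if [ -r /proc/meminfo ]; then
    echo "HugePages_Total: $(hugepages_total)"
fi
```

Logging this value next to each result makes it easier to tell later whether a run really had huge pages available.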