Archives Discussions

jkong
Journeyman III

open64 compiler and specOMP

Performance benchmarking with open64 and acml_mv

I am trying to compare the latest open64 with the ACML math library against Sun Studio 12 update 1 (ss12u1). The results show that open64-generated executables are 10-25% slower than ss12u1's. Only on fma3d_m does open64 give a better result.

The options I use are:

FOPTIMIZE = -Ofast -mp -HP -mso -LNO:prefetch=3 -LNO:simd=1 -LNO:vintr=1 -march=barcelona
COPTIMIZE = -Ofast -mp -HP -mso -LNO:prefetch=3 -LNO:simd=1 -LNO:vintr=1 -march=barcelona
EXTRA_LDFLAGS =
EXTRA_LIBS= -I/opt/acml4.3.0/open64_64/include -L/opt/acml4.3.0/open64_64/lib -lacml_mv

Has anyone run SPEC OMP with open64? I searched the SPEC results site and could not find any results using open64.

Thanks,

Jun

12 Replies
santosh-zan
Journeyman III

Thank you for letting us know.  We are in the process of measuring the SPEC OMP numbers and will get back to you soon.

In the meantime, can you share the SPEC config file for Sun Studio used for your study?



 

regards

santosh


Thank you for the reply. 

Please find attached my configuration file for ss12u1. Please ignore the peak section, as I am still working on it.

Thanks again,

Jun

[jkong@view ~]$ cat 3leaf.cfg
# Invocation command line:
# runspec -c 3leaf.cfg --noreportable medium
############################################################################
############################################################################
#
# VENDOR = 3Leaf
action = validate
tune = base
ext = ss12u1
input = ref
env_vars = 1
reportable = 1
output_format = asc,config,raw
teeout = yes
teerunout = yes
check_md5 = 1
#mean_anyway = 1
###### Compiler used #################
default=default:
CC=/opt/sun/sunstudio12.1/prod/bin/cc
FC=/opt/sun/sunstudio12.1/prod/bin/f90
######## Portability Flags and Environment variables ##################
318.galgel_m=default=default=default:
FPORTABILITY = -e -fixed
default=default=default=default:
notes41000= Portability flags:
notes41002= 318.galgel_m : -e -fixed
notes41004=
notes41005= Extra art allowed flags:
notes41006= 330.art_m : -DINTS_PER_CACHELINE=16 -DDBLS_PER_CACHELINE=8
notes41012=
notes41013= Base and Peak User Environment:
notes41014= export OMP_NUM_THREADS=16
notes41016= export SUNW_MP_PROCBIND=TRUE
notes41017= export SUNW_MP_THR_IDLE=SPIN
notes41018= export OMP_NESTED=FALSE
notes41019= export OMP_WAIT_POLICY=active
notes41020= export OMP_STACKSIZE=10M
notes41021= export OMP_DYNAMIC=TRUE
notes41022= ulimit -s unlimited
notes41031=
notes41036= Default BIOS settings used.
notes41037=
#################### SPEC OMPM2001 Portability flags #################
330.art_m=default=default=default:
EXTRA_CFLAGS = -DINTS_PER_CACHELINE=16 -DDBLS_PER_CACHELINE=8
#################### Baseline Optimization Flags ######################
medium=base=default=default:
#FOPTIMIZE = -fast -xarch=generic -xautopar -xopenmp -xipo=2 -xvector=lib -m64 -aligncommon=16 -fns=no
FOPTIMIZE = -fast -xarch=generic -xautopar -xopenmp -xipo=2 -m64
COPTIMIZE = -fast -fns=no -xalias_level=layout -xdepend=no -m64 -xopenmp
EXTRA_LDFLAGS =
EXTRA_LIBS=
ONESTEP=yes
default=default=default=default:
notes121 = Compiler Invocation:
notes122 = C : cc
notes123 = F90 : f90
notes124 = F77 : f90
notes125 =
notes126 = Base tuning:
notes127 = Fortran : -fast -xarch=generic -xautopar -xopenmp -xipo=2 -xvector=lib
notes128 = -m64 -aligncommon=16 -fns=no
notes129 = C : -fast -xopenmp -xautopar -xipo=2 -xvector=lib -xprefetch -xprefetch_level=3 -m64
notes130 = ONESTEP=yes
######################### Peak Flags #############################
medium=peak=default=default:
ONESTEP = yes
notes300_0 =
notes300_1 = Peak tuning:
notes300_2 = ONESTEP=yes for all peak tests.
notes300_3 =
310.wupwise_m=peak=default=default:
#ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
#ENV_OMP_NUM_THREADS=8
OPTIMIZE = -fast -xarch=generic -xautopar -xopenmp -xipo=2 -xprefetch -xprefetch_level=3 -m64
notes310_1 = 310.wupwise_m : -fast -xarch=generic -xautopar -xopenmp -xipo=2
notes310_2 = -xvector=lib -xprefetch -xprefetch_level=3 -m64
notes310_3 = ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
notes310_4 = ENV_OMP_NUM_THREADS=8
312.swim_m=peak=default=default:
srcalt=ompl.32
OPTIMIZE = -fast -xmodel=medium -Qoption ube -fsimple=3 -xipo=2 -m64 -xvector=simd -xopenmp
ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
ENV_OMP_NUM_THREADS=8
notes312_1 = 312.swim_m : -fast -xmodel=medium -Qoption ube -fsimple=3 -xipo=2 -m64
notes312_2 = -xvector=simd -xopenmp
notes312_3 = srcalt = ompl.32
notes312_4 = ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
notes312_5 = ENV_OMP_NUM_THREADS=8
314.mgrid_m=peak=default=default:
OPTIMIZE = -fast -xopenmp -xautopar -xipo=2 -xvector=lib -xprefetch -xprefetch_level=3 -m64
ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
ENV_OMP_NUM_THREADS=8
notes314_1 = 314.mgrid_m : -fast -xopenmp -xautopar -xipo=2 -xvector=lib -xprefetch -xprefetch_level=3 -m64
notes314_2 = ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
notes314_3 = ENV_OMP_NUM_THREADS=8
316.applu_m=peak=default=default:
OPTIMIZE = -fast -xopenmp -xautopar -xipo=2 -xvector=lib -xprefetch -xprefetch_level=3 -m64 -xmodel=medium
srcalt=ompl
ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
ENV_OMP_NUM_THREADS=8
notes316_1 = 316.applu_m : -fast -xopenmp -xautopar -xipo=2 -xvector=lib -xprefetch -xprefetch_level=3
notes316_2 = -m64 -xmodel=medium
notes316_3 = srcalt = ompl
notes316_4 = ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
notes316_5 = ENV_OMP_NUM_THREADS=8
318.galgel_m=peak=default=default:
OPTIMIZE = -O3 -xpagesize=2M -xipo=2 -xvector=simd -m64 -xopenmp
ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
ENV_OMP_NUM_THREADS=8
fdo_pre0 = rm -rf ./feedback.profile
fdo_post1 = if [ ! -d ./feedback.profile ]; then exit 1; fi
PASS1_FFLAGS = -xprofile=collect:./feedback
PASS2_FFLAGS = -xprofile=use:./feedback
PASS1_LDFLAGS = -xprofile=collect:./feedback
PASS2_LDFLAGS = -xprofile=use:./feedback
EXTRA_LIBS = -xlic_lib=sunperf
RM_SOURCES = lapak.f90
notes318_1 = 318.galgel_m : -O3 -xpagesize=2M -xipo=2 -xvector=simd -m64 -xopenmp
notes318_2 = -xlic_lib=sunperf +FDO
notes318_3 = RM_SOURCES=lapak.f90
notes318_4 = ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
notes318_5 = ENV_OMP_NUM_THREADS=8
320.equake_m=peak=default=default:
OPTIMIZE = -fast -fns=no -xalias_level=layout -xdepend=no -m64 -xopenmp
ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
ENV_OMP_NUM_THREADS=8
srcalt=ompl.32
notes320_1 = 320.equake_m : -fast -fns=no -xalias_level=layout -xdepend=no -m64 -xopenmp
notes320_2 = srcalt = ompl.32
notes320_3 = ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
notes320_4 = ENV_OMP_NUM_THREADS=8
324.apsi_m=peak=default=default:
OPTIMIZE = -fast -xipo=2 -m64 -xprefetch_level=3 -xvector -xopenmp
ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
ENV_OMP_NUM_THREADS=8
srcalt=ompl.32
notes324_1 = 324.apsi_m : -fast -xipo=2 -m64 -xprefetch_level=3 -xvector -xopenmp
notes324_2 = srcalt = ompl.32
notes324_3 = ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
notes324_4 = ENV_OMP_NUM_THREADS=8
326.gafort_m=peak=default=default:
OPTIMIZE = -fast -fstore -aligncommon=8 -xpagesize=2M -m64 -xipo=2 -xvector=simd -xopenmp
notes326_1 = 326.gafort_m : -fast -fstore -aligncommon=8 -xpagesize=2M -m64
notes326_2 = -xipo=2 -xvector=simd -xopenmp
328.fma3d_m=peak=default=default:
FOPTIMIZE = -fast -xipo=2 -m64 -xvector=simd -xopenmp
srcalt=ompl.32
notes328_1 = 328.fma3d_m : -fast -xipo=2 -m64 -xvector=simd -xopenmp
notes328_2 = srcalt = ompl.32
330.art_m=peak=default=default:
OPTIMIZE = -fast -xipo=2 -m64 -xopenmp
ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
ENV_OMP_NUM_THREADS=8
notes330_1 = 330.art_m : -fast -xipo=2 -m64 -xopenmp
notes330_2 = ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
notes330_3 = ENV_OMP_NUM_THREADS=8
332.ammp_m=peak=default=default:
OPTIMIZE = -fast -xautopar -xipo=2 -xvector=simd -m64 -Wu,-fsimple=3 -xopenmp -xpagesize=2M
notes332_1 = 332.ammp_m : -fast -xautopar -xipo=2 -xvector=simd -m64 -Wu,-fsimple=3 -xopenmp
notes332_2 = -xpagesize=2M
#
# machine configuration
#
hw_vendor = Supermicro
hw_model =
hw_cpu = AMD
hw_cpu_mhz =
hw_fpu =
hw_ncpu =
hw_ncpuorder=
hw_pcache =
hw_scache =
hw_tcache =
hw_ocache =
hw_memory =
hw_disk =
hw_avail =
hw_other =
sw_os =
sw_compiler = Sun Studio Compiler
sw_Kernel_Extensions = None
sw_file = xfs
sw_state = Multi-User
sw_avail = Jun-2009
sw_parallel = OpenMP and Automatic parallel
license_num =
tester_name = 3Leaf
test_date = Dec-2009
test_site = Santa Clara
company_name= 3Leaf Systems
machine_name=
prepared_by =


Hello jkong,

Can you please share the following details:

Machine configuration (/proc/cpuinfo and /proc/meminfo)

OS and its version

The OMP_NUM_THREADS variable is not used in your config file for the base run. Are you setting it in some other way?

Actual results, or at least the overall geomean scores, if they can be shared for comparison purposes

Open64 compiler version

 

regards,

Santosh






Please see the attached script for my test run. OMP_NUM_THREADS is set in that script, because I want to see how performance scales as my machine grows from 4 to 31 cores.

In this particular run, I used 7 CPUs from a two-socket machine.

My actual result is at the bottom.

 

Other info:

OS:

Since the OS was reinstalled, I don't remember exactly which version I had installed. It was probably SLES10 SP2.

I will re-run the benchmark on SLES11 and compare the results again.

open64 Version:

x86_open64-4.2.3-1.x86_64.rpm

ACML: latest

 

The last CPU entry in /proc/cpuinfo:

processor    : 6
vendor_id    : AuthenticAMD
cpu family    : 16
model        : 4
model name    : Quad-Core AMD Opteron(tm) Processor 8382
stepping    : 2
cpu MHz        : 2613.388
cache size    : 512 KB
physical id    : 1
siblings    : 4
core id        : 3
cpu cores    : 4
apicid        : 7
initial apicid    : 7
fpu        : yes
fpu_exception    : yes
cpuid level    : 5
wp        : yes
flags        : fpu vme de pse tsc msr pae cx8 apic sep mtrr pge cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow rep_good nopl pni monitor cx16 popcnt lahf_lm abm sse4a 3dnowprefetch ibs sse5
bogomips    : 5229.16
TLB size    : 1024 4K pages
clflush size    : 64
cache_alignment    : 64
address sizes    : 48 bits physical, 48 bits virtual
power management:

 

# cat /proc/meminfo


MemTotal:     48173652 kB
MemFree:      45738308 kB
Buffers:          1052 kB
Cached:         713364 kB
SwapCached:          0 kB
Active:        1880124 kB
Inactive:       384952 kB
SwapTotal:     1044216 kB
SwapFree:      1044216 kB
Dirty:              52 kB
Writeback:           0 kB
AnonPages:     1550840 kB
Mapped:          11376 kB
Slab:            30144 kB
SReclaimable:    17416 kB
SUnreclaim:      12728 kB
PageTables:       5224 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
WritebackTmp:        0 kB
CommitLimit:  25131040 kB
Committed_AS:  1685160 kB
VmallocTotal: 34359738367 kB
VmallocUsed:    125924 kB
VmallocChunk: 34359612415 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
HugePages_Surp:      0
Hugepagesize:     2048 kB
DirectMap4k:      8060 kB
DirectMap2M:  49205248 kB

These are the numbers I got:

              ss12u1   open64
wupwise_m        252      288
swim_m           419      450
mgrid_m          404      492
applu_m          407      402
galgel_m         271        X
equake_m         115      124
apsi_m           158      242
gafort_m         535        X
fma3d_m          418      358
art_m            134      138
ammp_m           609      668

(runtimes in seconds; X = failed to run)


#!/bin/sh
. shrc
CORES="7"
BENCHMARK=medium
CONFIG="3leaf.cfg"
OPTIONS="--noreportable --iterations=1 --config=$CONFIG --size=ref"
CPUTOTAL=$(grep "^processor" /proc/cpuinfo | tail -1 | awk '{print $NF}')
export SUNW_MP_PROCBIND=TRUE
export SUNW_MP_THR_IDLE=SPIN
export OMP_WAIT_POLICY=active
export OMP_STACKSIZE=10M
export OMP_NESTED=FALSE
export OMP_DYNAMIC=TRUE
#export STACKSIZE=16384
#export OMP_NUM_THREADS=16
for core in $CORES
do
    export OMP_NUM_THREADS=${core}
    export SUNW_MP_PROCBIND="$(( CPUTOTAL - core + 1 ))-${CPUTOTAL}"
    runspec ${OPTIONS} ${BENCHMARK}
done
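As a quick sanity check of the SUNW_MP_PROCBIND arithmetic in the script: CPUTOTAL is the highest processor index reported by /proc/cpuinfo (7 on an 8-CPU box), and the range binds the last $core logical CPUs. The concrete values below are illustrative:

```shell
# Reproduce the binding-range computation from the script: with the
# highest processor index CPUTOTAL=7 and core=7 threads, the computed
# range binds logical CPUs 1 through 7, i.e. seven CPUs.
CPUTOTAL=7
core=7
echo "$(( CPUTOTAL - core + 1 ))-${CPUTOTAL}"   # prints 1-7
```

This matches "I used 7 CPUs from a two-socket machine" above: CPU 0 is left unbound.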


On SLES 11, here are the results

open64 (4.2.3):

   310.wupwise_m     6000      288    20861*
   312.swim_m        6000      443    13539*
   314.mgrid_m       7300      481    15188*
   316.applu_m       4000      400     9999*
   318.galgel_m                           X
   320.equake_m      2600      124    20898*
   324.apsi_m        3400      243    13975*
   326.gafort_m                           X
   328.fma3d_m       4600      354    12996*
   330.art_m         6400      138    46420*
   332.ammp_m        7000      665    10522*
   Est. SPECompMbase2001                 --
   Est. SPECompMpeak2001

ss12u1:

   310.wupwise_m     6000      252    23831*
   312.swim_m        6000      414    14506*
   314.mgrid_m       7300      392    18646*
   316.applu_m       4000      407     9833*
   318.galgel_m      5100      270    18869*
   320.equake_m      2600      114    22773*
   324.apsi_m        3400      161    21156*
   326.gafort_m      8700      537    16201*
   328.fma3d_m       4600      420    10953*
   330.art_m         6400      133    48227*
   332.ammp_m        7000      601    11642*
   Est. SPECompMbase2001                 --
   Est. SPECompMpeak2001

One difference is that Sun's compiler explicitly allows direct CPU binding. Not sure whether that helps or not.


Hello Jkong,

You can set CPU binding with the open64 compiler using the O64_OMP_AFFINITY_MAP environment variable. For more information, please see page 124 of the "Using the x86 Open64 Compiler Suite" document at the following URL:

http://developer.amd.com/cpu/open64/assets/x86_open64_user_guide.pdf
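For illustration, a minimal sketch of setting the variable before a run (the variable name is from the user guide; the particular core order and the commented runspec invocation are just assumptions):

```shell
# Map the 8 OpenMP threads to explicit cores; the Open64 OpenMP runtime
# reads O64_OMP_AFFINITY_MAP at startup. The core order here is only
# an example -- reorder to match your socket/core topology.
export OMP_NUM_THREADS=8
export O64_OMP_AFFINITY_MAP="0 1 2 3 4 5 6 7"
echo "threads=$OMP_NUM_THREADS map=$O64_OMP_AFFINITY_MAP"
# runspec --config=3leaf.cfg --noreportable medium
```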

Can you please update us on your findings after using CPU binding?

 

 


The galgel failure may be caused by the slave stack size not being set properly. Can you please set it the following way:

export OMP_SLAVE_STACK_SIZE=22M
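For example, both stack settings can be exported before invoking runspec (a sketch; the 22M value is the one suggested above, and the ulimit line mirrors the config file's base environment):

```shell
# Master thread stack: raise the shell's stack ulimit (may be capped by
# the system, hence the fallback). Slave (worker) thread stacks: the
# Open64 OpenMP runtime reads OMP_SLAVE_STACK_SIZE.
ulimit -s unlimited 2>/dev/null || true
export OMP_SLAVE_STACK_SIZE=22M
echo "slave stack = $OMP_SLAVE_STACK_SIZE"
```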


Following your suggestion, I set both variables. One problem I had was that

export OMP_SLAVE_STACK_SIZE=22M

was too big. I changed the value to 2M so wupwise could start. However, galgel still failed to run. Here are the results before the galgel failure:

 Success 310.wupwise_m ratio=20785.91, runtime=288.657137
 Success 312.swim_m ratio=13342.47, runtime=449.691920
 Success 314.mgrid_m ratio=14812.13, runtime=492.839398

It doesn't seem that any improvement was made.


1. Issues with galgel and gafort

A. OMP_SLAVE_STACK_SIZE needs to be set to at least 10 MB; otherwise it failed for me. Can you please try setting it to at least 10 MB and check?

B. Please also set the stack ulimit to unlimited, i.e. ulimit -s unlimited.

C. If you still face issues, please send us the contents of the error files OMPM2001/326.gafort_m/run/00000003/gafort.err and galgel.err.

 

2. Performance issue:

A. In our initial measurements, Open64 seems to be ~4% better in the overall geomean, but lags by 10-25% in some individual benchmarks, as you have pointed out.

B. If you get galgel and gafort to work, please check the overall geomean and let us know whether you also see Open64 ahead overall by ~4%.

C. I will file a bug report and will let you know once we have resolved this.

 

Thanks

Santosh




May I have your SPEC OMP configuration, or your compiler options? After changing OMP_SLAVE_STACK_SIZE to 22 MB, both gafort and galgel ran through. However, I got a ~6% slower result. I also tried the -OPT:early_mp option but did not see much difference. The average ratio from Open64 is 18452.

Thank you,

Jun


Hi Jun,

I used the same flags you had posted here for the open64 compiler, listed below. For Sun, we used the configuration file you provided. We did thread binding as well. Our experiments were on 2 sockets, 4 cores per socket, with OMP threads set to 8, on an SLES 10 SP2 machine.

FOPTIMIZE = -Ofast -mp -HP -mso -LNO:prefetch=3 -LNO:simd=1 -LNO:vintr=1 -march=barcelona

COPTIMIZE = -Ofast -mp -HP -mso -LNO:prefetch=3 -LNO:simd=1 -LNO:vintr=1 -march=barcelona

EXTRA_LIBS= -I/home/szanjurn/acml/open64_64_int64/include -L/home/szanjurn/acml/open64_64_int64/lib -lacml_mv

I used both numactl and the OMP environment variable for thread binding, in the following way:

export O64_OMP_AFFINITY_MAP="7 6 5 3 4 2 1 0"

bind0 = numactl --physcpubind=0,1,2,3,4,5,6,7 -l
(applied to wupwise_base.opencc, swim_base.opencc, mgrid_base.opencc, applu_base.opencc, galgel_base.opencc, equake_base.opencc, apsi_base.opencc, gafort_base.opencc, fma3d_base.opencc, art_base.opencc, ammp_base.opencc)

Open64 does well on galgel and gafort; do you see similar trends? Can you share the runtimes and scores you obtained for these two benchmarks? I am assuming the runtimes and scores for the rest of the benchmarks are the same as what you shared before.

We can exchange the config file offline. You will find a private message waiting for you under the "Private Messages" label on the left side of the forum page when you log in.

santosh



Using your configuration, I ran SPEC OMP again and indeed, as you said, open64 is overall slightly better than ss12u1. In particular, open64 has a clear edge on galgel, fma3d and ammp.

Comparing runs with huge pages enabled and disabled: when I run the benchmark with fewer threads bound to the same socket, the results are almost identical. When I use 8 sockets, the huge-page-enabled test is 0.24% slower overall, possibly within the fluctuation margin. Not sure whether the fact that huge pages are not NUMA-aware plays any role here.

Huge pages were enabled with: -HP:heap=2m,bdt=2m
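One way to sanity-check whether the 2 MB pages are actually in use is to watch the huge-page counters in /proc/meminfo during a run (a sketch; counter names can vary by kernel version):

```shell
# Show the huge-page counters; with 2 MB pages in use you would expect
# Hugepagesize to read 2048 kB and the usage counters to move during
# the run. Falls back gracefully on systems without /proc/meminfo.
grep -i huge /proc/meminfo 2>/dev/null || echo "no /proc/meminfo"
```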
