I am trying to compare the latest open64 plus the ACML math library against Sun Studio 12 Update 1 (ss12u1). The results show that open64-generated executables are 10-25% slower than those from ss12u1. Only on fma3d_m does open64 give a better result.
The options I use are:
FOPTIMIZE = -Ofast -mp -HP -mso -LNO:prefetch=3 -LNO:simd=1 -LNO:vintr=1 -march=barcelona
COPTIMIZE = -Ofast -mp -HP -mso -LNO:prefetch=3 -LNO:simd=1 -LNO:vintr=1 -march=barcelona
EXTRA_LDFLAGS =
EXTRA_LIBS= -I/opt/acml4.3.0/open64_64/include -L/opt/acml4.3.0/open64_64/lib -lacml_mv
Has anyone run SPEC OMP with open64? I searched the SPEC results site and could not find any result using open64.
Thanks,
Jun
Thank you for letting us know. We are in the process of measuring the SPEC OMP numbers and will get back to you soon.
In the meantime, can you share the SPEC config file for Sun Studio used for your study?
regards
santosh
Thank you for the reply.
Attached please find my configuration file for using ss12u1. Please ignore the peak section because I am still working on that.
Thanks again,
Jun
[jkong@view ~]$ cat 3leaf.cfg
# Invocation command line:
# runspec -c 3leaf.cfg --noreportable medium
############################################################################
############################################################################
#
# VENDOR = 3Leaf
action = validate
tune = base
ext = ss12u1
input = ref
env_vars = 1
reportable = 1
output_format = asc,config,raw
teeout = yes
teerunout = yes
check_md5 = 1
#mean_anyway = 1

###### Compiler used #################
default=default:
CC=/opt/sun/sunstudio12.1/prod/bin/cc
FC=/opt/sun/sunstudio12.1/prod/bin/f90

######## Portability Flags and Environment variables ##################
318.galgel_m=default=default=default:
FPORTABILITY = -e -fixed

default=default=default=default:
notes41000= Portablility flags:
notes41002= 318.galgel_m : -e -fixed
notes41004=
notes41005= Extra art allowed flags:
notes41006= 330.art_m : -DINTS_PER_CACHELINE=16 -DDBLS_PER_CACHELINE=8
notes41012=
notes41013= Base and Peak User Environment:
notes41014= export OMP_NUM_THREADS=16
notes41016= export SUNW_MP_PROCBIND=TRUE
notes41017= export SUNW_MP_THR_IDLE=SPIN
notes41018= export OMP_NESTED=FALSE
notes41019= export OMP_WAIT_POLICY=active
notes41020= export OMP_STACKSIZE=10M
notes41021= export OMP_DYNAMIC=TRUE
notes41022= ulimit -s unlimited
notes41031=
notes41036= Default BIOS settings used.
notes41037=

#################### SPEC OMPM2001 Portability flags #################
330.art_m=default=default=default:
EXTRA_CFLAGS = -DINTS_PER_CACHELINE=16 -DDBLS_PER_CACHELINE=8

#################### Baseline Optimization Flags ######################
medium=base=default=default:
#FOPTIMIZE = -fast -xarch=generic -xautopar -xopenmp -xipo=2 -xvector=lib -m64 -aligncommon=16 -fns=no
FOPTIMIZE = -fast -xarch=generic -xautopar -xopenmp -xipo=2 -m64
COPTIMIZE = -fast -fns=no -xalias_level=layout -xdepend=no -m64 -xopenmp
EXTRA_LDFLAGS =
EXTRA_LIBS=
ONESTEP=yes

default=default=default=default:
notes121 = Compiler Invocation:
notes122 = C : cc
notes123 = F90 : f90
notes124 = F77 : f90
notes125 =
notes126 = Base tuning:
notes127 = Fortran : -fast -xarch=generic -xautopar -xopenmp -xipo=2 -xvector=lib
notes128 = -m64 -aligncommon=16 -fns=no
notes129 = C : -fast -xopenmp -xautopar -xipo=2 -xvector=lib -xprefetch -xprefetch_level=3 -m64
notes130 = ONESTEP=yes

######################### Peak Flags #############################
medium=peak=default=default:
ONESTEP = yes
notes300_0 =
notes300_1 = Peak tuning:
notes300_2 = ONESTEP=yes for all peak tests.
notes300_3 =

310.wupwise_m=peak=default=default:
#ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
#ENV_OMP_NUM_THREADS=8
OPTIMIZE = -fast -xarch=generic -xautopar -xopenmp -xipo=2 -xprefetch -xprefetch_level=3 -m64
notes310_1 = 310.wupwise_m : -fast -xarch=generic -xautopar -xopenmp -xipo=2
notes310_2 = -xvector=lib -xprefetch -xprefetch_level=3 -m64
notes310_3 = ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
notes310_4 = ENV_OMP_NUM_THREADS=8

312.swim_m=peak=default=default:
srcalt=ompl.32
OPTIMIZE = -fast -xmodel=medium -Qoption ube -fsimple=3 -xipo=2 -m64 -xvector=simd -xopenmp
ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
ENV_OMP_NUM_THREADS=8
notes312_1 = 312.swim_m : -fast -xmodel=medium -Qoption ube -fsimple=3 -xipo=2 -m64
notes312_2 = -xvector=simd -xopenmp
notes312_3 = srcalt = ompl.32
notes312_4 = ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
notes312_5 = ENV_OMP_NUM_THREADS=8

314.mgrid_m=peak=default=default:
OPTIMIZE = -fast -xopenmp -xautopar -xipo=2 -xvector=lib -xprefetch -xprefetch_level=3 -m64
ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
ENV_OMP_NUM_THREADS=8
notes314_1 = 314.mgrid_m : -fast -xopenmp -xautopar -xipo=2 -xvector=lib -xprefetch -xprefetch_level=3 -m64
notes314_2 = ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
notes314_3 = ENV_OMP_NUM_THREADS=8

316.applu_m=peak=default=default:
OPTIMIZE = -fast -xopenmp -xautopar -xipo=2 -xvector=lib -xprefetch -xprefetch_level=3 -m64 -xmodel=medium
srcalt=ompl
ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
ENV_OMP_NUM_THREADS=8
notes316_1 = 316.applu_m : -fast -xopenmp -xautopar -xipo=2 -xvector=lib -xprefetch -xprefetch_level=3
notes316_2 = -m64 -xmodel=medium
notes316_3 = srcalt = ompl
notes316_4 = ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
notes316_5 = ENV_OMP_NUM_THREADS=8

318.galgel_m=peak=default=default:
OPTIMIZE = -O3 -xpagesize=2M -xipo=2 -xvector=simd -m64 -xopenmp
ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
ENV_OMP_NUM_THREADS=8
fdo_pre0 = rm -rf ./feedback.profile
fdo_post1 = if [ ! -d ./feedback.profile ]; then exit 1; fi
PASS1_FFLAGS = -xprofile=collect:./feedback
PASS2_FFLAGS = -xprofile=use:./feedback
PASS1_LDFLAGS = -xprofile=collect:./feedback
PASS2_LDFLAGS = -xprofile=use:./feedback
EXTRA_LIBS = -xlic_lib=sunperf
RM_SOURCES = lapak.f90
notes318_1 = 318.galgel_m : -O3 -xpagesize=2M -xipo=2 -xvector=simd -m64 -xopenmp
notes318_2 = -xlic_lib=sunperf +FDO
notes318_3 = RM_SOURCES=lapak.f90
notes318_4 = ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
notes318_5 = ENV_OMP_NUM_THREADS=8

320.equake_m=peak=default=default:
OPTIMIZE = -fast -fns=no -xalias_level=layout -xdepend=no -m64 -xopenmp
ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
ENV_OMP_NUM_THREADS=8
srcalt=ompl.32
notes320_1 = 320.equake_m : -fast -fns=no -xalias_level=layout -xdepend=no -m64 -xopenmp
notes320_2 = srcalt = ompl.32
notes320_3 = ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
notes320_4 = ENV_OMP_NUM_THREADS=8

324.apsi_m=peak=default=default:
OPTIMIZE = -fast -xipo=2 -m64 -xprefetch_level=3 -xvector -xopenmp
ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
ENV_OMP_NUM_THREADS=8
srcalt=ompl.32
notes324_1 = 324.apsi_m : -fast -xipo=2 -m64 -xprefetch_level=3 -xvector -xopenmp
notes324_2 = srcalt = ompl.32
notes324_3 = ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
notes324_4 = ENV_OMP_NUM_THREADS=8

326.gafort_m=peak=default=default:
OPTIMIZE = -fast -fstore -aligncommon=8 -xpagesize=2M -m64 -xipo=2 -xvector=simd -xopenmp
notes326_1 = 326.gafort_m : -fast -fstore -aligncommon=8 -xpagesize=2M -m64
notes326_2 = -xipo=2 -xvector=simd -xopenmp

328.fma3d_m=peak=default=default:
FOPTIMIZE = -fast -xipo=2 -m64 -xvector=simd -xopenmp
srcalt=ompl.32
notes328_1 = 328.fma3d_m : -fast -xipo=2 -m64 -xvector=simd -xopenmp
notes328_2 = srcalt = ompl.32

330.art_m=peak=default=default:
OPTIMIZE = -fast -xipo=2 -m64 -xopenmp
ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
ENV_OMP_NUM_THREADS=8
notes330_1 = 330.art_m : -fast -xipo=2 -m64 -xopenmp
notes330_2 = ENV_SUNW_MP_PROCBIND=8 10 12 14 9 11 13 15
notes330_3 = ENV_OMP_NUM_THREADS=8

332.ammp_m=peak=default=default:
OPTIMIZE = -fast -xautopar -xipo=2 -xvector=simd -m64 -Wu,-fsimple=3 -xopenmp -xpagesize=2M
notes332_1 = 332.ammp_m : -fast -xautopar -xipo=2 -xvector=simd -m64 -Wu,-fsimple=3 -xopenmp
notes332_2 = -xpagesize=2M

#
# machine configuration
#
hw_vendor = Supermicro
hw_model =
hw_cpu = AMD
hw_cpu_mhz =
hw_fpu =
hw_ncpu =
hw_ncpuorder=
hw_pcache =
hw_scache =
hw_tcache =
hw_ocache =
hw_memory =
hw_disk =
hw_avail =
hw_other =
sw_os =
sw_compiler = Sun Studio Compiler
sw_Kernel_Extensions = None
sw_file = xfs
sw_state = Multi-User
sw_avail = Jun-2009
sw_parallel = OpenMP and Automatic parallel
license_num =
tester_name = 3Leaf
test_date = Dec-2009
test_site = iSanta Clara
company_name= 3Leaf Systems
machine_name=
prepared_by =
Hello jkong,
Can you please share the following details:
• Machine configuration (/proc/cpuinfo and /proc/meminfo)
• OS and its version used
• The OMP_NUM_THREADS variable is not set in your config file for the base run. Are you setting it in some other way?
• Actual results, or at least the overall geomean scores, if they can be shared for comparison purposes.
• Open64 Compiler version
regards,
Santosh
Please see the attached script for my test run. OMP_NUM_THREADS is set in that script, because I would like to see the scalability as my machine grows from 4 to 31 cores.
In this particular run, I used 7 CPUs from a two socket machine.
My actual result is at the bottom.
Other info:
OS:
Since the OS was reinstalled, I don't remember exactly which version I installed. Probably it was SLES10 SP2.
I will re-run the benchmark on SLES11 and compare the results again
open64 Version:
x86_open64-4.2.3-1.x86_64.rpm
ACML: latest
Last cpuinfo in /proc/cpuinfo:
processor : 6
vendor_id : AuthenticAMD
cpu family : 16
model : 4
model name : Quad-Core AMD Opteron(tm) Processor 8382
stepping : 2
cpu MHz : 2613.388
cache size : 512 KB
physical id : 1
siblings : 4
core id : 3
cpu cores : 4
apicid : 7
initial apicid : 7
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae cx8 apic sep mtrr pge cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow rep_good nopl pni monitor cx16 popcnt lahf_lm abm sse4a 3dnowprefetch ibs sse5
bogomips : 5229.16
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management:
# cat /proc/meminfo
MemTotal: 48173652 kB
MemFree: 45738308 kB
Buffers: 1052 kB
Cached: 713364 kB
SwapCached: 0 kB
Active: 1880124 kB
Inactive: 384952 kB
SwapTotal: 1044216 kB
SwapFree: 1044216 kB
Dirty: 52 kB
Writeback: 0 kB
AnonPages: 1550840 kB
Mapped: 11376 kB
Slab: 30144 kB
SReclaimable: 17416 kB
SUnreclaim: 12728 kB
PageTables: 5224 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 25131040 kB
Committed_AS: 1685160 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 125924 kB
VmallocChunk: 34359612415 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 8060 kB
DirectMap2M: 49205248 kB
These are the numbers I got:
benchmark | ss12u1 | open64 |
wupwise_m | 252 | 288 |
swim_m | 419 | 450 |
mgrid_m | 404 | 492 |
applu_m | 407 | 402 |
galgel_m | 271 | |
equake_m | 115 | 124 |
apsi_m | 158 | 242 |
gafort_m | 535 | |
fma3d_m | 418 | 358 |
art_m | 134 | 138 |
ammp_m | 609 | 668 |
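To put the table in relative terms, the per-benchmark gap can be computed from each pair of numbers. This is a minimal sketch, assuming the columns are runtimes (lower is better); the pct_slower helper is a hypothetical name introduced only for illustration.

```shell
#!/bin/sh
# pct_slower SS12U1 OPEN64
# Prints by how many percent the open64 number exceeds the ss12u1 number.
# A negative value means open64 was faster. (Hypothetical helper.)
pct_slower() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (b - a) / a * 100 }'
}

echo "wupwise_m: $(pct_slower 252 288)%"
echo "apsi_m:    $(pct_slower 158 242)%"
echo "fma3d_m:   $(pct_slower 418 358)%"
```

With the values above this prints 14.3, 53.2 and -14.4 respectively, so the gap on apsi_m is noticeably larger than on the other benchmarks.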
#!/bin/sh
. shrc
CORES="7"
BENCHMARK=medium
CONFIG="3leaf.cfg"
OPTIONS="--noreportable --iterations=1 --config=$CONFIG --size=ref"
CPUTOTAL=$(grep "^processor" /proc/cpuinfo | tail -1 | awk '{print $NF}')
export SUNW_MP_PROCBIND=TRUE
export SUNW_MP_THR_IDLE=SPIN
export OMP_WAIT_POLICY=active
export OMP_STACKSIZE=10M
export OMP_NESTED=FALSE
export OMP_DYNAMIC=TRUE
#export STACKSIZE=16384
#export OMP_NUM_THREADS=16
for core in $CORES
do
    export OMP_NUM_THREADS=${core}
    export SUNW_MP_PROCBIND="$(( CPUTOTAL - core + 1 ))-${CPUTOTAL}"
    runspec ${OPTIONS} ${BENCHMARK}
done
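The binding range in the script is derived from the index of the last processor in /proc/cpuinfo and the thread count. The sketch below isolates just that arithmetic; procbind_range is a hypothetical helper name, and the sample inputs are taken from the cpuinfo excerpt (last processor index 6) and the 7-thread run.

```shell
#!/bin/sh
# procbind_range LAST_CPU_INDEX THREADS
# Reproduces the "first-last" string the run script assigns to
# SUNW_MP_PROCBIND: the THREADS highest-numbered CPUs. (Hypothetical helper.)
procbind_range() {
    last=$1
    threads=$2
    echo "$(( last - threads + 1 ))-${last}"
}

# Last processor index 6, 7 threads: binds to CPUs 0 through 6.
procbind_range 6 7
```

Note that the script's CPUTOTAL holds the last processor index rather than the processor count, which is why the range ends at CPUTOTAL itself.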
On SLES 11, here are the results
open64 (4.2.3):
310.wupwise_m 6000 288 20861*
312.swim_m 6000 443 13539*
314.mgrid_m 7300 481 15188*
316.applu_m 4000 400 9999*
318.galgel_m X
320.equake_m 2600 124 20898*
324.apsi_m 3400 243 13975*
326.gafort_m X
328.fma3d_m 4600 354 12996*
330.art_m 6400 138 46420*
332.ammp_m 7000 665 10522*
Est. SPECompMbase2001 --
Est. SPECompMpeak2001
ss12u1:
310.wupwise_m 6000 252 23831*
312.swim_m 6000 414 14506*
314.mgrid_m 7300 392 18646*
316.applu_m 4000 407 9833*
318.galgel_m 5100 270 18869*
320.equake_m 2600 114 22773*
324.apsi_m 3400 161 21156*
326.gafort_m 8700 537 16201*
328.fma3d_m 4600 420 10953*
330.art_m 6400 133 48227*
332.ammp_m 7000 601 11642*
Est. SPECompMbase2001 --
Est. SPECompMpeak2001
One difference is that Sun's compiler explicitly allows direct CPU binding. I am not sure whether that helps.
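Since neither listing above includes an overall estimate, a geometric mean over the benchmarks that completed with both compilers gives a rough overall comparison. This is a minimal sketch, not a SPEC-compliant metric; the geomean helper is a hypothetical name, and the inputs are the ratio columns of the two SLES 11 listings with galgel and gafort left out.

```shell
#!/bin/sh
# geomean R1 R2 ...
# Geometric mean of the arguments, printed with two decimals.
# (Hypothetical helper, not part of the SPEC tools.)
geomean() {
    printf '%s\n' "$@" |
        awk '{ s += log($1); n++ } END { if (n) printf "%.2f\n", exp(s / n) }'
}

# Ratios for the nine benchmarks that ran under both compilers.
echo "open64: $(geomean 20861 13539 15188 9999 20898 13975 12996 46420 10522)"
echo "ss12u1: $(geomean 23831 14506 18646 9833 22773 21156 10953 48227 11642)"
```

Because two benchmarks are excluded, the result is only comparable between these two partial runs and not with published SPEC OMP scores.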
Hello Jkong,
You can set the CPU binding in the open64 compiler using the "O64_OMP_AFFINITY_MAP" environment variable. For more information, please see page 124 of the "Using the x86 Open64 Compiler Suite" document at the following URL.
http://developer.amd.com/cpu/open64/assets/x86_open64_user_guide.pdf
Can you please update us on your findings after using the CPU bindings?
The galgel failure may be because the slave stack size is not set properly. Can you please set it the following way:
export OMP_SLAVE_STACK_SIZE=22M
Following your suggestion, I set both variables. One problem I had was that
export OMP_SLAVE_STACK_SIZE=22M
was too big. I changed the value to 2M so that wupwise could start. However, galgel still failed to run. Here are the results before the galgel failure:
Success 310.wupwise_m ratio=20785.91, runtime=288.657137
Success 312.swim_m ratio=13342.47, runtime=449.691920
Success 314.mgrid_m ratio=14812.13, runtime=492.839398
It doesn't seem that any improvement was made.
1. Issues with galgel and gafort
A. OMP_SLAVE_STACK_SIZE needs to be set to at least 10 MB; otherwise the runs failed for me. Can you please try setting it to 10 MB or more and check?
B. Please also set the stack ulimit to unlimited, i.e. ulimit -s unlimited.
C. If you still face issues, please send us the content of the error files, OMPM2001/326.gafort_m/run/00000003/gafort.err and galgel.err
2. Performance issue:
A. In our initial measurement, Open64 seems to be ~4% better in the overall geomean, but lags by 10-25% in some individual benchmarks, as you have pointed out.
B. If you get galgel and gafort to work, then please check the overall geomean and let us know whether you see Open64 gaining overall by ~4%.
C. I will be filing a bug report and will let you know once we have resolved this.
Thanks
Santosh
May I have your SPEC OMP configuration, or your compiler options? After changing OMP_SLAVE_STACK_SIZE to 22 MB, both gafort and galgel ran through. However, I got a ~6% slower result. I also tried the -OPT:early_mp option and did not see much difference. The average ratio is 18452 from Open64 with
Thank you,
Jun
Hi Jun,
I used the same flags that you posted here for the open64 compiler, listed below. For Sun, we used the configuration file you provided. We also did the thread binding. Our experiments were on 2 sockets with 4 cores per socket, and OMP threads were set to 8, on a SLES 10 SP2 machine.
FOPTIMIZE = -Ofast -mp -HP -mso -LNO:prefetch=3 -LNO:simd=1 -LNO:vintr=1 -march=barcelona
COPTIMIZE = -Ofast -mp -HP -mso -LNO:prefetch=3 -LNO:simd=1 -LNO:vintr=1 -march=barcelona
EXTRA_LIBS= -I/home/szanjurn/acml/open64_64_int64/include -L/home/szanjurn/acml/open64_64_int64/lib -lacml_mv
I used both numactl and the OMP environment variable for thread binding, in the following way.
export O64_OMP_AFFINITY_MAP="7 6 5 3 4 2 1 0"
bind0=numactl --physcpubind=0,1,2,3,4,5,6,7 -l wupwise_base.opencc swim_base.opencc mgrid_base.opencc applu_base.opencc galgel_base.opencc equake_base.opencc apsi_base.opencc gafort_base.opencc fma3d_base.opencc art_base.opencc ammp_base.opencc
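For a single binary outside the SPEC harness, the same two-level binding can be sketched as below. The cpu_list helper is a hypothetical name used to build the numactl CPU list, the binary path is illustrative, and the command is echoed rather than executed so that the sketch runs without numactl or the benchmark installed.

```shell
#!/bin/sh
# cpu_list N
# Prints "0,1,...,N-1" for use with numactl --physcpubind.
# (Hypothetical helper, not part of numactl or open64.)
cpu_list() {
    n=$1
    i=0
    list=""
    while [ "$i" -lt "$n" ]; do
        list="${list}${list:+,}${i}"
        i=$(( i + 1 ))
    done
    echo "$list"
}

# Thread-to-CPU map read by the open64 OpenMP runtime (value from the post).
export O64_OMP_AFFINITY_MAP="7 6 5 3 4 2 1 0"

# Restrict the process to CPUs 0-7 and allocate memory on the local node.
echo numactl --physcpubind="$(cpu_list 8)" -l ./wupwise_base.opencc
```

Removing the echo would launch the binary under numactl, assuming both are present on the machine.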
Open64 does well for galgel and gafort, do you see similar trends? Can you share the runtime and scores obtained for these two benchmarks? I am assuming the runtime and scores for the rest of the benchmarks are the same as what you had shared before.
We can exchange the config file offline. You will find a private message waiting for you on the left side of the Forum web page, under the "Private Messages" label, when you log in to the forum.
Using your configuration, I ran SPEC OMP again and indeed, as you said, open64 is slightly better than ss12u1 overall. In particular, open64 has a clear edge on galgel, fma3d and ammp.
Comparing the runs between huge pages enabled and disabled: when I run the benchmark with fewer threads and bind them to the same socket, the results are almost the same. When I use 8 sockets, the huge-page-enabled test is 0.24% slower overall, which may be within the fluctuation margin. I am not sure whether huge pages not being NUMA-sensitive plays any role here.
Huge pages are enabled as follows: -HP:heap=2m,bdt=2m
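When comparing huge-page runs, one thing worth recording alongside the flag is how many huge pages the kernel has reserved at run time. The /proc/meminfo dump earlier in the thread showed HugePages_Total: 0, though it is not stated whether that was the state during these runs. A minimal sketch for reading the value; hugepages_total is a hypothetical helper name.

```shell
#!/bin/sh
# hugepages_total [FILE]
# Prints the HugePages_Total value from FILE (default: /proc/meminfo).
# (Hypothetical helper for illustration.)
hugepages_total() {
    awk '/^HugePages_Total:/ { print $2 }' "${1:-/proc/meminfo}"
}

# Only query the live system where /proc/meminfo is readable.
if [ -r /proc/meminfo ]; then
    echo "HugePages_Total: $(hugepages_total)"
fi
```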