I recently purchased a Dell Precision with an AMD Ryzen Threadripper PRO 5975WX. I am not using simultaneous multithreading (SMT), so I have 32 cores. My large scientific program runs considerably faster on the same machine when I compile it with Intel oneAPI than with AOCC 4.0. Is this expected? The program uses OpenMP in certain parts. I am using the following flags with AOCC:
NLIBS = -L${AMD} -lflame -lblis -lfftw3 -lamdlibmfast -lalm
CFLAGS= -Ofast -march=native -mavx2 -fopenmp
and with oneAPI:
NLIBS = -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -I${MKLROOT}/lib/modules
bslib.a ${MKLROOT}/lib/intel64/libmkl_lapack95_lp64.a
CFLAGS= -free -warn all -nogen-interfaces -Ofast -march=SSE4.2,CORE-AVX2 -qopenmp -I${MKLROOT}/include/intel64/lp64
Thanks!
Hi umarsa
Thank you for writing to us.
Since you mentioned that the application has OpenMP enabled, you can use the multithreaded BLIS (libblis-mt.so) instead of the regular BLIS.
To build the multithreaded BLIS, add the following option when configuring: --enable-threading=openmp
You can also add -march=znver3 to your CFLAGS.
Hope this helps to bridge the gap.
Best Regards
Hemanth
Hi, it seems like --enable-threading=openmp is an unsupported option according to flang 4.0.
Hello umarsa
Sorry if I was not clear earlier.
When configuring BLIS, you need to pass the --enable-threading=openmp option to the configure command,
e.g.: ./configure --enable-threading=openmp
You can refer to section 4.1.1.2 of the AOCL user guide to build the multithreaded BLIS
(https://www.amd.com/content/dam/amd/en/documents/pdfs/developer/aocl/aocl-v4.0-ga-user-guide.pdf)
If you are using prebuilt binaries, please try linking libblis-mt.so in place of libblis.so. You can also try adding -march=znver3 to your CFLAGS.
Let me know if you still face issues.
Best Regards
Hemanth
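For the prebuilt-binaries route mentioned above, a minimal sketch of choosing between libblis-mt.so and libblis.so could look like the following. The helper name pick_blis and the default library path are assumptions made for illustration; adjust the path to wherever AOCL is installed.

```shell
# pick_blis DIR: print the BLIS link flag for the AOCL library directory DIR.
# Hypothetical helper: prefers the multithreaded library when it is present.
pick_blis() {
  if [ -e "$1/libblis-mt.so" ]; then
    echo "-lblis-mt"
  else
    echo "-lblis"
  fi
}

# Assumed install location; override by exporting AMD before running.
AMD="${AMD:-/opt/AMD/aocl/aocl-linux-aocc-4.0/lib_LP64}"
echo "NLIBS = -L${AMD} -lflame $(pick_blis "$AMD") -lfftw3 -lamdlibmfast -lalm"
```

The printed line can be pasted into the Makefile in place of the existing NLIBS definition. It only changes which BLIS is linked; the other libraries are copied from the original post.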
Thanks for the response. I tried both suggestions above, but they did not change the timing much. So, for some reason, oneAPI is considerably faster on this processor. It seems to me that their OpenMP implementation is faster, as the program is using less system time.
Hello umarsa
Can you please share the details below so that we can help you better?
AOCC/AOCL version:
OneAPI version:
OS version:
lscpu output:
uname -a output:
Application Name:
Compilation Options:
test/benchmark:
Run Parameters:
Thanks
Hemanth
Sure!
AOCC/AOCL version: 4.0.0/4.0
OneAPI version: 2023.1.0
OS version: Fedora 37 with daily updates
lscpu output:
=============================
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: AuthenticAMD
Model name: AMD Ryzen Threadripper PRO 5975WX 32-Cores
CPU family: 25
Model: 8
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 1
Stepping: 2
Frequency boost: enabled
CPU(s) scaling MHz: 28%
CPU max MHz: 7006.6401
CPU min MHz: 1800.0000
BogoMIPS: 7186.33
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm
Virtualization features:
Virtualization: AMD-V
Caches (sum of all):
L1d: 1 MiB (32 instances)
L1i: 1 MiB (32 instances)
L2: 16 MiB (32 instances)
L3: 128 MiB (4 instances)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-31
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected
===============================
uname -a output:
Linux theory1.phy.vanderbilt.edu 6.2.11-200.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Apr 13 20:07:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Application Name: This is our own scientific code (large code using OpenMP)
Compilation Options AOCC:
CFLAGS= -Ofast -march=znver3 -mavx2 -fopenmp
AMD=/opt/AMD/aocl/aocl-linux-aocc-4.0/lib_LP64
LIBS = -L${AMD} -lflame -lblis-mt -lfftw3_omp -lfftw3 -lamdlibmfast -lalm
Compilation Options OneAPI:
NLIBS = -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -I${MKLROOT}/lib/modules
CFLAGS= -free -warn all -nogen-interfaces -Ofast -march=SSE4.2,CORE-AVX2 -qopenmp -I${MKLROOT}/include/intel64/lp64
test/benchmark:
OneAPI:
real 127m57.030s
user 3561m20.593s
sys 5m59.888s
AOCC/AOCL:
real 139m35.680s
user 3541m40.271s
sys 15m12.349s
Wed Apr 5 06:48:25 PM CDT 2023
Thu Apr 6 07:16:37 AM CDT 2023
Run Parameters: Same for both
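To put the two timings side by side, here is a small sketch that converts the `time` values posted above into seconds and prints how much larger the AOCC figure is. The function names to_seconds and percent_more are hypothetical helpers, not part of any toolchain, and the input format is assumed to be the NNNmSS.SSSs form shown above.

```shell
# Convert a `time` value such as 127m57.030s into seconds.
to_seconds() {
  echo "$1" | awk -F'm' '{ sub(/s$/, "", $2); printf "%.3f\n", $1 * 60 + $2 }'
}

# Print by how many percent the second value exceeds the first.
percent_more() {
  awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
    'BEGIN { printf "%.1f\n", (b / a - 1) * 100 }'
}

percent_more 127m57.030s 139m35.680s   # real: oneAPI vs AOCC
percent_more 5m59.888s 15m12.349s      # sys:  oneAPI vs AOCC
```

With the numbers posted above, the AOCC build takes about 9.1% more wall-clock time and about 153.5% more system time, while the user times are nearly the same.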
Hello umarsa
Thank you for sharing all the information.
1) Please profile the application with AMDuProf.
The user guide: uprof-v4.0-gaGA-user-guide.pdf
2) Profile the builds from both AOCC and the oneAPI compiler, and generate a report for each. You can try the TBP, EBP, and IBS configurations.
3) Compare the data from the two compilers and identify the functions with large gaps.
4) Compare the disassembly in AMDuProf between the AOCC and oneAPI builds and see what differs.
5) This way you can understand which parts of the code need to be optimized and how. Depending on what you find, you can try different compiler options or machine settings, or modify the code using pragma directives.
6) The data points reported by AMDuProf for the various configurations are explained in the AMDuProf User Guide.
Thanks
Hemanth
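Step 3 above could be sketched roughly as follows, assuming the per-function timings from each build have first been exported to a plain "function,seconds" CSV file. That layout, the file names, and the function name compare_profiles are all assumptions for illustration; they are not AMDuProf's actual report format, so the reports would need to be converted first.

```shell
# compare_profiles AOCC.csv ONEAPI.csv
# Each input line is assumed to be "function,seconds" (hypothetical layout).
# Prints "function,gap" with gap = AOCC seconds - oneAPI seconds,
# largest gap first. Functions missing from the AOCC file are skipped.
compare_profiles() {
  awk -F',' '
    NR == FNR { aocc[$1] = $2; next }                       # first file: AOCC
    ($1 in aocc) { printf "%s,%.2f\n", $1, aocc[$1] - $2 }  # second file: oneAPI
  ' "$1" "$2" | sort -t',' -k2,2nr
}

# Usage (hypothetical file names):
#   compare_profiles aocc_functions.csv oneapi_functions.csv | head
```

The functions at the top of the output are the ones where the AOCC build loses the most time, which makes them natural candidates for the disassembly comparison in step 4.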
Hello umarsa
Since we don't have access to the application, we have very limited scope to guide you.
Is there any way that we can assist you?
Best Regards
Hemanth