Server Gurus Discussions

umarsa · ‎04-05-2023

I recently purchased a Dell Precision with an AMD Ryzen Threadripper PRO 5975WX. I am not using multithreading, so I have 32 cores. My large scientific program runs considerably faster on the same machine when I compile it with Intel oneAPI than with AOCC 4.0. Is this expected? The program uses OpenMP in certain parts. I am using the following flags with AOCC:

NLIBS = -L${AMD} -lflame -lblis -lfftw3 -lamdlibmfast -lalm
CFLAGS= -Ofast -march=native -mavx2 -fopenmp

and with oneAPI:

NLIBS = -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -I${MKLROOT}/lib/modules
bslib.a ${MKLROOT}/lib/intel64/libmkl_lapack95_lp64.a
CFLAGS= -free -warn all -nogen-interfaces -Ofast -march=SSE4.2,CORE-AVX2 -qopenmp -I${MKLROOT}/include/intel64/lp64

Thanks!

vtangutu · ‎04-06-2023

Hi umarsa

Thank you for writing to us.
Since you mentioned that the application uses OpenMP, you can use multi-threaded BLIS (libblis-mt.so) instead of the regular BLIS.
To build multi-threaded BLIS, add the following option when configuring: --enable-threading=openmp
You can also add -march=znver3 to your CFLAGS.
Hope this helps to bridge the gap.

Best Regards
Hemanth

umarsa · ‎04-09-2023

Hi, it seems like --enable-threading=openmp is an unsupported option according to flang 4.0.

vtangutu · ‎04-09-2023

hello umarsa

Sorry if I was not clear earlier.
When configuring BLIS, you need to pass the --enable-threading=openmp option to the configure command,
e.g.: ./configure --enable-threading=openmp
You can refer to section 4.1.1.2 of the AOCL user guide to build multi-threaded BLIS
(https://www.amd.com/content/dam/amd/en/documents/pdfs/developer/aocl/aocl-v4.0-ga-user-guide.pdf)
If you are using prebuilt binaries, please try linking libblis-mt.so in place of libblis.so; you can also try adding -march=znver3 to your CFLAGS.
Let me know if you still face issues.

Best Regards
Hemanth

umarsa · ‎04-13-2023

Thanks for the response. I did both of the above, but it did not change the timing much. So, for some reason, oneAPI is considerably faster on this processor. It seems to me that their OpenMP implementation is faster, as the program uses less system time.
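
One way to check the system-time observation more systematically is to time both builds under identical OpenMP settings. The following is a minimal Python sketch, not a tested recipe for this application: `timed_run` is a hypothetical helper, and the binary name in the usage comment is a placeholder. `OMP_NUM_THREADS`, `OMP_PROC_BIND`, and `OMP_PLACES` are standard OpenMP environment variables, but whether changing them narrows the gap here is an assumption that would need to be measured.

```python
import os
import resource
import subprocess
import sys
import time


def timed_run(cmd, extra_env=None):
    """Run cmd and return (real, user, sys) in seconds, like the shell's `time`."""
    env = dict(os.environ)
    env.update(extra_env or {})
    # RUSAGE_CHILDREN accumulates CPU time of child processes that have been waited for.
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True)
    real = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    user = after.ru_utime - before.ru_utime
    system = after.ru_stime - before.ru_stime
    return real, user, system


# Hypothetical usage (binary name is a placeholder):
#   omp = {"OMP_NUM_THREADS": "32", "OMP_PROC_BIND": "close", "OMP_PLACES": "cores"}
#   print(timed_run(["./program_aocc"], omp))
# Harmless demonstration that runs the Python interpreter itself:
demo = timed_run([sys.executable, "-c", "pass"], {"OMP_NUM_THREADS": "32"})
print("real=%.3fs user=%.3fs sys=%.3fs" % demo)
```

Running each build with the same environment dictionary keeps the comparison limited to the compiler and its runtime libraries.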

vtangutu · ‎04-14-2023

Hello umarsa

Can you please share the details below so that we can help you better?

AOCC/AOCL version:
OneAPI version:
OS version:
lscpu output:
uname -a output:
Application Name:
Compilation Options:
test/benchmark:
Run Parameters:

Thanks
Hemanth

umarsa · ‎04-16-2023

Sure!

AOCC/AOCL version: 4.0.0/4.0

OneAPI version: 2023.1.0

OS version: Fedora 37 with daily updates

lscpu output:

=============================

$ lscpu
Architecture:            x86_64
CPU op-mode(s):        32-bit, 64-bit
Address sizes:         48 bits physical, 48 bits virtual
Byte Order:            Little Endian
CPU(s):                  32
On-line CPU(s) list:   0-31
Vendor ID:               AuthenticAMD
Model name:            AMD Ryzen Threadripper PRO 5975WX 32-Cores
   CPU family:          25
   Model:               8
   Thread(s) per core: 1
   Core(s) per socket: 32
   Socket(s):           1
   Stepping:            2
   Frequency boost:     enabled
   CPU(s) scaling MHz: 28%
   CPU max MHz:         7006.6401
   CPU min MHz:         1800.0000
   BogoMIPS:            7186.33
   Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall
                        nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf
                        rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand
                        lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext
                        perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs
                        ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni
                        xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr
                        rdpru wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists
                        pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov
                        succor smca fsrm
Virtualization features:
Virtualization:        AMD-V
Caches (sum of all):
L1d:                   1 MiB (32 instances)
L1i:                   1 MiB (32 instances)
L2:                    16 MiB (32 instances)
L3:                    128 MiB (4 instances)
NUMA:
NUMA node(s):          1
NUMA node0 CPU(s):     0-31
Vulnerabilities:
Itlb multihit:         Not affected
L1tf:                  Not affected
Mds:                   Not affected
Meltdown:              Not affected
Mmio stale data:       Not affected
Retbleed:              Not affected
Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Srbds:                 Not affected
Tsx async abort:       Not affected
===============================

uname -a output:

Linux theory1.phy.vanderbilt.edu 6.2.11-200.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Apr 13 20:07:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Application Name: This is our own scientific code (large code using OpenMP)

Compilation Options AOCC:

CFLAGS= -Ofast -march=znver3 -mavx2 -fopenmp

AMD=/opt/AMD/aocl/aocl-linux-aocc-4.0/lib_LP64

LIBS = -L${AMD} -lflame -lblis-mt -lfftw3_omp -lfftw3 -lamdlibmfast -lalm

Compilation Options OneAPI:

NLIBS = -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -I${MKLROOT}/lib/modules

CFLAGS= -free -warn all  -nogen-interfaces -Ofast -march=SSE4.2,CORE-AVX2 -qopenmp -I${MKLROOT}/include/intel64/lp64

test/benchmark:

OneAPI:

real    127m57.030s
user    3561m20.593s
sys     5m59.888s

AOCC/AOCL:

real    139m35.680s
user    3541m40.271s
sys     15m12.349s
Wed Apr 5 06:48:25 PM CDT 2023
Thu Apr 6 07:16:37 AM CDT 2023

Run Parameters: Same for both
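
To put the two timings side by side, here is a small Python sketch that converts the `time` output above into seconds and derives a few ratios. The helper name `to_seconds` and the derived "busy cores" figure (CPU time divided by wall-clock time) are illustrative choices, not something reported by either toolchain.

```python
import re


def to_seconds(stamp):
    """Convert a shell `time` value such as '127m57.030s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time format: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Values copied from the benchmark output in this post.
runs = {
    "oneAPI": {"real": "127m57.030s", "user": "3561m20.593s", "sys": "5m59.888s"},
    "AOCC/AOCL": {"real": "139m35.680s", "user": "3541m40.271s", "sys": "15m12.349s"},
}

summary = {}
for name, times in runs.items():
    real = to_seconds(times["real"])
    user = to_seconds(times["user"])
    system = to_seconds(times["sys"])
    cpu = user + system
    # Average number of cores kept busy over the whole run.
    summary[name] = {"real": real, "cpu": cpu, "sys": system, "busy_cores": cpu / real}
    print(f"{name:10s} real={real:9.1f}s cpu={cpu:10.1f}s sys={system:7.1f}s "
          f"busy_cores={cpu / real:5.2f}")

slowdown = summary["AOCC/AOCL"]["real"] / summary["oneAPI"]["real"]
sys_ratio = summary["AOCC/AOCL"]["sys"] / summary["oneAPI"]["sys"]
print(f"AOCC wall-clock / oneAPI wall-clock: {slowdown:.3f}")
print(f"AOCC sys time / oneAPI sys time:     {sys_ratio:.2f}")
```

On these numbers the AOCC build takes about 9% longer in wall-clock time and spends roughly 2.5 times as long in system time, while its total CPU time is slightly lower than the oneAPI build's. That may hint at threads being idle more often rather than at slower compute kernels, but user time can include OpenMP spin-waiting, so this reading is tentative.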

vtangutu · ‎04-17-2023

Hello umarsa

Thank you for sharing all the information.

1) Please try profiling the application with AMD uProf.
The user guide for it: uprof-v4.0-gaGA-user-guide.pdf

2) Please profile with both the AOCC and oneAPI compilers and generate a report for each. You can try the TBP, EBP, and IBS configurations.

3) Compare the data from both compilers and identify the functions with large gaps.

4) Compare the disassembly in AMD uProf between the AOCC and oneAPI builds and see what differs.

5) This way you can understand which parts of the code need to be optimized and how. Depending on this, you can try different compiler options or machine settings, or modify the code using pragma directives.

6) The data points reported by AMD uProf for the various configurations are explained in the AMD uProf User Guide.
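
As a rough illustration of step 3, the Python sketch below ranks functions by how much more time they take in one build than in the other. It assumes the per-function times have already been extracted from the two profiler reports into plain dictionaries; the function names and numbers are made up, and `rank_gaps` is a hypothetical helper, not part of AMD uProf.

```python
def rank_gaps(aocc_profile, oneapi_profile, top=5):
    """Return (name, aocc_time, oneapi_time, gap) rows, largest AOCC excess first."""
    names = set(aocc_profile) | set(oneapi_profile)
    rows = []
    for name in names:
        # A function missing from one report is treated as taking 0 s there.
        aocc_time = aocc_profile.get(name, 0.0)
        oneapi_time = oneapi_profile.get(name, 0.0)
        rows.append((name, aocc_time, oneapi_time, aocc_time - oneapi_time))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows[:top]


# Made-up per-function times in seconds, for illustration only.
aocc = {"solve_poisson": 410.0, "apply_h": 950.0, "fft_wrap": 300.0}
oneapi = {"solve_poisson": 400.0, "apply_h": 760.0, "fft_wrap": 310.0}

for name, aocc_time, oneapi_time, gap in rank_gaps(aocc, oneapi):
    print(f"{name:15s} aocc={aocc_time:8.1f}s oneapi={oneapi_time:8.1f}s gap={gap:+8.1f}s")
```

Symbol names can differ between compilers (inlining, name mangling, runtime library functions), so a real comparison may need some manual matching before a ranking like this is meaningful.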

Thanks
Hemanth

vtangutu · ‎05-04-2023

hello umarsa

Since we don't have access to the application, we have very limited scope to guide you.
Is there any way that we can assist you?

Best Regards
Hemanth

aocc versus oneapi