Drivers & Software

j_s · ‎07-31-2023

Hello,

I compile an OpenMP program with: clang -O3 -march=znver2 -mavx2 diffusion_omp.c -fopenmp -o nur_omp -flto -lm (using either clang from AOCC or clang from RHEL). Running on an AMD EPYC 7226, regardless of the number of threads (set with OMP_PROC_BIND=TRUE OMP_PLACES=cores OMP_NUM_THREADS=8 ./nur_omp), the program is much slower than on an old Skylake processor. Without OpenMP, the AMD beats the Skylake as expected. If I combine it with MPI, on Intel (8 tasks, 2 threads) I am 20% faster with threads; on the EPYC I am 3 times slower with threads than without.

Thanks for any hints,

J_s

unnasha3 · ‎08-01-2023

As far as I know, your code is using AVX2 instructions (-march=znver2 -mavx2), which can offer significant performance gains on processors that support them. However, the effectiveness of vectorization depends on how well the code is suited to vector operations and how well the compiler can optimize it. AMD EPYC processors have a different memory hierarchy and memory bandwidth compared to Intel Skylake processors. If your code is highly memory-bound, the performance may be influenced by the memory characteristics of the processor.
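One way to check the memory-bound hypothesis is to measure sustained bandwidth on both machines with a STREAM-style triad kernel. The following is only a minimal sketch: the function names (triad, triad_bytes, gb_per_s) are made up for illustration, the byte count ignores write-allocate traffic, and the timing around the call is left to the caller. The pragma is ignored when the file is compiled without -fopenmp.

```c
#include <assert.h>
#include <stddef.h>

/* STREAM-style triad: a[i] = b[i] + s * c[i].
 * Hypothetical helper, not taken from the original program. */
static void triad(double *a, const double *b, const double *c,
                  double s, size_t n)
{
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] + s * c[i];
}

/* Bytes moved by one triad call: two loads and one store per element.
 * Assumption: write-allocate traffic is ignored. */
static double triad_bytes(size_t n)
{
    return 3.0 * (double)sizeof(double) * (double)n;
}

/* Convert a byte count and an elapsed time in seconds to GB/s. */
static double gb_per_s(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / seconds / 1e9;
}
```

For a meaningful number the arrays should be much larger than the last-level cache, and the call should be repeated several times with the best time kept. If the diffusion code reaches a similar fraction of this bandwidth on both machines, memory characteristics are probably not what separates them.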

j_s · ‎08-01-2023

So far I agree, but I have narrowed it down to this:

Compiled without the -fopenmp option, my code takes 70 sec.

Compiled with -fopenmp and run with 1 thread, it takes 335 sec. So this has nothing to do with AVX or similar; it is just a mess. It only appears with AOCC (4.0.0). gcc (11.3) works fine.
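A comparison like this can be reproduced in isolation with a small timing harness that builds from the same source both with and without -fopenmp. The sketch below is not the original diffusion_omp.c (which was not posted); it assumes a simple 1D explicit diffusion stencil with fixed boundaries, and the names wall_seconds, diffuse_step and time_steps are hypothetical.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Wall-clock time in seconds. Uses omp_get_wtime() only when the file is
 * built with -fopenmp, so the same source also compiles without it. */
static double wall_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
#endif
}

/* One explicit diffusion step on a 1D grid with fixed boundary values:
 * out[i] = in[i] + alpha * (in[i-1] - 2*in[i] + in[i+1]). */
static void diffuse_step(const double *in, double *out, size_t n, double alpha)
{
    assert(n >= 3);
    const size_t last = n - 1;
    out[0] = in[0];
    out[last] = in[last];
    #pragma omp parallel for schedule(static)
    for (size_t i = 1; i < last; ++i)
        out[i] = in[i] + alpha * (in[i - 1] - 2.0 * in[i] + in[i + 1]);
}

/* Run `steps` diffusion steps, swapping the two buffers after each step,
 * and return the elapsed wall-clock seconds. After an even number of
 * steps the result is in the caller's `a`, after an odd number in `b`. */
static double time_steps(double *a, double *b, size_t n, double alpha, int steps)
{
    double t0 = wall_seconds();
    for (int s = 0; s < steps; ++s) {
        diffuse_step(a, b, n, alpha);
        double *tmp = a;
        a = b;
        b = tmp;
    }
    return wall_seconds() - t0;
}
```

Calling time_steps from a small main on a large grid, once from a binary built without -fopenmp and once from a binary built with -fopenmp and run with OMP_NUM_THREADS=1, gives two timings of the same loop. If the second is several times slower with one compiler but not with the other, the slowdown comes from the code generated for the parallel region rather than from thread placement or scheduling.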

OpenMP running too slow