
Drivers & Software

j_s
Journeyman III

OpenMP running too slow

Hello,

I compile an OpenMP program with: clang -O3 -march=znver2 -mavx2 diffusion_omp.c -fopenmp -o nur_omp -flto -lm (using clang from either AOCC or RHEL). Running on an AMD EPYC 7226, irrespective of the number of threads (set with OMP_PROC_BIND=TRUE OMP_PLACES=cores OMP_NUM_THREADS=8 ./nur_omp), the program is much slower than on an old Skylake processor. Without OpenMP, the AMD beats the Skylake as expected. If I combine it with MPI, on Intel (8 tasks, 2 threads) I am 20% faster with threads; on the EPYC I am 3 times slower with threads than without.

Thanks for any hints,

J_s

 

unnasha3
Journeyman III

As far as I know, your code uses AVX2 instructions (-march=znver2 -mavx2), which can offer significant performance gains on processors that support them. However, the effectiveness of vectorization depends on how well the code is suited to vector operations and how well the compiler can optimize it. AMD EPYC processors also have a different memory hierarchy and memory bandwidth than Intel Skylake processors. If your code is highly memory-bound, performance may be influenced by the memory characteristics of the processor.


So far I agree, but I have narrowed it down to this:

Built without the -fopenmp option, my code takes 70 sec.

Built with -fopenmp and run with 1 thread, it takes 335 sec. So this has nothing to do with AVX or the like; it is just a mess. It only appears with AOCC (4.0.0). GCC (11.3) works fine.
