I have a very simple 3D in-place FFT code that uses FFTW with OpenMP multi-threading. I am trying to get the best performance on a Linux machine (Ubuntu, two-socket AMD Genoa CPUs). I built it with the AMD compiler, AOCC 5.0, and AMD-FFTW (built with OpenMP and AVX-512 support) as follows:
clang++ bench_fftw.cpp -o bench_fftw -fopenmp -march=znver4 -O3 -flto -mavx512 -ffast-math -L/opt/AMD/amd-fftw/lib -lfftw3f_omp -lfftw3f -lm -I/opt/AMD/amd-fftw/include
To run it, I typically set:
export OMP_NUM_THREADS=8 # for 1 CCD/NUMA node
export OMP_PLACES=cores # use physical cores only
export OMP_PROC_BIND=close
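To check that these settings pin the threads where intended, a small helper along the following lines could be called once before the FFT runs. This is only a sketch and not part of my benchmark: `report_thread_cpus` is a hypothetical name, it assumes Linux with glibc (for `sched_getcpu`), and it falls back to a single thread if the file is compiled without OpenMP.

```cpp
// Sketch only: report which logical CPU each OpenMP thread is running on.
// report_thread_cpus is a hypothetical helper, not part of the benchmark.
#include <sched.h>   // sched_getcpu (Linux/glibc specific)
#include <cstdio>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

std::vector<int> report_thread_cpus() {
    int nthreads = 1;
#ifdef _OPENMP
    nthreads = omp_get_max_threads();  // normally follows OMP_NUM_THREADS
#endif
    std::vector<int> cpu(nthreads, -1);  // -1 marks a thread that never ran

#pragma omp parallel
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        // Each thread writes only its own slot, so no synchronization is needed.
        if (tid < nthreads) cpu[tid] = sched_getcpu();
    }

    for (int t = 0; t < nthreads; ++t)
        std::printf("thread %d -> cpu %d\n", t, cpu[t]);
    return cpu;
}
```

With `OMP_PLACES=cores` and `OMP_PROC_BIND=close`, I would expect the reported CPUs to be neighbouring cores, but whether they all fall within one CCD depends on how the machine numbers its cores. Runtimes that implement OpenMP 5.0 can print similar information when `OMP_DISPLAY_AFFINITY=TRUE` is set.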
I also have a version that uses the MKL FFT interface, built with the Intel compiler icpx and MKL:
icpx bench_mkl.cpp -o bench_mkl -qopenmp -O3 -Ofast -ffast-math -axCORE-AVX2,CORE-AVX512 -qmkl
The binary built with icpx + MKL performs much better than the one built with AOCC + AMD-FFTW: it is almost twice as fast.
Any advice on how to tune this code on AMD Genoa?
https://stackoverflow.com/questions/79410148/poor-perf-from-aoccamd-fftw-in-linux-with-amd-genoa-cpu...