
General Discussions

readonly
Journeyman III

Poor perf from aocc+amd-fftw in linux with AMD Genoa CPU (compared to built with Intel icpx+mkl)

I have a very simple 3D in-place FFT benchmark that uses FFTW with OpenMP multithreading. I am trying to get the best performance on a Linux machine (Ubuntu, two-socket AMD Genoa). I built it with the AMD compiler, AOCC 5.0, and AMD-FFTW (built with OpenMP and AVX-512 support) as follows:

clang++ bench_fftw.cpp -o bench_fftw -fopenmp -march=znver4 -O3 -flto -mavx512 -ffast-math -L/opt/AMD/amd-fftw/lib -lfftw3f_omp -lfftw3f -lm -I/opt/AMD/amd-fftw/include

 

To run it, I typically set:

export OMP_NUM_THREADS=8   # for 1 CCD/NUMA
export OMP_PLACES=cores    # use only the physical cores
export OMP_PROC_BIND=close
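
The settings above can also be applied per run instead of being exported into the whole shell session. The following is a minimal sketch, not part of the original post: `with_omp_env` is a hypothetical helper name, and the only assumption is a POSIX shell with the standard `env` utility.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): run one command with the
# OpenMP settings listed above, without changing the calling shell's environment.
with_omp_env() {
    threads="$1"   # number of OpenMP threads, e.g. 8 for one CCD
    shift          # the remaining arguments are the command to run
    env OMP_NUM_THREADS="$threads" \
        OMP_PLACES=cores \
        OMP_PROC_BIND=close \
        "$@"
}

# Example usage with the benchmark binary built above:
# with_omp_env 8 ./bench_fftw
```

Keeping the variables scoped to a single command makes it easier to compare several thread counts back to back without one run's settings leaking into the next.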

I also have a version that uses the MKL FFT interface, built with the Intel compiler icpx and MKL:

icpx bench_mkl.cpp -o bench_mkl -qopenmp -O3 -Ofast -ffast-math -axCORE-AVX2,CORE-AVX512 -qmkl

The binary built with icpx+mkl-fft performs much better than the one built with aocc+amd-fftw: it is almost twice as fast.

 

Any advice on how to tune this code on AMD Genoa?

 

https://stackoverflow.com/questions/79410148/poor-perf-from-aoccamd-fftw-in-linux-with-amd-genoa-cpu...

 

  gq

 

0 Likes
3 Replies
ajayrant
Staff

Hi @readonly ,

 

Thanks for writing to the serverguru forum.

Sorry for the delayed response.

We are currently investigating the issue on our end and will keep you updated.

 

Thanks & Regards

Ajay

0 Likes
RookieSideloader
Journeyman III

To optimize FFT performance on AMD Genoa CPUs with AOCC and AMD-FFTW, try adjusting the compiler flags: remove -mavx512 and use -march=native or -mavx2 instead. Use FFTW wisdom for tuning, and check that FFTW is actually using its AVX2 or AVX-512 code paths. Tune the OpenMP settings by experimenting with OMP_PLACES=threads and an appropriate NUMA policy. Consider testing MKL-FFT with AOCC, or rocFFT, for potential gains. Profiling with tools such as perf or AMD uProf can also help identify performance bottlenecks.
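
One way to try the OMP_PLACES suggestion is to run the same binary once per placement setting and compare the timings it reports. This is only a sketch under stated assumptions: `sweep_places` is a hypothetical helper name, the benchmark is assumed to print its own timing, and nothing here is confirmed to fix the reported slowdown.

```shell
#!/bin/sh
# Hypothetical helper (an illustration, not a tested fix): run a command once
# with OMP_PLACES=cores and once with OMP_PLACES=threads, keeping the thread
# count and OMP_PROC_BIND=close fixed, so the two runs can be compared.
sweep_places() {
    threads="$1"   # number of OpenMP threads for every run
    shift          # the remaining arguments are the command to run
    for places in cores threads; do
        # The header goes to stderr so it does not mix with the command's output.
        echo "== OMP_PLACES=$places OMP_NUM_THREADS=$threads ==" >&2
        env OMP_NUM_THREADS="$threads" \
            OMP_PLACES="$places" \
            OMP_PROC_BIND=close \
            "$@"
    done
}

# Example usage with the benchmark binary from the original post:
# sweep_places 8 ./bench_fftw
```

To add a NUMA policy to the same sweep, the command can be wrapped, for example `sweep_places 8 numactl --cpunodebind=0 --membind=0 ./bench_fftw`, assuming numactl is installed and node 0 is the node of interest.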

0 Likes
shrjoshi
Staff

Thank you for sharing the test case, compilation and run steps.

We are able to reproduce the issue at our end.

We are currently working on fixing this issue and will let you know once the fix is available.

0 Likes