We have built a test application that uses OpenCL on an AMD A10-7850K APU. The platform is Linux based and uses an Ubuntu 11.04 distribution with a Xenomai 2.6 patch. The test application computes an FFT of a single 2D 256x256 matrix using the clAmdFft library. However, the average execution time for the FFT is around 900 us (CLFFT_REAL, CLFFT_HERMITIAN_INTERLEAVED), approximately the same as for an Intel i7-4820K using the FFTW library. We are therefore wondering whether others have performance numbers for an FFT implementation on an AMD A10-7850K APU or a comparable platform, and whether the performance we measured can be improved. We are also seeing jitter on the results of up to 10 ms. Does anyone have experience with reducing this number on the described platform? On a similar Intel-based platform we achieve jitter values down to 10 us.
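To make the average and jitter numbers comparable between platforms, it may help to pin down how they are measured. The following is only a minimal sketch in C, assuming that jitter means the peak-to-peak spread (max minus min) over repeated runs and that a monotonic POSIX clock is acceptable. The names timing_stats, measure and dummy_workload are hypothetical; in a real test the callback would enqueue the FFT and wait for it to complete so that the whole round trip is timed.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Summary of repeated timings, all in microseconds. */
typedef struct {
    double mean_us;
    double min_us;
    double max_us;
    double jitter_us;   /* assumption: jitter = max - min (peak to peak) */
} timing_stats;

/* Current monotonic time in microseconds. */
static double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e6 + (double)ts.tv_nsec / 1e3;
}

/* Placeholder workload standing in for the FFT call; counts its invocations. */
static void dummy_workload(void *arg)
{
    unsigned long *calls = arg;
    volatile double acc = 0.0;
    for (int i = 0; i < 100000; ++i)
        acc += i * 0.5;
    ++*calls;
}

/* Run `work` a few times untimed (warm-up), then time `runs` invocations. */
static timing_stats measure(void (*work)(void *), void *arg,
                            size_t warmup, size_t runs)
{
    assert(work != NULL && runs > 0);

    for (size_t i = 0; i < warmup; ++i)
        work(arg);

    timing_stats s = {0.0, 0.0, 0.0, 0.0};
    double sum = 0.0;
    for (size_t i = 0; i < runs; ++i) {
        double t0 = now_us();
        work(arg);
        double dt = now_us() - t0;

        sum += dt;
        if (i == 0 || dt < s.min_us) s.min_us = dt;
        if (i == 0 || dt > s.max_us) s.max_us = dt;
    }
    s.mean_us = sum / (double)runs;
    s.jitter_us = s.max_us - s.min_us;
    return s;
}
```

Calling measure(dummy_workload, &calls, 3, 1000) with an unsigned long counter gives the mean, minimum, maximum and peak-to-peak spread for 1000 timed runs after 3 warm-up runs. Reporting the minimum and maximum next to the average shows whether a 10 ms figure comes from a few outliers or from a generally wide distribution.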
Have you tried different sizes? Maybe launching the kernel is the dominant factor? Have you tried with the traditional driver (without HSA)?
Did you know that ACML 6 now ships with FFTW interfaces? It uses clFFT on the backend for GPU compute. We recently released v6.0.5 which incorporated FFTW speed improvements with zero copy memory.
Hopefully, it's just a recompile of the existing FFTW code that you wrote.
We have tried several sizes; see the results below. For the first two size steps (128 to 256 and 256 to 512) the execution time grows by less than the factor of ~4.5 predicted by the operation count 5*N^2*log2(N^2), whereas for the last step (512 to 1024) the factor is above ~4.5. This suggests that for the small sizes the kernel launch is indeed more dominant and load balancing is harder due to the small problem size.
Size           Execution time
128 x 128        400 us
256 x 256        900 us
512 x 512       3200 us
1024 x 1024    15800 us
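For reference, the scaling argument above can be checked with a few lines of C. This is only a sketch: the helper names are made up for illustration, the operation count is the nominal 5*N^2*log2(N^2) quoted above (a real-to-complex transform needs fewer operations than that), and the sizes are assumed to be powers of two.

```c
#include <assert.h>

/* log2 of a power of two, computed with integer shifts. */
static unsigned ilog2_pow2(unsigned n)
{
    assert(n != 0 && (n & (n - 1)) == 0);
    unsigned bits = 0;
    while (n > 1) {
        n >>= 1;
        ++bits;
    }
    return bits;
}

/* Nominal operation count for an n x n FFT: 5 * n^2 * log2(n^2). */
static double fft2d_flops(unsigned n)
{
    double points = (double)n * (double)n;
    return 5.0 * points * (2.0 * ilog2_pow2(n));   /* log2(n^2) = 2*log2(n) */
}

/* Time ratio expected between two sizes if time scaled with the op count. */
static double expected_ratio(unsigned n_small, unsigned n_large)
{
    return fft2d_flops(n_large) / fft2d_flops(n_small);
}

/* Effective rate in GFLOP/s for a measured time in microseconds. */
static double effective_gflops(unsigned n, double time_us)
{
    assert(time_us > 0.0);
    return fft2d_flops(n) / (time_us * 1e3);   /* flops / (us * 1e3) */
}
```

With the timings listed above, expected_ratio(512, 1024) is about 4.44 while the measured ratio is 15800/3200, about 4.94. For the 256 x 256 case, effective_gflops(256, 900.0) comes out near 5.8 on this nominal count.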
With respect to the traditional driver, we did not make an attempt, since the AMD A10-7850K especially benefits from its GPU computational power and, as I have understood it, the GPU cannot be used without the HSA driver.
Thanks for the tip. However, my current application is not directly portable to the suggested library, so it will take some time to test. Since we are already using the clAmdFft library, do you expect an additional performance gain if we use the ACML library instead?
As far as I know, you can use OpenCL on your APU's GPU without HSA. HSA just brings some new features (and also some bugs, which is why I suggested trying without it).
You mentioned in your original post that you had written a timing program for Intel & FFTW. I was just thinking that you could recompile that to work with ACML; ACML now ships with a FFTW.h file. All you need to do is link in acml_fftw.so.
You mention that you are running on an HSA stack because you are trying to take advantage of HSA features; I assume the shared virtual memory. When you were benchmarking with clAmdFft, did you allocate your buffers in zero-copy memory? That could be a cause of the disappointing real-time performance. The reason that I recommend the ACML 6 acml_fftw library is that all the OpenCL code is hidden behind the FFTW API; our library does the OpenCL state management. In v6.0.5, we allocate the OpenCL buffers with zero-copy semantics, and our internal benchmarks showed a performance uplift. If you decide to try it, let me know if you see better performance.
Btw, whenever I see a post that mentions clAmdFft or clAmdBlas, I like to mention that we open sourced those libraries. You can find them at clMathLibraries.