6 Replies Latest reply on Aug 25, 2014 9:04 PM by kknox

    Disappointing real-time performance OpenCL and HSA

    klopwa

      Hi All,

       

      We have built a test application that uses OpenCL on an AMD A10-7850K APU. The platform is Linux based and uses an Ubuntu 11.04 distribution with a Xenomai 2.6 patch. The test application computes an FFT of a single 2D 256x256 matrix, implemented with the clAmdFft library. However, the average execution time for the FFT is around 900 us (CLFFT_REAL, CLFFT_HERMITIAN_INTERLEAVED), approximately the same as for an Intel i7-4820K using the FFTW library. We are therefore wondering whether others have performance numbers for an FFT implementation on an AMD A10-7850K APU or a comparable platform, and whether the performance we achieved can be improved. We are also seeing jitter on the results of up to 10 ms. Does anyone have experience with reducing this number on the described platform? On a similar Intel-based platform we achieve jitter values down to 10 us.
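For anyone who wants to compare numbers, here is a minimal sketch of how average execution time and jitter could be collected. It is not the code used for the measurements above. The helper name `measure_jitter`, the warm-up count, and the definition of jitter as worst-case minus best-case time are all assumptions made for illustration; `task` stands in for whatever call enqueues the FFT and waits for it to finish.

```python
import statistics
import time


def measure_jitter(task, runs=1000, warmup=10):
    """Time repeated calls of `task` and summarize the spread in microseconds.

    Hypothetical helper: jitter is assumed here to mean max minus min.
    """
    for _ in range(warmup):
        task()  # untimed warm-up calls, since first calls often include setup cost

    samples_us = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        task()
        samples_us.append((time.perf_counter_ns() - start) / 1000.0)

    return {
        "mean_us": statistics.fmean(samples_us),
        "min_us": min(samples_us),
        "max_us": max(samples_us),
        "jitter_us": max(samples_us) - min(samples_us),
    }


if __name__ == "__main__":
    # Placeholder workload; replace with the blocking FFT call under test.
    stats = measure_jitter(lambda: sum(range(10_000)), runs=200)
    for name, value in stats.items():
        print(f"{name}: {value:.1f}")
```

Reporting the minimum and maximum next to the mean makes it easier to see whether a 10 ms figure comes from a few outliers or from a generally wide distribution.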

       

      Best Regards, 

      Wimar

       


        • Re: Disappointing real-time performance OpenCL and HSA
          Ziple

          Have you tried different sizes? Maybe launching the kernel is the dominant factor? Have you tried with the traditional driver (without HSA)?

            • Re: Disappointing real-time performance OpenCL and HSA
              klopwa

              Hi Ziple,

               

              We have tried several sizes; see the results below. Based on the FFT operation count 5*N^2*log2(N^2), each size step should increase the execution time by a factor of ~4.5. For the first three results the factor between size steps is less than ~4.5, whereas for the last step it is above ~4.5. This shows that for the small sizes the kernel launch is indeed more dominant and load balancing is harder due to the small problem size.

               

              Matrix size        APU execution time
              128 x 128          400 us
              256 x 256          900 us
              512 x 512          3200 us
              1024 x 1024        15800 us
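The comparison between the operation-count model and the measured step factors can be reproduced with a short script. This is only a sketch: the function names `fft2d_flops` and `step_ratios` are made up for illustration, and the timings are the ones from the table above.

```python
import math


def fft2d_flops(n):
    """Operation-count model from the post: 5 * N^2 * log2(N^2) for an N x N FFT."""
    points = n * n
    return 5 * points * math.log2(points)


# Timings from the table above (microseconds), keyed by matrix edge length N.
measured_us = {128: 400, 256: 900, 512: 3200, 1024: 15800}


def step_ratios(timings):
    """Return (small, large, model_factor, measured_factor) per size step."""
    sizes = sorted(timings)
    rows = []
    for small, large in zip(sizes, sizes[1:]):
        expected = fft2d_flops(large) / fft2d_flops(small)
        observed = timings[large] / timings[small]
        rows.append((small, large, expected, observed))
    return rows


for small, large, expected, observed in step_ratios(measured_us):
    print(f"{small} -> {large}: model x{expected:.2f}, measured x{observed:.2f}")
```

The model predicts factors of about 4.57, 4.50 and 4.44 for the three steps, while the measured factors are 2.25, 3.56 and 4.94. The first two steps grow more slowly than the model, which is what a fixed per-call overhead would produce.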

               

              We have not tried the traditional driver, since the AMD A10-7850K benefits especially from its GPU compute power and, as I understand it, the GPU cannot be used without the HSA driver.

            • Re: Disappointing real-time performance OpenCL and HSA
              kknox

              Did you know that ACML 6 now ships with FFTW interfaces?  It uses clFFT on the backend for GPU compute.  We recently released v6.0.5 which incorporated FFTW speed improvements with zero copy memory.

              http://devgurus.amd.com/thread/169293

               

              Hopefully, it's just a recompile of the existing FFTW code that you wrote. 

                • Re: Disappointing real-time performance OpenCL and HSA
                  klopwa

                  Hi Kknox,

                   

                  Thanks for the tip. However, my current application is not directly portable to the suggested library, so it will take some time to test. Since we are already using the clAmdFft library, do you expect an additional performance gain if we switch to the ACML library?

                    • Re: Disappointing real-time performance OpenCL and HSA
                      kknox

                      Hi klopwa,

                       

                      You mentioned in your original post that you had written a timing program for Intel & FFTW.  I was just thinking that you could recompile that to work with ACML; ACML now ships with an FFTW.h file.  All you need to do is link in acml_fftw.so.

                       

                      You mention that you are running on an HSA stack because you are trying to take advantage of HSA features; I assume the shared virtual memory.  When you were benchmarking with clAmdFft, did you allocate your buffers in zero-copy memory?  If not, that could be a cause of the disappointing real-time performance.  The reason that I recommend the ACML 6 acml_fftw library is that all the OpenCL code is hidden behind the FFTW API; our library does the OpenCL state management.  In v6.0.5, we allocate the OpenCL buffers with zero-copy semantics and our internal benchmarks showed a performance uplift.  If you decide to try it, let me know if you see better performance.

                       

                      Btw, whenever I see a post mention clAmdFft or clAmdBlas, I like to mention that we open sourced those libraries.  You can find them at clMathLibraries.