12 Replies Latest reply on May 30, 2012 4:25 PM by kbrafford

    Problem with large FFTs using clAmdFft

    sadrian

      I am using the clAmdFftClient-1.6.244 with the -d and -o options to generate an out-of-place kernel with a default FFT size of 1024, and a single file called clAmdFft.kernel.Stockham1.cl is output.  Both the forward and reverse FFTs produce what I expect, and the workgroup size controls the batch (# of FFTs processed per kernel invocation). On a large buffer I measure a throughput of 20000 MB/s (2500 MS/s) on a GPU and a throughput of 560 MB/s (70 MS/s) on 12 CPUs. Except for the low performance on the CPUs, everything seems to be OK with the 1024-point FFT.
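The MB/s and MS/s figures above agree if each sample occupies 8 bytes, which would be one complex value stored as two 32-bit floats. That sample size is an assumption inferred from the ratio of the quoted numbers, not something the post states. A quick check of the arithmetic:

```python
# Assumption: one sample = one complex value stored as two 32-bit floats.
BYTES_PER_SAMPLE = 2 * 4

def mb_to_msamples(mb_per_s, bytes_per_sample=BYTES_PER_SAMPLE):
    """Convert a throughput in MB/s to millions of samples per second."""
    return mb_per_s / bytes_per_sample

gpu_ms = mb_to_msamples(20000)      # GPU figure from the post
cpu_ms = mb_to_msamples(560)        # 12-CPU figure from the post
# Number of 1024-point transforms per second implied by the GPU figure.
gpu_ffts_per_s = gpu_ms * 1e6 / 1024

print(gpu_ms, cpu_ms, gpu_ffts_per_s)
```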

       

      For larger FFTs, starting with 16384, the kernel generator writes two files, clAmdFft.kernel.Stockham2.cl and clAmdFft.kernel.Stockham3.cl. Neither kernel gives the output I expect. I tried operating with one followed by the other in case it is supposed to be a two-stage calculation, but I still did not get a correct answer. Can anyone shed some light on this?
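How clAmdFft actually divides the work between the Stockham2 and Stockham3 kernels is not stated in this thread. As a general illustration of why running two FFT kernels back to back is usually not sufficient, here is a minimal pure-Python sketch of the textbook "four-step" decomposition of a length N = N1*N2 transform (16384 could, for instance, be split as 128 x 128). The two passes of small transforms are separated by a twiddle-factor multiply, and the result is read out in transposed order. The function names and the `twiddle` switch are made up for this example and are not part of any library.

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT, used both as the small transform and as the reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def four_step(x, n1, n2, twiddle=True):
    """Length n1*n2 DFT built from n2 DFTs of length n1 and n1 DFTs of length n2."""
    n = n1 * n2
    assert len(x) == n
    # Pass 1: n2 transforms of length n1 over the strided slices x[j2::n2].
    a = [dft(x[j2::n2]) for j2 in range(n2)]            # a[j2][k1]
    # Between the passes: multiply by the twiddle factors exp(-2*pi*i*j2*k1/n).
    if twiddle:
        a = [[a[j2][k1] * cmath.exp(-2j * cmath.pi * j2 * k1 / n)
              for k1 in range(n1)] for j2 in range(n2)]
    # Pass 2: n1 transforms of length n2, one for each k1.
    b = [dft([a[j2][k1] for j2 in range(n2)]) for k1 in range(n1)]  # b[k1][k2]
    # The output index is k = k1 + n1*k2, i.e. a transposed read-out.
    return [b[k % n1][k // n1] for k in range(n)]

x = [float(v) for v in range(16)]
ref = dft(x)
good = four_step(x, 4, 4)                   # with the twiddle multiply
naive = four_step(x, 4, 4, twiddle=False)   # two passes simply chained
err_good = max(abs(p - q) for p, q in zip(good, ref))
err_naive = max(abs(p - q) for p, q in zip(naive, ref))
print(err_good, err_naive)
```

With the twiddle step the result matches the direct DFT to rounding error; without it the two chained passes compute a 2-D transform instead, which is one plausible way for a hand-driven two-kernel sequence to give wrong answers.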

       

        • Re: Problem with large FFTs using clAmdFft
          bragadeesh

          Hi sadrian,

          Thanks for reporting your experience using our library. I have certain concerns about the way you are using the library.

           

          Could you provide the reason why you want to dump the kernels and use them directly instead of using our library API? The ability to dump the kernels is supported by the library for informational purposes only, and using the kernels directly is not how the library is designed to operate. I hope you went through our reference manual to understand how we would like users to use our library. If you have not read it, please go through the PDF manual that came with the package, particularly section 1.4.

           

          In your application, you would make a series of library API calls as shown below, link against the clAmdFft library, and let the library compute the transform. 'clAmdFftEnqueueTransform' is the function that enqueues the kernels for the FFT computation.

           

          clAmdFftSetup( ... )

          clAmdFftCreateDefaultPlan( ... )

          clAmdFftSetPlanPrecision ( ... )
          clAmdFftSetResultLocation( ... )
          clAmdFftSetLayout( ... )
          clAmdFftSetPlanBatchSize( ... )
          clAmdFftSetPlanInStride( ... )
          clAmdFftSetPlanOutStride( ... )
          clAmdFftSetPlanDistance( ... )
          clAmdFftSetPlanScale( ... )
          clAmdFftBakePlan( ... )

          clAmdFftEnqueueTransform( ... )
          clFinish( ... )

          ...

          clAmdFftDestroyPlan( ... )

          clAmdFftTeardown()

           

          Please let us know if you have trouble understanding any of our API functions. We value feedback and would be happy to improve our documentation to give the best experience for users.

            • Re: Problem with large FFTs using clAmdFft
              sadrian

              Bragadeesh,

               

              Thank you for your response. I have two reasons that I wanted to dump the kernels and use them directly. The first is that my OpenCL experience has so far spanned only the use of PyOpenCL. I have built up an infrastructure that has served rather well for investigation and profiling. I have known that in the future, the development of production software would probably require a migration to C or C++, but that point has not yet come for me and OpenCL.

               

              A second reason is more philosophical and political. Let me first say that I have long been an advocate of OpenCL, if only a recent practitioner, and I am particularly excited about the possibilities surrounding AMD's APUs (especially when I saw that Sandia laboratory is already building a supercomputer to use OpenCL on APUs). That being said, I work where CUDA is the GPU programming environment of choice despite my arguments that OpenCL development should be adopted.

               

              From your response, I see that it was never AMD's intention to support use of pre-generated kernels, though I am guessing that there is no technical reason this has to be the case. The need to link to a proprietary library, however, weakens my case for OpenCL. I can already hear the responses: "If we have to link to a proprietary library anyway, we might as well link to CUDA libraries....."

               

              I guess I will now move on to investigate and profile my Plan B choice, the Apple FFT OpenCL kernel.

              • Re: Problem with large FFTs using clAmdFft
                kbrafford

                >We value feedback

                 

                I think you guys should make it so that you don't have to use 13 API calls every time you want to do a batch of GPU work. 

                 

                I am working on some educational material demonstrating how to use AMD OpenCL with CL/GL interop, and I am using another toolkit for the OpenGL part of it.  The way the fft library is designed to be used makes it difficult (impossible?) to use a different interface to the GPU.  In other words, if the code in my animation loop for OpenGL has to be the same C/C++ program that the AmdFFT code is compiled with then that eliminates a lot of other frameworks out there that make GL development easier.

                  • Re: Problem with large FFTs using clAmdFft
                    kknox

                    Hi kbrafford~

                     

                    You don't have to use 13 API calls every time you wish to process a batch of transforms.  The API clAmdFftCreateDefaultPlan() is named such because it provides what we think are reasonable defaults for all plan state.  Those defaults are documented, and should represent common values, except for state that is domain specific, like # of dimensions and vector lengths.  In addition, once the state is set, it is 'sticky' and you don't need to set it again for the next clAmdFftEnqueueTransform() call.  You can copy plans with clAmdFftCopyPlan() if you wish to create similar plans that are slightly different.  Finally, once clAmdFftBakePlan() is called, the OpenCL kernel is compiled into bytecode, so the developer can control when the kernel compile happens.  The API was designed to support this usage model in a flexible and efficient manner.
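The usage model described above (defaults at creation, sticky state, one bake, many enqueues, cheap copies) can be sketched with a toy Python class. This is only a stand-in to show the call pattern: `ToyPlan` and all of its members are hypothetical, it is not the clAmdFft API, and no FFT is computed.

```python
import copy

class ToyPlan:
    """Toy stand-in for an FFT plan. NOT the clAmdFft API."""

    def __init__(self, length):
        self.length = length   # domain-specific state the caller must supply
        self.batch = 1         # everything else starts from a default
        self.scale = 1.0
        self.baked = False
        self.compiles = 0      # counts the pretend kernel compilations

    def bake(self):
        # The expensive step, done once, at a time the caller chooses.
        if not self.baked:
            self.compiles += 1
            self.baked = True

    def enqueue(self, data):
        # Toy assumption: bake on first use if the caller has not done so.
        self.bake()
        # Placeholder for the transform: only applies the scale factor.
        return [self.scale * v for v in data]

    def copy(self):
        # Start a similar plan from the current state; the copy is
        # treated as unbaked in this toy model.
        clone = copy.copy(self)
        clone.baked = False
        return clone

plan = ToyPlan(1024)
plan.batch = 64          # set once; the state is 'sticky'
plan.bake()              # compile once, up front
for _ in range(100):     # the per-batch cost is a single call
    out = plan.enqueue([1.0, 2.0])

plan2 = plan.copy()      # similar plan with one setting changed
plan2.batch = 128
print(plan.compiles, plan.batch, plan2.batch)
```

In this model the setup calls happen once, and the animation or processing loop only ever calls `enqueue`.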

                     

                    As a reminder, unlike BLAS, there is no standard for FFT interfaces.  FFTW does exist as an interface that enjoys a lot of developer mindshare, and it employs the same concept of 'plans' that clAmdFft uses.  I personally don't feel that the clAmdFft interface is more complex than any other commonly used FFT interface, including FFTW.

                     

                    Please let me know if I am not answering your concerns correctly.