I am using clAmdFftClient-1.6.244 with the -d and -o options to generate an out-of-place kernel with a default FFT size of 1024, and a single file called clAmdFft.kernel.Stockham1.cl is output. Both the forward and reverse FFTs produce what I expect, and the workgroup size controls the batch (the number of FFTs processed per kernel invocation). On a large buffer I measure a throughput of 20000 MB/s (2500 MS/s) on a GPU and a throughput of 560 MB/s (70 MS/s) on 12 CPUs. Except for the low performance on the CPUs, everything seems to be OK with the 1024-point FFT.
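As a quick sanity check on those figures (assuming interleaved complex single-precision samples, i.e. two 4-byte floats or 8 bytes per sample, which is my assumption rather than something stated above), the MB/s and MS/s numbers are mutually consistent:

```python
# Check that the reported MB/s and MS/s throughputs agree, assuming
# interleaved complex single-precision samples (2 floats = 8 bytes/sample).
BYTES_PER_SAMPLE = 2 * 4  # real + imaginary parts, 4 bytes each

def msps(mb_per_s, bytes_per_sample=BYTES_PER_SAMPLE):
    """Convert a throughput in MB/s to megasamples per second."""
    return mb_per_s / bytes_per_sample

print(msps(20000))  # GPU figure: 2500.0 MS/s
print(msps(560))    # CPU figure: 70.0 MS/s
```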
For larger FFTs, starting with 16384, the kernel generator writes two files, clAmdFft.kernel.Stockham2.cl and clAmdFft.kernel.Stockham3.cl. Neither kernel gives the output I expect. I tried operating with one followed by the other in case it is supposed to be a two-stage calculation, but I still did not get a correct answer. Can anyone shed some light on this?
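For what it's worth, one way to decide whether a kernel (or a chained pair of kernels) is computing a correct transform is to compare its output against a CPU reference. This sketch uses numpy as the reference, with numpy's own FFT standing in for the buffer that would be read back from the device; it shows the two identities a correct forward/inverse pair must satisfy:

```python
import numpy as np

# Reference check for a 16384-point transform: a correct forward transform
# must match a trusted CPU implementation, and a correct forward/inverse
# pair must reproduce the input to within floating-point tolerance.
n = 16384
rng = np.random.default_rng(0)
x = (rng.standard_normal(n) + 1j * rng.standard_normal(n)).astype(np.complex64)

# In a real test, `gpu_forward` would be the result buffer read back from
# the device after running the generated kernel(s); here numpy stands in.
gpu_forward = np.fft.fft(x)

# 1) Compare against the CPU reference transform.
assert np.allclose(gpu_forward, np.fft.fft(x), rtol=1e-3)

# 2) Verify the round trip recovers the original input.
roundtrip = np.fft.ifft(gpu_forward)
assert np.allclose(roundtrip, x, rtol=1e-3, atol=1e-4)
print("reference check passed")
```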
Thanks for reporting your experience using our library. I have certain concerns about the way you are using the library.
Could you provide the reason why you want to dump the kernels and use them directly instead of using our library API? The ability to dump the kernels is supported by the library for informational purposes only, and using the kernels directly is not how the library is designed to operate. I hope you went through our reference manual to understand how we would like users to use our library. If you have not read it, please go through the PDF manual that came with the package, particularly section 1.4.
In your application, you would want to make a series of library API calls as shown below, link in the clAmdFft library, and let the library compute the transform. The function 'clAmdFftEnqueueTransform' is the one that enqueues kernels for FFT computation.
clAmdFftSetup( ... )
clAmdFftCreateDefaultPlan( ... )
clAmdFftSetPlanPrecision ( ... )
clAmdFftSetResultLocation( ... )
clAmdFftSetLayout( ... )
clAmdFftSetPlanBatchSize( ... )
clAmdFftSetPlanInStride( ... )
clAmdFftSetPlanOutStride( ... )
clAmdFftSetPlanDistance( ... )
clAmdFftSetPlanScale( ... )
clAmdFftBakePlan( ... )
clAmdFftEnqueueTransform( ... )
clFinish( ... )
clAmdFftDestroyPlan( ... )
Please let us know if you have trouble understanding any of our API functions. We value feedback and would be happy to improve our documentation to give the best experience for users.
Thank you for your response. I have two reasons that I wanted to dump the kernels and use them directly. The first is that my OpenCL experience has so far spanned only the use of PyOpenCL. I have built up an infrastructure that has served rather well for investigation and profiling. I have known that in the future, the development of production software would probably require a migration to C or C++, but that point has not yet come for me and OpenCL.
A second reason is more philosophical and political. Let me first say that I have long been an advocate of OpenCL, if only a recent practitioner, and I am particularly excited about the possibilities surrounding AMD's APUs (especially when I saw that Sandia laboratory is already building a supercomputer to use OpenCL on APUs). That being said, I work where CUDA is the GPU programming environment of choice despite my arguments that OpenCL development should be adopted.
From your response, I see that it was never AMD's intention to support use of pre-generated kernels, though I am guessing that there is no technical reason this has to be the case. The need to link against a proprietary library, however, weakens my case for OpenCL. I can already hear the responses: "If we have to link to a proprietary library anyway, we might as well link to CUDA libraries....."
I guess I will now move on to investigate and profile my Plan B choice, the Apple FFT OpenCL kernel.
>The first is that my OpenCL experience has so far spanned only the use of PyOpenCL
We've got the AMD library working with PyOpenCL now. Here's the project page:
If you need any help building the module, you can post something here on this board, or you can take it over to the PyOpenCL mailing list:
>We value feedback
I think you guys should make it so that you don't have to use 13 API calls every time you want to do a batch of GPU work.
I am working on some educational material demonstrating how to use AMD OpenCL with CL/GL interop, and I am using another toolkit for the OpenGL part of it. The way the FFT library is designed to be used makes it difficult (impossible?) to use a different interface to the GPU. In other words, if the code in my animation loop for OpenGL has to live in the same C/C++ program that the AmdFFT code is compiled with, then that eliminates a lot of the other frameworks out there that make GL development easier.
You don't have to use 13 API calls every time you wish to process a batch of transforms. The API clAmdFftCreateDefaultPlan() is named such because it provides what we think are reasonable defaults for all plan state. Those defaults are documented, and should represent common values, except for state that is domain specific, like # of dimensions and vector lengths. In addition, once the state is set, it is 'sticky' and you don't need to set it again for the next clAmdFftEnqueueTransform() call. You can copy plans with clAmdFftCopyPlan() if you wish to create similar plans that are slightly different. Finally, once clAmdFftBakePlan() is called, the OpenCL kernel is compiled into bytecode, so the developer can control when the kernel compile happens. The API was designed to support this usage model in a flexible and efficient manner.
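To illustrate that usage model, here is a plain-Python analogy of the plan pattern described above. This is emphatically not the clAmdFft API itself (all names below are made up for illustration); it only shows the lifecycle: defaults at creation, sticky state, a one-time bake, and many enqueues against the same plan:

```python
# Plain-Python analogy of a plan-based API: state is set once, stays
# "sticky", is baked (compiled) once, and then many transforms are
# enqueued against the same plan with no further setup calls.
class Plan:
    def __init__(self, length):
        # Analogue of CreateDefaultPlan: reasonable defaults for
        # everything except the domain-specific transform length.
        self.length = length
        self.precision = "single"
        self.batch = 1
        self.baked = False

    def set_batch(self, batch):
        # Analogue of SetPlanBatchSize; changing state invalidates the bake.
        self.batch = batch
        self.baked = False

    def bake(self):
        # Analogue of BakePlan: the developer controls when compilation
        # happens, instead of paying for it on the first enqueue.
        self.baked = True

    def enqueue(self):
        # Analogue of EnqueueTransform: reuses the baked plan as-is.
        if not self.baked:
            self.bake()  # lazy bake if the caller skipped it
        return [f"fft({self.length}) on batch element {i}"
                for i in range(self.batch)]

plan = Plan(1024)
plan.set_batch(4)
plan.bake()                 # compile once, up front
for _ in range(3):          # repeated enqueues need no further setup
    results = plan.enqueue()
print(len(results))  # 4
```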
As a reminder, unlike BLAS, there is no standard for FFT interfaces. FFTW does exist as an interface that enjoys a lot of developer mindshare, and it employs the same concept of 'plans' that clAmdFft uses. I personally don't feel that the clAmdFft interface is more complex than any other commonly used FFT interface, including FFTW.
Please let me know if I am not answering your concerns correctly.
Ah, I gotcha. I understand how the interface is designed to give you a rich set of options and clearly indicate how features will be added as they are ready. But I didn't mean to emphasize the number of API calls as my main stumbling block.
Is it possible to use the Amd CL FFT library with an application that isn't a C/C++ executable compiled and linked with the library at compile time? I would like to be able to use the AMD FFT library to make a kernel that I can use with the Python OpenCL interface that I am using.
Like the original poster sadrian, I am using PyOpenCL to perform some calculations, and I am trying to use the Pyglet interface to OpenGL, which can work nicely with OpenCL. See for instance:
We unfortunately do not have a python interface to our libraries yet. Since our libraries do export a clean “C” interface, I do believe that it’s possible to create python wrappers to our libraries, but we haven’t had the resources to investigate how to properly do this.
I started a discussion about it on the PyOpenCL forum, and found another guy interested in helping out:
Python wrapper status update:
It's working! For people who are comfortable building Python modules with the Cython utility, it's ready to use:
There are some more things I want to add to the project before we declare that "it's ready" for general consumption:
1) Add a pre-built binary to the repository somehow. I am not sure how geggo (the originator of the project) wants to do that, so I'm not sure how much longer it will be until he thinks we're ready for that step.
2) Add more sample programs showing: a) that it works and b) how to use the library from Python
3) Add at least one sample program that demonstrates properly functioning OpenGL context sharing.
4) Port at least one AMD sample project, perhaps ObjectDetection, to the Python interface as a means of showing other people how to tackle converting other samples. If I can get that working then it means that the common base class for FFT sample programs is already ported and thus other samples should be easy to implement. (right?)
That said, any pythonista who feels comfortable compiling his own extension module from the reasonably well-documented procedure in the git repo should feel free to go ahead and give it a try.