Hello AMD,
I am using your clAmdFft APP library to do a 32K FFT. Unfortunately, performance is not as good as with Apple's OpenCL FFT implementation (http://developer.apple.com/library/mac/#samplecode/OpenCL_FFT/Introduction/Intro.html).
I am using the default settings / default parameters to call your API functions. Is there a way to speed things up a bit? Which settings would you recommend?
cheers,
F.
The developers are aware that the performance of release 1.0 of the clAmdFft library isn't as good as other implementations, and they are focusing on improving performance in the next release. Apple's version has been around longer and is a little more mature.
In the meantime, I can only offer general suggestions.
*) Be sure to "bake" the plan ahead of time, and re-use the plan. Kernel generation is an expensive operation, and you don't want to do it redundantly. If your benchmark includes kernel generation time, that will skew your timing results.
*) Use large batch sizes as much as possible. Doing a single 32K FFT is not nearly as efficient as doing a number of them in parallel. If you're only doing FFTs one at a time, most of the compute units in your GPU will be sitting idle with nothing to do.
Hi Frodo~
May I ask what systems you are interested in? OS, bitness, and CPUs and GPUs you would like to target?
What are your typical workloads?
Originally posted by: kknox
May I ask what systems you are interested in? OS, bitness, and CPUs and GPUs you would like to target? What are your typical workloads?
- It's running on HD 6970 GPU
- Windows7 x64
- Windows XP 32 would be great too, but I found that the current driver is not really stable on that OS (with OpenCL)
- The FFTs are size 32K with 10+ batch size
OK, this is basically my code. Of course the FFT plan is pre-baked and the batch size is as large as possible.
Is there a way to speed up FFT and iFFT for 32k FFTs?
clAmdFftPlanHandle plHandle_forward;
clAmdFftDim dim = CLFFT_1D;
size_t clLengths[ 3 ] = {32768, 1, 1};

clAmdFftCreateDefaultPlan( &plHandle_forward, context, dim, clLengths );
clAmdFftSetPlanBatchSize( plHandle_forward, 16 );
clAmdFftBakePlan( plHandle_forward, 1, &cq, NULL, NULL );
...
// loop to do the FFTs
clAmdFftEnqueueTransform( plHandle_forward, CLFFT_FORWARD, 1, &cq, 0, NULL, NULL, &data, NULL, NULL);
...
Originally posted by: FrodoTheGiant
OK this is basically my code. Of course FFT plan is pre-baked and batch size is as large as possible.
Is there a way to speed up FFT and iFFT for 32k FFTs?
Could you please send performance numbers for both?
Did you include BakePlan also in your timing?
Originally posted by: genaganna
Could you please send performance numbers for both?
Did you include BakePlan also in your timing?
I pre-baked the plan. I measured only the time for the FFT itself, i.e. clAmdFftEnqueueTransform().
AMD's code is about half as fast as Apple's code.
What I've tried:
1) clAmdFftSetPlanPrecision( plHandle_forward, CLFFT_SINGLE_FAST ); It seems it doesn't have any effect at all. The code runs at exactly the same speed.
2) Using a pre-defined tmp-buffer gave a speedup of about 10%
clAmdFftEnqueueTransform( plHandle_inverse, CLFFT_BACKWARD, 1, &queue, 0, NULL, NULL, &gpu_in, NULL, tmp_buff);
Is there anything else I could do to speed up AMD's code?
Hey,
1) Frodo, can you please post some figures for how long the FFT is taking?
2) Can you please post the code you used to time the FFT?
I have tested clAmdFft with a batch size of 64 and FFTs of length 2^19, iterated 1000 times. That runs about twice as slow as the CUDA FFT library.
Please find the code I used for timing attached.
output:
ans=21000
(time_fft_end-time_fft_start)=47.265031
Note: The results are on a single 6950 card.
double time_fft_start=omp_get_wtime();
long long timestart,timeend,ans=0;
for(i=0;i<ITERATION;i++){
    ret=clAmdFftEnqueueTransform(plHandle,CLFFT_FORWARD,1,&queue,0,NULL,&event,&clMemBuffersIn,&BuffersOut,clMedBuffer );
    ret=clWaitForEvents(1, &event);
    // profiling timestamps are in nanoseconds
    ret=clGetEventProfilingInfo (event,CL_PROFILING_COMMAND_START,sizeof(long long),&timestart,NULL);
    ret=clGetEventProfilingInfo (event,CL_PROFILING_COMMAND_END,sizeof(long long),&timeend,NULL);
    ans+=(timeend-timestart)/1000000; // ns -> whole ms (integer division)
}
cout << ans << endl;
double time_fft_end=omp_get_wtime();
cout << time_fft_end-time_fft_start << endl;