
FrodoTheGiant
Journeyman III

clAmdFft : Optimum setup for 32K FFT ?

Hello AMD,

I am using your clAmdFft APP library to do a 32K FFT. Unfortunately, performance is not as good as with Apple's OpenCL FFT implementation (http://developer.apple.com/library/mac/#samplecode/OpenCL_FFT/Introduction/Intro.html).

I am using the default settings / default parameters to call your API functions. Is there a way to speed things up a bit? Which settings would you recommend?

 

cheers,

F.

 

 

0 Likes
7 Replies
DieInSente
Journeyman III

The developers are aware that the performance of release 1.0 of the clAmdFft library isn't as good as other implementations, and they are focusing on improving performance in the next release.   Apple's version has been around longer and is a little more mature.

In the meantime, I can only offer general suggestions.

*)   Be sure to "bake" the plan ahead of time, and re-use the plan.  Kernel generation is an expensive operation, and you don't want to do it redundantly.  If your benchmark includes kernel generation time, that will skew your timing results.

*)   Use large batch sizes as much as possible.  Doing a single 32K FFT is not nearly as efficient as doing a number of them in parallel.  If you're only doing FFTs one at a time, most of the compute units in your GPU will be sitting idle with nothing to do.
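
To make the first suggestion concrete, here is a minimal sketch of a benchmark that keeps the one-time setup cost apart from the steady-state cost per call. It deliberately does not call clAmdFft: `benchmark`, `BenchResult`, `setup` and `run` are hypothetical names introduced for illustration, with `setup` standing in for plan creation plus baking and `run` standing in for one enqueue followed by a blocking wait.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// One-time setup cost kept apart from the steady-state cost per call.
struct BenchResult {
    double setup_ms;     // time spent in the one-time setup step
    double per_call_ms;  // mean wall-clock time of one steady-state call
};

// Times `setup` once, then times `iterations` calls of `run`.
// Both callables are hypothetical stand-ins: `setup` for creating and baking
// a plan, `run` for enqueueing a transform and waiting until it has finished.
BenchResult benchmark(const std::function<void()>& setup,
                      const std::function<void()>& run,
                      int iterations) {
    assert(iterations > 0);
    using clock = std::chrono::steady_clock;
    using ms = std::chrono::duration<double, std::milli>;

    auto t0 = clock::now();
    setup();                              // executed exactly once, timed separately
    auto t1 = clock::now();
    for (int i = 0; i < iterations; ++i)  // only this loop counts as FFT time
        run();
    auto t2 = clock::now();

    return { ms(t1 - t0).count(), ms(t2 - t1).count() / iterations };
}
```

In a real measurement `run` has to block until the device has finished (for example by waiting on the event returned by the enqueue call), otherwise the loop only measures how fast work can be queued.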

 

0 Likes
kknox
Staff

Hi Frodo~

May I ask what systems you are interested in?  OS, bitness, and
CPU's and GPU's you would like to target?

What are your typical workloads?

0 Likes

Originally posted by: kknox

May I ask what systems you are interested in?  OS, bitness, and CPU's and GPU's you would like to target?  What are your typical workloads?


- It's running on HD 6970 GPU

- Windows7 x64

- Windows XP 32 would be great too, but I found that the current driver is not really stable on that OS (with OpenCL)

- The FFTs are size 32K with 10+ batch size

 

0 Likes

OK this is basically my code. Of course FFT plan is pre-baked and batch size is as large as possible.

Is there a way to speed up FFT and iFFT for 32k FFTs?

clAmdFftPlanHandle plHandle_forward;
clAmdFftDim dim = CLFFT_1D;
size_t clLengths[ 3 ] = {32768, 1, 1};

clAmdFftCreateDefaultPlan( &plHandle_forward, context, dim, clLengths );
clAmdFftSetPlanBatchSize( plHandle_forward, 16 );
clAmdFftBakePlan( plHandle_forward, 1, &cq, NULL, NULL );
...
// loop to do the FFTs
clAmdFftEnqueueTransform( plHandle_forward, CLFFT_FORWARD, 1, &cq, 0, NULL, NULL, &data, NULL, NULL);
...

0 Likes

Originally posted by: FrodoTheGiant OK this is basically my code. Of course FFT plan is pre-baked and batch size is as large as possible.

 

Is there a way to speed up FFT and iFFT for 32k FFTs?

 

Could you please send performance numbers for both?

Did you include BakePlan also in your timing?

 

0 Likes

Originally posted by: genaganna

Could you please send performance numbers for both?

Did you include BakePlan also in your timing?

 

 

I pre-baked the plan. I measured only the time for the FFT itself, i.e. clAmdFftEnqueueTransform().

 

AMD's code is about half as fast as Apple's code.

What I've tried:

1) clAmdFftSetPlanPrecision( plHandle_forward, CLFFT_SINGLE_FAST); It doesn't seem to have any effect at all. The code runs at exactly the same speed.

2) Using a pre-defined tmp-buffer gave a speedup of about 10%

clAmdFftEnqueueTransform( plHandle_inverse, CLFFT_BACKWARD, 1, &queue, 0, NULL, NULL, &gpu_in, NULL, tmp_buff);

 

Is there anything else I could do to speed up AMD's code?

0 Likes

Hey,

1) Frodo, can you please post some figures for how long the FFT is taking?

2) Can you please post the code you used to time the FFT?

I have tested clAmdFft with a batch size of 64 and FFTs of length 2^19, iterated 1000 times. That gives about half the performance of the CUDA FFT library.

Please find my timing code attached.

output:
ans=21000

(time_fft_end-time_fft_start)=47.265031

 

Note: The results are on a single 6950 card.

 

double time_fft_start=omp_get_wtime();
long long timestart,timeend,ans=0;
for(i=0;i<ITERATION;i++){
    ret=clAmdFftEnqueueTransform(plHandle,CLFFT_FORWARD,1,&queue,0,NULL,&event,&clMemBuffersIn,&BuffersOut,clMedBuffer );
    ret=clWaitForEvents(1, &event);
    // device-side start/end timestamps of the event, in nanoseconds
    ret=clGetEventProfilingInfo (event,CL_PROFILING_COMMAND_START,sizeof(long long),&timestart,NULL);
    ret=clGetEventProfilingInfo (event,CL_PROFILING_COMMAND_END,sizeof(long long),&timeend,NULL);
    ans+=(timeend-timestart)/1000000; // ns -> ms; the integer division truncates each sample
}
cout << ans << endl;
double time_fft_end=omp_get_wtime();
cout << time_fft_end-time_fft_start << endl;
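
To compare these figures with other libraries it can help to convert them into a throughput number. The sketch below uses the conventional estimate of 5 * N * log2(N) floating-point operations for one complex FFT of length N. The helper names `fft_flops` and `fft_gflops` are made up for illustration, and the sketch assumes that each enqueue processes the whole batch of 64 transforms and that ITERATION is 1000, as stated in the post.

```cpp
#include <cassert>
#include <cmath>

// Conventional flop estimate for one complex FFT of length n: 5 * n * log2(n).
double fft_flops(double n) {
    return 5.0 * n * std::log2(n);
}

// Throughput in GFLOP/s for `batch` transforms of length n that
// together complete in `seconds`.
double fft_gflops(double n, double batch, double seconds) {
    assert(seconds > 0.0);
    return fft_flops(n) * batch / seconds / 1e9;
}
```

With the numbers above, the event-profiled total of 21000 ms over 1000 iterations is about 21 ms per enqueue, so `fft_gflops(524288, 64, 0.021)` comes out at roughly 152 GFLOP/s. The wall-clock total of 47.265031 s is about 47.3 ms per iteration, or roughly 67 GFLOP/s. The gap between the two suggests that a good part of the wall-clock time is spent outside the profiled device interval, for example in the blocking wait and the profiling queries of each iteration, though the post does not confirm where it goes.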

0 Likes