7 Replies Latest reply on Jun 7, 2011 6:58 PM by divij

    clAmdFft : Optimum setup for 32K FFT ?

    FrodoTheGiant

      Hello AMD,

      I am using your clAmdFft APP library to do a 32K FFT. Unfortunately, performance is not as good as with Apple's OpenCL FFT implementation (http://developer.apple.com/library/mac/#samplecode/OpenCL_FFT/Introduction/Intro.html).

      I am using the default settings / default parameters to call your API functions. Is there a way to speed things up a bit? Which settings would you recommend?

       

      cheers,

      F.

       

       

        • clAmdFft : Optimum setup for 32K FFT ?
          DieInSente

          The developers are aware that the performance of release 1.0 of the clAmdFft library isn't as good as other implementations, and they are focusing on improving performance in the next release.   Apple's version has been around longer and is a little more mature.

          In the meantime, I can only offer general suggestions.

          *)   Be sure to "bake" the plan ahead of time, and re-use the plan.  Kernel generation is an expensive operation, and you don't want to do it redundantly.  If your benchmark includes kernel generation time, that will skew your timing results.

          *)   Use large batch sizes as much as possible.  Doing a single 32K FFT is not nearly as efficient as doing a number of them in parallel.  If you're only doing FFTs one at a time, most of the compute units in your GPU will be sitting idle with nothing to do.

           

          • clAmdFft : Optimum setup for 32K FFT ?
            kknox

            Hi Frodo~

            May I ask what systems you are interested in?  OS, bitness, and
            CPU's and GPU's you would like to target?

            What are your typical workloads?

              • clAmdFft : Optimum setup for 32K FFT ?
                FrodoTheGiant

                 

                Originally posted by: kknox

                May I ask what systems you are interested in?  OS, bitness, and CPU's and GPU's you would like to target?  What are your typical workloads?


                - It's running on HD 6970 GPU

                - Windows7 x64

                - Windows XP 32 would be great too, but I found that the current driver is not really stable on that OS (with OpenCL)

                - The FFTs are size 32K with 10+ batch size

                 

                  • clAmdFft : Optimum setup for 32K FFT ?
                    FrodoTheGiant

                    OK this is basically my code. Of course FFT plan is pre-baked and batch size is as large as possible.

                    Is there a way to speed up FFT and iFFT for 32k FFTs?

                    clAmdFftPlanHandle plHandle_forward;
                    clAmdFftDim dim = CLFFT_1D;
                    size_t clLengths[ 3 ] = {32768, 1, 1};

                    clAmdFftCreateDefaultPlan( &plHandle_forward, context, dim, clLengths );
                    clAmdFftSetPlanBatchSize( plHandle_forward, 16 );
                    clAmdFftBakePlan( plHandle_forward, 1, &cq, NULL, NULL );
                    ...
                    // loop to do the FFTs
                    clAmdFftEnqueueTransform( plHandle_forward, CLFFT_FORWARD, 1, &cq, 0, NULL, NULL, &data, NULL, NULL);
                    ...
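For sizing the `data` buffer that goes with the plan above, the arithmetic can be sketched as follows. This assumes the default plan is single precision with interleaved complex values transformed in place, which is worth checking against the clAmdFft documentation. `fft_buffer_bytes` is a hypothetical helper, not part of the library.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: bytes needed for one batch of 1-D complex FFTs.
// Assumption: single precision, interleaved (re, im) pairs, transformed in place.
std::size_t fft_buffer_bytes(std::size_t fft_length, std::size_t batch_size) {
    assert(fft_length > 0 && batch_size > 0);
    const std::size_t bytes_per_complex = 2 * sizeof(float);  // real + imaginary part
    return fft_length * batch_size * bytes_per_complex;
}
```

With the values from the post, `fft_buffer_bytes(32768, 16)` comes to 4,194,304 bytes (4 MiB) on platforms with a 4-byte `float`.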

                      • clAmdFft : Optimum setup for 32K FFT ?
                        genaganna

                         

                        Originally posted by: FrodoTheGiant

                        OK this is basically my code. Of course FFT plan is pre-baked and batch size is as large as possible.

                         

                        Is there a way to speed up FFT and iFFT for 32k FFTs?

                         

                        Could you please send performance numbers for both?

                        Did you include BakePlan also in your timing?

                         

                          • clAmdFft : Optimum setup for 32K FFT ?
                            FrodoTheGiant

                             

                            Originally posted by: genaganna

                            Could you please send performance numbers for both?

                            Did you include BakePlan also in your timing?

                             

                             

                            I pre-baked the plan. I measured only the time for the FFT itself, i.e. clAmdFftEnqueueTransform().

                             

                            AMD's code is about half as fast as Apple's code.

                            What I've tried:

                            1) clAmdFftSetPlanPrecision( plHandle_forward, CLFFT_SINGLE_FAST); It seems it doesn't have any effect at all. The code runs at exactly the same speed.

                            2) Using a pre-defined tmp-buffer gave a speedup of about 10%

                            clAmdFftEnqueueTransform( plHandle_inverse, CLFFT_BACKWARD, 1, &queue, 0, NULL, NULL, &gpu_in, NULL, tmp_buff);

                             

                            Is there anything else I could do to speed up AMD's code?

                              • clAmdFft : Optimum setup for 32K FFT ?
                                divij

                                Hey,

                                1) Frodo, can you please post some figures for the time the FFT is taking?

                                2) Can you please post the code showing how you timed the FFT?

                                I have tested clAmdFft with a batch size of 64; the FFTs have length 2^19, iterated 1000 times. That runs about twice as slow as the CUDA FFT library.

                                Please find the code attached for my timing purposes.

                                output:
                                ans=21000

                                (time_fft_end-time_fft_start)=47.265031

                                 

                                Note: The results are on a single 6950 card.
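One way to read the two figures above is sketched below. It assumes that `ans` is the summed device time in milliseconds (the attached loop divides nanosecond profiling timestamps by 1,000,000), that the second figure is wall-clock seconds from `omp_get_wtime()`, and that the loop ran 1000 iterations with a batch size of 64 as described in the post. `FftTiming` and `summarize` are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>

// Hypothetical summary of a batched-FFT timing run.
struct FftTiming {
    double device_ms_per_batch;  // device (GPU) time per enqueued batch
    double device_ms_per_fft;    // device time per individual FFT in the batch
    double wall_ms_per_batch;    // host wall-clock time per loop iteration
    double device_fraction;      // share of wall-clock time spent executing on the device
};

FftTiming summarize(double device_total_ms, double wall_total_s,
                    int iterations, int batch_size) {
    assert(iterations > 0 && batch_size > 0 && wall_total_s > 0.0);
    FftTiming t;
    t.device_ms_per_batch = device_total_ms / iterations;
    t.device_ms_per_fft = t.device_ms_per_batch / batch_size;
    t.wall_ms_per_batch = wall_total_s * 1000.0 / iterations;
    t.device_fraction = device_total_ms / (wall_total_s * 1000.0);
    return t;
}
```

Under those assumptions, `summarize(21000.0, 47.265031, 1000, 64)` gives about 21 ms of device time per batch, roughly 0.33 ms per 2^19-point FFT, and about 47.3 ms of wall-clock time per iteration. Device execution would then account for a little under half of the wall-clock time, with the rest spent on the host side (enqueueing, waiting on the event, and querying profiling data).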

                                 

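                                double time_fft_start=omp_get_wtime();
                                cl_ulong timestart,timeend;   // profiling timestamps are cl_ulong values in nanoseconds
                                long long ans=0;              // accumulated device time in whole milliseconds
                                for(i=0;i<ITERATION;i++){
                                    ret=clAmdFftEnqueueTransform(plHandle,CLFFT_FORWARD,1,&queue,0,NULL,&event,&clMemBuffersIn,&BuffersOut,clMedBuffer );
                                    ret=clWaitForEvents(1, &event);
                                    // profiling queries need a queue created with CL_QUEUE_PROFILING_ENABLE
                                    ret=clGetEventProfilingInfo (event,CL_PROFILING_COMMAND_START,sizeof(cl_ulong),&timestart,NULL);
                                    ret=clGetEventProfilingInfo (event,CL_PROFILING_COMMAND_END,sizeof(cl_ulong),&timeend,NULL);
                                    ans+=(timeend-timestart)/1000000;   // ns -> ms, truncated on every iteration
                                    clReleaseEvent(event);              // release the event returned by this enqueue
                                }
                                cout << ans << endl;
                                double time_fft_end=omp_get_wtime();
                                cout << time_fft_end-time_fft_start << endl;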
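A side note on the loop above: dividing each event's duration by 1,000,000 before adding it to `ans` discards the sub-millisecond remainder on every iteration, so the printed total can under-report device time by up to 1 ms per iteration. The sketch below contrasts that with accumulating nanoseconds and converting once. The function names and any sample durations fed to them are hypothetical, not measurements from this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Mirrors the loop above: truncate each duration to whole milliseconds, then sum.
std::uint64_t total_ms_truncating_each(const std::vector<std::uint64_t>& durations_ns) {
    std::uint64_t total_ms = 0;
    for (std::uint64_t d : durations_ns) total_ms += d / 1000000;
    return total_ms;
}

// Alternative: sum the raw nanoseconds and convert to milliseconds once at the end.
double total_ms_accumulating_ns(const std::vector<std::uint64_t>& durations_ns) {
    std::uint64_t total_ns = 0;
    for (std::uint64_t d : durations_ns) total_ns += d;
    return static_cast<double>(total_ns) / 1.0e6;
}

// Device time lost to per-iteration truncation, in milliseconds.
double truncation_loss_ms(const std::vector<std::uint64_t>& durations_ns) {
    const double exact = total_ms_accumulating_ns(durations_ns);
    const double truncated = static_cast<double>(total_ms_truncating_each(durations_ns));
    assert(exact >= truncated);  // truncation can only under-report
    return exact - truncated;
}
```

For example, 1000 hypothetical events of 21.9 ms each would sum to 21000 ms with per-iteration truncation but 21900 ms when accumulated in nanoseconds, a difference of 900 ms.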