Hey folks,

I was testing some OpenCL code with CodeXL that includes 3 different clAMDFFT plans which are:

1) 4096 point Real to Complex Planar Forward FFT, out of place, done 1024 times

2) 4096 point Complex to Real Planar Inverse FFT, out of place, done 1024 times

3) 4096 point Complex to Complex Planar FFT, in place, done 1024 times

and I get some really long times on an ATI 7970. The times are as follows:

1) ~22ms

2) ~8ms

3) ~75ms

Is there any way to speed up these kernels? Is out-of-place faster than in-place? Why is the inverse C->R FFT so much faster than the forward R->C FFT (~8ms versus ~22ms)? Finally, CodeXL is reporting a Kernel Usage of 20 for the first 2 FFTs and 10 for the 3rd FFT. My other kernels are at 70 to 100, so why is the kernel utilization so low?
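For a rough sense of scale, the timings above can be converted into an effective throughput. The sketch below is only an estimate: it assumes each quoted time covers the whole set of 1024 transforms, and it uses the conventional 5*N*log2(N) flop count for a complex FFT (halved for the real-input and real-output cases). The function name `fft_gflops` is made up for illustration and is not part of any library.

```python
import math


def fft_gflops(n, batch, seconds, real=False):
    """Estimate GFLOP/s with the conventional 5*N*log2(N) flop count.

    The count is halved for real-to-complex and complex-to-real transforms.
    This is a nominal figure for comparing runs, not a measured flop count.
    """
    flops_per_fft = 5.0 * n * math.log2(n)
    if real:
        flops_per_fft /= 2.0
    return flops_per_fft * batch / seconds / 1e9


# Assumption: each time is for all 1024 transforms of one plan.
timings = {
    "1) R2C forward, out of place": (22e-3, True),
    "2) C2R inverse, out of place": (8e-3, True),
    "3) C2C, in place": (75e-3, False),
}

for name, (seconds, real) in timings.items():
    rate = fft_gflops(4096, 1024, seconds, real)
    print(f"{name}: {rate:.1f} GFLOP/s")
```

Under those assumptions all three plans land somewhere between a few GFLOP/s and the mid-teens, which gives a concrete number to compare against other hardware or other libraries.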

This is a project to replace a top-of-the-line Intel i7 with a GPU, but the CPU is crushing the GPU at this point.

That being said, thanks to AMD and AMD engineers for putting out this free FFT tool, it is a really nice OpenCL FFT solution that is really easy to use. Thanks!

Austin McElroy

I have seen similar performance numbers running the Apple OpenCL FFT library on the 7970. An older 6770 easily outperforms the 7970 for smaller problem sizes. I have no idea why this is.

You mention that you perform the FFT 1024 times. Can you batch these FFTs, or are you doing that already? Batching appears to be a very effective way of getting the problem size to the point where the 7970 can shine. For large enough batches I'm seeing upwards of 300 GFLOP/s.
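To make "batching" concrete: instead of enqueueing 1024 separate transforms, all 1024 signals sit in one contiguous buffer and a single plan walks through them. The sketch below only illustrates that memory layout in plain Python and does not call any FFT library. The helper `batch_offsets` is hypothetical, and single-precision planar data is assumed. In clAmdFft the corresponding plan settings should be `clAmdFftSetPlanBatchSize` and `clAmdFftSetPlanDistance`, but check the header of your version for the exact signatures.

```python
def batch_offsets(fft_len, batch_size, distance=None):
    """Start index (in elements) of each transform inside one buffer.

    `distance` is the element stride between consecutive transforms;
    it defaults to fft_len, i.e. the transforms are tightly packed.
    """
    if distance is None:
        distance = fft_len
    if distance < fft_len:
        raise ValueError("distance must be >= fft_len or transforms overlap")
    return [i * distance for i in range(batch_size)]


FFT_LEN = 4096
BATCH = 1024
BYTES_PER_FLOAT = 4  # assumption: single-precision data

offsets = batch_offsets(FFT_LEN, BATCH)

# Planar layout keeps real and imaginary parts in separate buffers,
# so each plane holds FFT_LEN * BATCH floats.
plane_bytes = FFT_LEN * BATCH * BYTES_PER_FLOAT

print("first offsets:", offsets[:3])
print("last offset:", offsets[-1])
print("bytes per plane:", plane_bytes)
```

With this layout one enqueue call covers the whole batch, so the per-launch overhead is paid once instead of 1024 times.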

Cheers,

Dominic