I was testing some OpenCL code with CodeXL that includes 3 different clAMDFFT plans which are:
1) 4096 point Real to Complex Planar Forward FFT, out of place, done 1024 times
2) 4096 point Complex to Real Planar Inverse FFT, out of place, done 1024 times
3) 4096 point Complex to Complex Planar FFT, in place, done 1024 times
and get some really long times on an ATI 7970. The times are as follows:
Is there any way to optimize the speed of these kernels? Is out-of-place faster than in-place? Why is the inverse C->R FFT 1.5 times faster than the forward R->C FFT? Finally, CodeXL is reporting a Kernel Usage of 20 for the first 2 FFTs and 10 for the 3rd FFT. My other kernels are at 70 to 100, so why is the kernel utilization so low for the FFTs?
This is a project to replace a top-of-the-line Intel i7 with a GPU, but the CPU is crushing the GPU at this point.
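To make the GPU vs. CPU comparison concrete, here is a minimal sketch of how a measured batch time could be turned into a nominal throughput figure. These are my own hypothetical helpers, not part of clAmdFft or CodeXL. It uses the common 5*N*log2(N) flop-count convention for a complex FFT (halved for real transforms) and assumes single-precision data. The time passed in at the bottom is a placeholder, not one of my measurements.

```python
import math


def fft_flops(n, real=False):
    """Nominal flop count for one length-n FFT.

    Uses the conventional 5*n*log2(n) estimate for a complex transform
    and half of that for a real transform. This is a benchmarking
    convention, not an exact operation count.
    """
    flops = 5.0 * n * math.log2(n)
    return flops / 2 if real else flops


def batch_gflops(n, batch, seconds, real=False):
    """Nominal GFLOP/s for `batch` transforms of length n done in `seconds`."""
    return fft_flops(n, real) * batch / seconds / 1e9


def batch_bytes(n, batch, real=False, bytes_per_float=4):
    """Size of one input buffer set for the whole batch.

    Complex planar data is two float arrays (real and imaginary);
    real data is one. Assumes single precision by default.
    """
    floats_per_transform = n if real else 2 * n
    return floats_per_transform * batch * bytes_per_float


if __name__ == "__main__":
    n, batch = 4096, 1024
    measured_seconds = 1e-3  # placeholder value, NOT a real measurement
    print("complex batch size (bytes):", batch_bytes(n, batch))
    print("nominal GFLOP/s:", batch_gflops(n, batch, measured_seconds))
```

Running the same calculation with the CPU timing gives a like-for-like number, and the buffer size shows how much data has to cross the PCIe bus if the host-to-device transfers are included in the timing.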
That being said, thanks to AMD and the AMD engineers for putting out this free FFT tool. It is a really nice OpenCL FFT solution that is really easy to use. Thanks!