Hello all.
I'm experimenting with clAmdFft and have a couple of questions about performance, and possibly an NVIDIA support problem.
Have a look at my example project. It runs a few FFT transforms using the MKL and clAmdFft libraries. The first test does a 2D FFT from real to hermitian_interleaved format on a small matrix (1024 * 729). The second test does the same for a batch of 16 such transforms.
Here are my results (Core i7-2400K vs HD 6950):
SINGLE FFT (1024*729)
OpenCL: 3575 us
SSE: 2050 us
Speedup: 0.573427
DOUBLE FFT (1024*729)
OpenCL: 3875 us
SSE: 2327 us
Speedup: 0.600516
SINGLE FFT BATCH (1024*729, batch of 16)
OpenCL: 9026 us
SSE: 32646 us
Speedup: 3.61688
DOUBLE FFT BATCH (1024*729, batch of 16)
OpenCL: 9784 us
SSE: 42115 us
Speedup: 4.30448
Here are my questions:
1. Why do I get such poor performance compared to MKL? 1024*729 is quite a large amount of data. Why can't my ~2000 GFLOPS card beat a ~100 GFLOPS CPU? Please note: I do not include data transfer time in the measurements.
2. Unfortunately, I cannot get this simple project working on NVIDIA GTX 460 and 465 cards. The first test fails with accuracy errors, and the batch test fails with OpenCL error -36 (CL_INVALID_COMMAND_QUEUE) inside the clAmdFft library. Can you confirm that this is an NVIDIA driver problem?
3. Can you provide some ideas on how to improve OpenCL performance for the following task? I have a large (5000*5000) double matrix. At each step I cut a small (1024*729) piece out of this matrix (pieces can overlap), do some preprocessing (basically element-wise operations), perform a forward FFT, apply some postprocessing (again element-wise operations), and finally do a backward FFT and put the result into the final output matrix. Each step is independent of the others. I thought both FFT and vector operations were a good fit for the GPU, but I cannot get more than a 2x speedup over the MKL implementation. Now I'm thinking about using multiple out-of-order queues to process different steps concurrently. Is that the right direction?
Thank you