
Poor clAmdFft performance compared to MKL, and other problems

Question asked by stepan.prokipchyn on Jan 14, 2013
Latest reply on Jan 17, 2013 by stepan.prokipchyn

Hello all.


I'm digging into clAmdFft and have a couple of questions about performance and a possible NVIDIA support problem.

Have a look at my example project. It runs a few FFT transforms using the MKL and clAmdFft libraries. The first test does a 2D FFT from real to hermitian_interleaved format on a small matrix (1024 * 729). The second test does the same for a batch of 16 similar transforms.
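For readers unfamiliar with the format: a real-to-Hermitian transform stores only the non-redundant half of the spectrum, so a dimension of length N that is halved holds N/2 + 1 complex values. Here is a minimal sketch of the resulting output size. The function name is hypothetical, and which of the two dimensions gets halved depends on the library's layout, so it is passed in explicitly.

```cpp
#include <cstddef>

// Hypothetical helper, not taken from the example project.
// Number of complex values produced by a 2D real-to-Hermitian FFT when the
// dimension of length `halved_len` is the one stored in half-spectrum form.
std::size_t hermitian_elements(std::size_t halved_len, std::size_t other_len) {
    return (halved_len / 2 + 1) * other_len;
}
```

Each complex value takes two floats or doubles in an interleaved layout, so the output buffer is slightly larger than the real input.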


Here are my results (Core i7-2400K vs HD 6950):

SINGLE FFT (1024*729)

OpenCL:  3575 us

SSE:     2050 us

Speedup: 0.573427


DOUBLE FFT (1024*729)

OpenCL:  3875 us

SSE:     2327 us

Speedup: 0.600516


SINGLE FFT BATCH (1024*729 * 16 patches)

OpenCL:  9026 us

SSE:     32646 us

Speedup: 3.61688


DOUBLE FFT BATCH (1024*729 * 16 patches)

OpenCL:  9784 us

SSE:     42115 us

Speedup: 4.30448
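The speedup values are the MKL (SSE) time divided by the OpenCL time, so anything below 1 means the GPU run was slower. For a rough sense of absolute throughput, the sketch below also converts a timing into effective GFLOPS. It assumes the common benchmark convention of about 2.5 * N * log2(N) floating-point operations for a real-input FFT of N points (half of the 5 * N * log2(N) used for complex input), which is an estimate rather than an exact operation count. The function names are hypothetical and not taken from the example project.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helpers, not taken from the example project.

// How many times faster the OpenCL run is than the reference (MKL/SSE) run.
double speedup(double reference_us, double opencl_us) {
    return reference_us / opencl_us;
}

// Estimated operation count for one real-input FFT of `points` samples,
// using the assumed 2.5 * N * log2(N) benchmark convention.
double real_fft_flops(std::size_t points) {
    const double n = static_cast<double>(points);
    return 2.5 * n * std::log2(n);
}

// Effective GFLOPS for `batch` transforms that finished in `elapsed_us` microseconds.
double effective_gflops(std::size_t points, std::size_t batch, double elapsed_us) {
    const double flops = real_fft_flops(points) * static_cast<double>(batch);
    return flops / (elapsed_us * 1e-6) / 1e9;
}
```

Under that assumption, `effective_gflops(1024 * 729, 1, 3575.0)` comes out at roughly 10 GFLOPS for the single-precision OpenCL run, and `effective_gflops(1024 * 729, 16, 9026.0)` at roughly 65 GFLOPS for the batch of 16.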


Here are my questions:

1. Why do I get such poor performance compared to MKL? 1024*729 is quite a large amount of data. Why can't my 2000 GFLOPS card beat a 100 GFLOPS CPU? Please note: I do not take data transfer time into account.

2. Unfortunately, I cannot get this simple project working on NVIDIA GTX 460 and 465 cards. The first test fails because of accuracy errors, and the batch test fails because of an OpenCL error (-36) inside the clAmdFft library. Can you confirm that this is an NVIDIA driver problem?

3. Can you suggest how to improve OpenCL performance for the following task? I have a large (5000*5000) double matrix. At each step I cut a small (1024*729) piece from this matrix (pieces can overlap), do some preprocessing (basically element-wise operations), perform a forward FFT, do some postprocessing (again element-wise operations), perform a backward FFT, and finally put the result into the output matrix. Each step is independent. I thought that both FFT and vector operations are well suited to the GPU, but I cannot get more than a 2x speedup compared to the MKL implementation. Now I'm thinking about using multiple out-of-order queues to calculate different steps. Is this the right direction?
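Since the batched runs are the only ones in my results where OpenCL beats MKL, one option for question 3 might be to gather several pieces into one contiguous buffer and transform them as a batch. Below is a minimal host-side sketch of that gathering step. It assumes a row-major matrix and pieces that lie fully inside it; `PatchOrigin` and `gather_patches` are hypothetical names, and no OpenCL or clAmdFft calls are shown.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical type: top-left corner of one piece inside the large matrix.
struct PatchOrigin {
    std::size_t row;
    std::size_t col;
};

// Copies each piece (patch_rows x patch_cols) out of a row-major matrix with
// `matrix_cols` columns into one contiguous buffer, one piece after another.
// Pieces may overlap; each is assumed to lie fully inside the matrix.
std::vector<double> gather_patches(const std::vector<double>& matrix,
                                   std::size_t matrix_cols,
                                   const std::vector<PatchOrigin>& origins,
                                   std::size_t patch_rows,
                                   std::size_t patch_cols) {
    std::vector<double> batch(origins.size() * patch_rows * patch_cols);
    std::size_t out = 0;
    for (const PatchOrigin& o : origins) {
        for (std::size_t r = 0; r < patch_rows; ++r) {
            // Start of row r of this piece inside the large matrix.
            const double* src = matrix.data() + (o.row + r) * matrix_cols + o.col;
            for (std::size_t c = 0; c < patch_cols; ++c) {
                batch[out++] = src[c];
            }
        }
    }
    return batch;
}
```

The same gathering could presumably be done in a kernel on the device to avoid extra transfers; the host version is only meant to show the layout a batched transform would consume.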


Thank you