Poor clAmdFft performance compared to MKL, and other problems

Hello all.

I'm digging into clAmdFft and have a couple of questions about performance and a possible NVIDIA support problem.

Have a look at my example project. It runs a few FFTs using the MKL and clAmdFft libraries. The first set of tests does a 2D FFT from real to hermitian_interleaved format on a small matrix (1024 * 729). The second set of tests does the same for a batch of 16 similar transforms.

Here are my results (Core i7-2400K vs HD 6950):

SINGLE FFT (1024*729)

OpenCL:  3575 us

SSE:     2050 us

Speedup: 0.573427

DOUBLE FFT (1024*729)

OpenCL:  3875 us

SSE:     2327 us

Speedup: 0.600516

SINGLE FFT BATCH (1024*729 * 16 patches)

OpenCL:  9026 us

SSE:     32646 us

Speedup: 3.61688

DOUBLE FFT BATCH (1024*729 * 16 patches)

OpenCL:  9784 us

SSE:     42115 us

Speedup: 4.30448
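
For reference, the "Speedup" figures above are the SSE (MKL) time divided by the OpenCL time. The sketch below recomputes them for the single-precision runs and also estimates an effective GFLOP/s rate. The flop count uses the nominal 2.5 * N * log2(N) convention for a real FFT of N points, which is an assumption borrowed from common FFT benchmarking practice and not the actual operation count of either library. All names in the code are hypothetical.

```python
import math

def real_fft_flops(rows, cols):
    """Nominal flop count for a real 2D FFT: 2.5 * N * log2(N), N = rows * cols.
    This is a benchmarking convention (an assumption), not a measured count."""
    n = rows * cols
    return 2.5 * n * math.log2(n)

def gflops(flops, microseconds):
    """Effective rate in GFLOP/s for work that took the given time."""
    return flops / (microseconds * 1e-6) / 1e9

# Single-precision timings copied from the post, in microseconds.
timings_us = {
    "single": {"opencl": 3575, "sse": 2050},
    "batch16": {"opencl": 9026, "sse": 32646},
}
batch_sizes = {"single": 1, "batch16": 16}

flops_one = real_fft_flops(1024, 729)
report = {}
for name, t in timings_us.items():
    total = flops_one * batch_sizes[name]
    report[name] = {
        "speedup": t["sse"] / t["opencl"],
        "opencl_gflops": gflops(total, t["opencl"]),
        "sse_gflops": gflops(total, t["sse"]),
    }

for name, r in report.items():
    print(name, {k: round(v, 2) for k, v in r.items()})
```

Under this convention the single OpenCL transform works out to roughly 10 GFLOP/s and the batched one to roughly 65 GFLOP/s, both far below the card's peak rating.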

Here are my questions:

1. Why do I get such poor performance compared to MKL? 1024*729 is quite a large amount of data. Why can't my 2000 GFLOPS card beat a 100 GFLOPS CPU? Please note: I do not take data transfer time into account.

2. Unfortunately, I cannot get this simple project working on NVIDIA GTX 460 and 465 cards. The first test fails because of accuracy errors, and the batch test fails with an OpenCL error (-36, i.e. CL_INVALID_COMMAND_QUEUE) inside the clAmdFft library. Can you confirm that this is an NVIDIA driver problem?

3. Can you suggest how to improve OpenCL performance in the following task? I have a large (5000*5000) double matrix. At each step I cut a small (1024*729) piece from this matrix (pieces can overlap), do some preprocessing (basically element-wise operations), perform a forward FFT, do some postprocessing (again element-wise operations), and finally perform a backward FFT and put the result into the final output matrix. Each step is independent. I thought that both FFT and vector operations would run well on a GPU, but I cannot get more than a 2x speedup compared to the MKL implementation. Now I'm thinking about using multiple out-of-order queues to calculate different steps. Is this the right direction?
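
Since each step in question 3 is independent, one option is to group patches so that several FFTs run as one batched call, like the 16-transform batch test above. The sketch below only shows the bookkeeping: it lists patch origins and splits them into batches. It assumes a regular grid with a made-up stride and treats a patch as 1024 rows by 729 columns. The post specifies neither, and all function names are hypothetical.

```python
def patch_origins(matrix_shape, patch_shape, stride):
    """Top-left corners of all patches that fit inside the matrix on a regular grid."""
    rows, cols = matrix_shape
    p_rows, p_cols = patch_shape
    s_rows, s_cols = stride
    return [
        (r, c)
        for r in range(0, rows - p_rows + 1, s_rows)
        for c in range(0, cols - p_cols + 1, s_cols)
    ]

def group_into_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# Assumed layout: 5000x5000 matrix, 1024x729 patches, stride of about half a patch.
origins = patch_origins((5000, 5000), (1024, 729), (512, 365))
batches = group_into_batches(origins, 16)

print(len(origins), "patches in", len(batches), "batches")
```

Each batch would then be copied into one contiguous buffer, preprocessed, and transformed in a single batched FFT call, so the per-call overhead is paid once per batch instead of once per patch.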

Thank you

2 Replies

Hi Stepan,

Thank you for reporting your observations on the performance of the real-to-complex (R-C) transforms. The results you show clearly indicate the need for optimization and improvement. I have to mention that we have done very little performance work so far for the 2D/3D real transforms. The 1D real transforms have some optimizations, but as I said, a lot more work needs to be done in the 2D/3D area. We will address this in upcoming releases.

The complex transforms are in better shape. If you have the option to use complex transforms, I would encourage you to use them. I know that with complex transforms you would need to allocate more memory. If you can afford that, almost all problems needing real transforms can be solved using complex transforms. And in general, power-of-2 sizes perform better.
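
To illustrate how a real-input problem can be handled with a complex transform, here is a minimal sketch in plain Python. It uses a naive O(N^2) DFT on a short, made-up 1D signal, not clAmdFft, and is meant only to show the data relationship: real samples are promoted to complex values with a zero imaginary part (which is where the extra memory goes), and the first N/2 + 1 bins of the complex result carry the same information as a Hermitian half-spectrum.

```python
import cmath

def dft(values):
    """Naive O(N^2) forward DFT of a sequence of complex numbers."""
    n = len(values)
    return [
        sum(values[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]

# Made-up real input, purely for illustration.
real_signal = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, -2.5, 4.0]

# Promote real samples to complex with zero imaginary part (twice the storage).
complex_signal = [complex(x, 0.0) for x in real_signal]
spectrum = dft(complex_signal)

# The non-redundant part: N/2 + 1 bins, as in a Hermitian half-spectrum.
n = len(real_signal)
half_spectrum = spectrum[: n // 2 + 1]

# The remaining bins are redundant for real input: X[N-k] == conj(X[k]).
max_symmetry_error = max(
    abs(spectrum[n - k] - spectrum[k].conjugate()) for k in range(1, n)
)
print(len(half_spectrum), max_symmetry_error)
```

The same relationship holds in 2D, so a complex-to-complex plan on zero-padded imaginary data is a workable substitute when the real-transform path is slow, at the cost of roughly double the input memory.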

It is good to see that you are getting better performance by using batches. The power of a GPU is in doing a lot of computations simultaneously; it gets its speed from throughput and latency hiding. So if you can increase the batch size, it will be even better. Also, you can use multiple queues to do data transfers and computations in parallel.
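
The effect of batching can be read directly from the single-precision OpenCL timings in the original post (3575 us for one transform, 9026 us for 16). The sketch below computes the amortized time per transform and fits a simple model, time = fixed + batch * per_transform, through those two points. The linear model is an assumption, and two points cannot validate it, so treat the numbers as a rough guide only.

```python
def fit_linear_cost(batch_a, time_a, batch_b, time_b):
    """Fit time = fixed + batch * per_item through two (batch, time) measurements."""
    per_item = (time_b - time_a) / (batch_b - batch_a)
    fixed = time_a - per_item * batch_a
    return fixed, per_item

# OpenCL single-precision timings from the post, in microseconds.
single_us = 3575
batch16_us = 9026

# Average cost of one transform when 16 are submitted together.
amortized_us = batch16_us / 16

# Assumed linear model: fixed overhead per call plus a marginal cost per transform.
fixed_us, per_transform_us = fit_linear_cost(1, single_us, 16, batch16_us)

print("amortized per transform:", amortized_us, "us")
print("estimated fixed overhead:", round(fixed_us, 1), "us")
print("estimated marginal cost:", round(per_transform_us, 1), "us")
```

Under this model most of the single-transform time (about 3.2 ms of 3.6 ms) is fixed per-call overhead, and each additional transform in a batch costs about 0.36 ms, which is why larger batches help.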

Thank you, bragadeesh.

From your answer I understand that I should consider complex transforms for better performance.

Do you have any ideas about my NVIDIA problems? What hardware and FFT parameters do you use to test the library on NVIDIA cards?