Archives Discussions

amcelroy · ‎03-15-2013

Hey folks,

I was testing some OpenCL code with CodeXL that includes 3 different clAMDFFT plans which are:

1) 4096 point Real to Complex Planar Forward FFT, out of place, done 1024 times

2) 4096 point Complex to Real Planar Inverse FFT, out of place, done 1024 times

3) 4096 point Complex to Complex Planar FFT, in place, done 1024 times

and get some really long times on an ATI 7970. The times are as follows:

1) ~22ms

2) ~8ms

3) ~75ms

Is there any way to optimize speed for these kernels? Is out-of-place faster than in-place? Why is the Inverse C->R FFT so much faster (~8ms vs. ~22ms) than the Forward R->C FFT? Finally, CodeXL is reporting a 20 for Kernel Usage for the first 2 FFTs, and a 10 for the 3rd FFT. My other kernels are at 70 to 100, so why is the kernel utilization so low?

This is a project to replace a top-of-the-line Intel i7 with a GPU, but the CPU is crushing the GPU at this point.

That being said, thanks to AMD and AMD engineers for putting out this free FFT tool, it is a really nice OpenCL FFT solution that is really easy to use. Thanks!

Austin McElroy

dmeiser · ‎03-15-2013

I have seen similar performance numbers running the Apple OpenCL FFT library on the 7970. An older 6770 easily outperforms the 7970 for smaller problem sizes. I have no idea why this is.

You mention that you perform the FFT 1024 times. Can you batch these FFTs or are you doing that already? Batching appears to be a very effective means for getting the problem size to the point where the 7970 can shine. For large enough batches I'm seeing upwards of 300GFLOP/s.

Cheers,

Dominic

amcelroy · ‎03-15-2013

Hey Dominic,

Yes, these are all Batched. How did you overcome the problem on the 7970, just throw more data at it? What speeds were you seeing on the 6770?

Thanks for mentioning that the Apple OpenCL FFT runs similarly on the 7970, we were about to chase that rabbit!

Austin

dmeiser · ‎03-15-2013

amcelroy wrote: "Yes, these are all Batched. How did you overcome the problem on the 7970, just throw more data at it? What speeds were you seeing on the 6770?"

Yes, just using larger problem sizes. I was interested in 256x256 transforms and for batches of 128 I was seeing right around 330GFLOP/s. For comparison on the 6770 I was seeing 120GFLOP/s for that problem size.

bragadeesh · ‎03-15-2013

Hi Austin,

In the time measurements, are you including the time taken for round-trip data copies as well? I mean data transfer from system memory to device memory and back. When you say comparison with an Intel CPU, what exactly are you running? Is it MKL? What are the observed times with that? And how many CPU cores does your system have?

Is this an evaluation for a large project? How many GPUs are we talking about?

amcelroy · ‎03-15-2013

Hey Bragadeesh,

These numbers are just what CodeXL displays for that given kernel, if that answers your question. CodeXL probably isn't taking into account the time to read and write the dataset to memory.

The CPU is going to be an 8 core i7 (not sure about the exact specs), but it is taking < 50ms to do the same FFT operations on the i7. The FFT library is FFTW.

So far, the plan is only for 1 GPU per system. The project is for a commercial laser imaging system, so this definitely isn't a one-off.

Thanks,

Austin

amcelroy · ‎03-20-2013

Hey Bragadeesh,

Sorry to be a gadfly, have you been able to replicate the problem? I'd be happy to run test code or assist in any way.

Thanks,

Austin

bragadeesh · ‎03-21-2013

Hi Austin,

CodeXL is a nice tool. But for kernel performance, I would like to see the time measurements made directly by using system timers and/or using GPU OpenCL events and profiling-enabled command queues. On my side, I measured the performance of these real FFTs and they are comparable to the complex FFTs. We need to understand which of these steps you are having trouble with:

1. Copy data from system memory to GPU

2. FFT kernel execution

3. Copy data from GPU to system memory

The running time of step 2 can be measured accurately by programming with OpenCL events. You can also write code to time the other steps to better understand the relative times. As far as I can tell, the kernel performance is reasonable as measured on my side.
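One way to separate the three steps with a system timer is sketched below in Python. The `time_stages` helper and the stand-in stage functions are hypothetical and not part of clAmdFft or OpenCL. In real host code each stage would have to block until the device is done (for example by calling clFinish) before the timer is read, otherwise only the enqueue cost gets measured.

```python
import time

def time_stages(stages, warmup=1, repeats=5):
    """Return the best wall-clock time in milliseconds for each named stage.

    `stages` is a list of (name, callable) pairs. Each callable must block
    until its work is finished (for GPU work: wait on the queue first).
    """
    best = {}
    for name, fn in stages:
        for _ in range(warmup):      # untimed runs absorb one-time setup cost
            fn()
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            fn()
            samples.append((time.perf_counter() - start) * 1e3)
        best[name] = min(samples)    # minimum is least disturbed by other load
    return best

# Stand-in workloads (hypothetical); replace with the real upload, FFT, download.
def fake_upload():
    bytearray(1 << 20)

def fake_fft():
    sum(i * i for i in range(10_000))

def fake_download():
    bytearray(1 << 20)

timings = time_stages([
    ("host_to_device", fake_upload),
    ("fft_kernel", fake_fft),
    ("device_to_host", fake_download),
])
for name, ms in timings.items():
    print(f"{name}: {ms:.3f} ms")
```

Timing each stage on its own shows whether the kernel or the transfers dominate the total.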

amcelroy · ‎03-21-2013

Hey Bragadeesh,

Thanks for investigating. What times are you getting for a 4096 Complex to Complex planar FFT? The CodeXL times are very similar to the system timer times.

Also, why is the R->C slower than the C->R (see the first post for times)?

Thanks again for your time,

Austin

amcelroy · ‎03-24-2013

Just an update, I've tried an out of place 4096 C->C planar FFT and it is still in the realm of ~80ms.

Running the 3 FFTs from the original post through Intel Performance Primitives (IPP) on an i5 takes around 80ms, so there still seems to be a large performance gap.

Any help would be greatly appreciated.

bragadeesh · ‎03-25-2013

Generally, the C->R transform is slightly slower than the R->C transform. In my measurements, I never saw the other way around.

The time I measured for complex planar 4096-point FFT for 1024 transforms (batch size 1024) is less than 1 ms. This is just kernel computation time. When you say time, what exactly are you measuring?
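To put these times on the same scale as the GFLOP/s figures quoted earlier in the thread, here is a rough conversion in Python. It assumes the conventional 5 * N * log2(N) flop count for a complex length-N FFT, which is a benchmarking convention and not something clAmdFft reports. The function name is made up for illustration.

```python
import math

def fft_gflops(n, batch, seconds):
    """Nominal GFLOP/s for `batch` complex FFTs of length `n` done in `seconds`.

    Uses the conventional 5 * n * log2(n) flop estimate per transform.
    """
    flops = 5.0 * n * math.log2(n) * batch
    return flops / seconds / 1e9

# 4096-point complex FFT, batch of 1024, using the times quoted in this thread
print(f"{fft_gflops(4096, 1024, 75e-3):.1f} GFLOP/s at 75 ms")
print(f"{fft_gflops(4096, 1024, 1e-3):.1f} GFLOP/s at 1 ms")
```

Under that assumption the ~75ms run works out to roughly 3.4 GFLOP/s, while anything under 1 ms is above 250 GFLOP/s, which is in the range Dominic reported for large batches.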

amcelroy · ‎03-26-2013

Wow!!! < 1ms is way better than I would have hoped for 4096 by a batch of 1024. The times reported in the first post are the time required to:

1.) Query the temp buffer size (clAmdFftGetTmpBufSize)

2.) Allocate the temp buffer (it isn't allocated for the sizes given in the first post)

3.) Enqueue the transform (clAmdFftEnqueueTransform)

4.) Wait for the queue to finish with clFinish( )

Thanks for posting your results. I'm going to mess around with the demo code and see if that helps. The FFT gives all the correct answers when compared against IPP and LabVIEW; only the times are slow.

Austin

amcelroy · ‎03-27-2013

Ok, after endlessly staring at and debugging my code vs. the AMD code, I think the problem is solved. I was allocating memory using CL_MEM_ALLOC_HOST_PTR, while the AMD code uses only CL_MEM_READ_WRITE. After switching my code to CL_MEM_READ_WRITE, the times dropped to what Bragadeesh is reporting. When I changed the allocation in the AMD code to CL_MEM_ALLOC_HOST_PTR, the AMD code showed the same slow times that I reported.

Lesson Learned: CL_MEM_READ_WRITE is awesome!!

Sorry for all the trouble caused and hopefully this will help other folks avoid the same error.

Austin

Help to improve clAMDFFT speed