
Hello Raistmer,

We really can't comment on future products or software releases in this forum until they are officially announced.

We are considering GPU accelerated FFTs for possible inclusion in future releases of ACML-GPU, or OpenCL, or both, but we can't say anything more definite than that.

FFTs are a problem because, unless the arrays are very large (millions of data samples) they tend to be memory bound or data-transfer bound and not compute bound. Small FFTs (only thousands or hundreds of samples) can't really benefit from GPU acceleration for that reason.
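To put rough numbers on that claim: a radix-2 FFT performs on the order of 5·N·log2(N) floating-point operations while moving roughly 2·8·N bytes of single-precision complex data (one read plus one write), so its arithmetic intensity grows only logarithmically with N. A back-of-the-envelope sketch in Python (the 5·N·log2(N) count is the common textbook estimate, not a measurement of any particular library):

```python
import math

def fft_arithmetic_intensity(n):
    """Rough flops-per-byte estimate for an n-point complex FFT.

    Assumes the textbook ~5*n*log2(n) operation count and one read
    plus one write of single-precision complex samples (8 bytes each).
    """
    flops = 5.0 * n * math.log2(n)
    bytes_moved = 2 * 8 * n          # read the input, write the output
    return flops / bytes_moved

for n in (1024, 32 * 1024, 4 * 1024 * 1024):
    print(f"N = {n:>8}: ~{fft_arithmetic_intensity(n):.1f} flops/byte")
```

Even at four million samples this stays below about 7 flops per byte, which is under the compute-to-bandwidth balance of GPUs of this era, so the transform is limited by memory traffic rather than arithmetic.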

I see, OK, will wait.

In drawing that conclusion, I take it you are aware that CUFFT exists?

NVIDIA implemented an FFT for its CUDA platform, and it has proved very useful for my purposes.

So when you state "FFT can't benefit from GPU", please add "for ATI GPUs", or don't draw such conclusions at all. ATI just has huge kernel-call overhead (at least in the Brook implementation); maybe that, and not the FFT itself, is the reason for the lack of performance gain? And why talk about millions of samples when the current Catalyst driver can't support arrays longer than 8192 samples?...

Implementing the FFT on the GPU avoids moving data to and from system memory, which is the real performance killer.

A memory-bound FFT on the GPU is still much better than transferring the data back to system memory just to run the FFT there.
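A rough illustration of why the transfer dominates for the sizes discussed here: compare an assumed PCIe round-trip time against an assumed on-GPU FFT time for a 32k transform. The 5 GB/s bus and 100 GFLOP/s FFT throughput below are illustrative assumptions, not measurements of any real system:

```python
import math

# Illustrative assumptions, not measurements of any real hardware:
PCIE_BANDWIDTH = 5e9        # bytes/s, assumed effective PCIe throughput
GPU_FFT_RATE = 100e9        # flops/s, assumed sustained on-GPU FFT rate

def roundtrip_transfer_time(n):
    """Time to copy n complex64 samples to the GPU and back again."""
    return 2 * (8 * n) / PCIE_BANDWIDTH

def gpu_fft_time(n):
    """Time for an n-point FFT at the assumed rate (~5*n*log2(n) flops)."""
    return 5 * n * math.log2(n) / GPU_FFT_RATE

n = 32 * 1024
print(f"transfer: {roundtrip_transfer_time(n) * 1e6:.1f} us, "
      f"compute: {gpu_fft_time(n) * 1e6:.1f} us")
```

Under these assumptions the round trip costs several times more than the transform itself, which is exactly the argument for keeping the data resident on the GPU between transforms.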

http://img408.imageshack.us/img408/6779/fusiondie.jpg

fantastic die

http://img21.imageshack.us/img21/1295/fusiondie3.jpg

How does the Fusion die look?

Hello Raistmer,

No news yet, but in the meantime, could you tell me more about your application and how you are using FFTs?

(Without disclosing anything confidential, of course)

If I recall his nickname correctly, he's one of the great guys trying to port SETI@home to ATI Stream... (GPL code)

Many BOINC projects (and not only them...) are waiting for you (ATI) to port at least some types of high-performance FFTs to the GPU, as NVIDIA has done with CUFFT.

Actually, the FFT is needed for two apps. One has already been ported to CUDA, and the CUFFT library works well for it. It needs 1D FFTs with power-of-2 sizes from 8 up to 32k.

The other app has much less variability in FFT size; only the 1024 and 32k sizes are used.

Right now I'm mostly interested in the 1D 32k inverse FFT.

The signal is converted from the time domain to the frequency domain by a 32k FFT, then de-dispersion is applied (many de-dispersion patterns are used, so the forward transform is performed much less often than the inverse one), and then a 32k inverse FFT brings the data back into the time domain. After that, some folding and pulse finding take place.

Unfortunately, this last part is fairly memory bound. Most of the speedup would come from doing the de-dispersion (and the FFT; in the CPU version the FFT takes >60% of the whole run time) on the GPU, but the data-transfer costs between FFTs would eat up all the benefit if the FFT were done on the CPU.
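The forward-FFT → de-dispersion → inverse-FFT chain described above can be sketched in plain Python. A toy recursive radix-2 FFT stands in for CUFFT or ACML, and the linear-phase pattern is only a placeholder for a real de-dispersion pattern (a real one would encode a dispersion measure):

```python
import cmath

def fft(x):
    """Toy recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    tw = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return [even[k] + tw[k] for k in range(n // 2)] + \
           [even[k] - tw[k] for k in range(n // 2)]

def ifft(x):
    """Inverse FFT via the conjugation trick: conj(fft(conj(x))) / n."""
    n = len(x)
    y = fft([v.conjugate() for v in x])
    return [v.conjugate() / n for v in y]

def dedisperse(signal, pattern):
    """One trial: forward FFT, multiply by a phase pattern, inverse FFT."""
    spectrum = fft(signal)
    filtered = [s * p for s, p in zip(spectrum, pattern)]
    return ifft(filtered)

n = 1024                      # small stand-in for the 32k case
signal = [complex(k % 7, 0) for k in range(n)]
# Placeholder: a pure linear phase, which just circularly delays the signal.
pattern = [cmath.exp(-2j * cmath.pi * 3 * k / n) for k in range(n)]
out = dedisperse(signal, pattern)
```

Because the forward transform's output feeds straight into the filter and then the inverse transform, keeping all three stages on one device avoids two bus round trips per trial, which is precisely what an on-GPU FFT library would make possible.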

Did you try looking at:

http://developer.apple.com/mac/library/samplecode/OpenCL_FFT/index.html

At some point soon, when I've made some more progress on my projects, I expect to port it to OpenCL running on Stream and CUDA and see if I can get any speedup by running on heterogeneous devices simultaneously.

Will look; thanks a lot for the link!

Got it compiled and linked OK (with some code reduction) under VS2008 on Vista, but got a

clGetComputeDevice failed

error at runtime. The host has only the AMD OpenCL CPU platform, without any compatible GPUs installed.

Hi Raistmer,

You've been asking about GPU acceleration of the FFT in ACML, but from your comments, I don't think that's what you really want.

All of the ACML FFT APIs take their input from CPU memory and leave their output in CPU memory. That would still be true if we implemented GPU acceleration, and the library implementation would be stuck with the data-transfer costs in both directions.

If I understand you correctly, you really want to do your forward FFT, filtering, and inverse FFT all on the GPU without transfers back and forth in between, which sounds more like you need an OpenCL FFT library.

Yes, if ACML can't take its input array from the GPU and leave the output on the GPU, then it's not what I need.

Indeed, for good performance I need the data to stay on the GPU between FFT transforms. So I will look for an OpenCL FFT library (in general, for ATI GPUs I need something like CUFFT for CUDA; that library fits my needs quite well).