Originally posted by: godsic to Russian:
I think the number of TMUs and their performance are the dominant factors, because for each element of the array you need to fetch a lot of other data.
For FFT algorithms, the GPU will probably need large caches and a high TMU count.
Another thing is the data representation. Don't forget that even a 1D array must be rearranged into a 2D one, if possible of size M x (wavefrontsize * N), where M and N are positive integers. It is best if both M and wavefrontsize * N are powers of 2. If the size is different, you will not utilize the full power of the GPU and the on-chip cache will work inefficiently. Typically, for R6xx and R7xx, wavefrontsize = 64.
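To make the layout rule concrete, here is a minimal Python sketch that picks M and a row width of wavefrontsize * N, both powers of 2, for a 1D array and pads it to fit. The function names (next_pow2, choose_layout, to_2d) and the near-square heuristic for choosing N are my own assumptions for illustration, not something the hardware or any SDK prescribes. Note also that zero padding changes the length of the transform, so for an actual FFT you would pick a size that already fits.

```python
import math

WAVEFRONT_SIZE = 64  # typical for R6xx/R7xx, as stated in the post


def next_pow2(n):
    """Return the smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def choose_layout(length, wavefront=WAVEFRONT_SIZE):
    """Pick (M, width) with width = wavefront * N, both powers of two.

    Assumption: the wavefront size is itself a power of two, and a
    roughly square shape is used only as an illustrative heuristic.
    """
    if length < 1:
        raise ValueError("length must be positive")
    if wavefront < 1 or wavefront & (wavefront - 1):
        raise ValueError("wavefront must be a power of two")
    # ceil(sqrt(length)), rounded up to a power of two, at least one wavefront
    width = max(wavefront, next_pow2(math.isqrt(length - 1) + 1))
    # ceil(length / width), rounded up to a power of two
    rows = next_pow2(-(-length // width))
    return rows, width


def to_2d(data, wavefront=WAVEFRONT_SIZE, fill=0.0):
    """Rearrange a 1D sequence into M rows of `width` elements, padding the tail."""
    rows, width = choose_layout(len(data), wavefront)
    padded = list(data) + [fill] * (rows * width - len(data))
    return [padded[r * width:(r + 1) * width] for r in range(rows)]
```

With these assumptions, a 16384-element array maps to 128 x 128 with no padding, while a 1000-element array maps to 16 x 64 with 24 padded elements at the end.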
No problem. If it is possible to get 97% of performance even for only one FFT size (16384, 1024, or 256, for example), that would be perfect. It is not a problem to make N*x FFTs.
With CUDA, people got 20% of GPU performance in the best case. The explanation given was that too much data is transferred. IMHO, this is wrong.