21 Replies Latest reply on Jan 3, 2010 8:40 PM by Raistmer

    FFT library for ATI GPU - does it exist?

    Raistmer

      Hello

      AFAIK the current ACML-GPU release doesn't use the GPU for FFT routines. Does any FFT library exist that makes use of ATI GPUs?

       

        • FFT library for ATI GPU - does it exist?
          Raistmer

          Well, let me rephrase:

          When will FFT on ATI GPUs be supported in ACML?

          CUDA has had its CUFFT library from the very first release, BTW.

            • FFT library for ATI GPU - does it exist?
              Russian

               

              Originally posted by: Raistmer

              Well, let me rephrase: when will FFT on ATI GPUs be supported in ACML? CUDA has had its CUFFT library from the very first release, BTW.

              The CUDA library runs at about 1% of the declared GPU performance. It is better to use IPP or a library for AMD than CUDA.

                • FFT library for ATI GPU - does it exist?
                  Raistmer
                  (Translated from Russian.)
                  Eh, I wouldn't say so.
                  Try CUFFT 2.3, it has become substantially faster.
                  Then again, if everything else is done on the GPU, IPP/ACML/FFTW would require you to first move the data to main memory and then move the transform results back to the GPU.
                  That is a sea of overhead. So the question still stands.

                  In short, using CPU FFT libraries requires copying data to and from host memory, which is too costly an operation to be effective.
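The round-trip argument above can be put into a small cost model. This is only a sketch: the function names and every number in the usage line are hypothetical, not measurements from anyone in this thread.

```python
def round_trip_seconds(n_bytes, bus_bytes_per_s):
    """Time to copy n_bytes from GPU to host and the same amount back."""
    return 2.0 * n_bytes / bus_bytes_per_s


def cpu_fft_worth_it(n_bytes, bus_bytes_per_s, cpu_fft_s, gpu_fft_s):
    """True if a CPU FFT plus both copies still beats an FFT done on the GPU."""
    return cpu_fft_s + round_trip_seconds(n_bytes, bus_bytes_per_s) < gpu_fft_s


# Hypothetical workload: 1000 transforms of 32k single-precision complex points.
n_bytes = 32768 * 8 * 1000
# Hypothetical timings and bus speed, chosen only to show how the copies can dominate.
print(cpu_fft_worth_it(n_bytes, 4e9, cpu_fft_s=0.05, gpu_fft_s=0.10))
```

With these made-up values the two copies alone cost more than the whole GPU transform, which is the point being made about CPU libraries.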
                    • FFT library for ATI GPU - does it exist?
                      Russian

                       

                      Originally posted by: Raistmer

                      Eh, I wouldn't say so. Try CUFFT 2.3, it has become substantially faster.

                      I tested with 2.3, and the kernel only.

                      As for data transfer without DMA, I won't even mention it. :-)

                          • FFT library for ATI GPU - does it exist?
                            Raistmer
                            Originally posted by: shormanm

                            Have you tried ACML-GPU?

                            http://develo.../a...s/default.aspx

                            Did they release a new version?
                            The initial version had no GPU acceleration for FFT, only some GPU BLAS routines, AFAIK.

                            "
                            ATI Stream-accelerated routines:

                            SGEMM
                            DGEMM
                            "
                              • FFT library for ATI GPU - does it exist?
                                godsic

                                to Raistmer:

                                (Translated from Russian.) You need to overclock PCIe and buy a quality motherboard, so that data sent over PCIe does not have to be retransmitted 20 times.

                                I fully agree about doing everything on the GPU, but there are subtleties. For example, say you need to do a 2D FFT on 9Gb of data, and then...

                                  • FFT library for ATI GPU - does it exist?
                                    shormanm

                                    English please guys.

                                    • FFT library for ATI GPU - does it exist?
                                      riza.guntur

                                      Oh man... English please, I want to learn too.

                                        • FFT library for ATI GPU - does it exist?
                                          godsic

                                          The main point is USING ONLY LOCAL GPU MEMORY AND THE GPU for data processing. That is very easy for image, video, and some signal processing, primitive in-game physics, and ray tracing. When you use a GPU you need to set aside everything you know about fast, efficient CPU algorithms. In many cases they will be inefficient on a GPU (VECTORIZATION and STREAMING are the main things for a GPU, not the operation count!).

                                          But in a more important area, scientific simulations, you may find that you need to process a large amount of data (much larger than local GPU memory), and in that case the speed of remote memory, PCIe, and even the CPU matters (because of the architecture of modern AMD north bridges!!!! HELLO AMD!). You need to divide your data into portions and reload each part into GPU memory. Therefore you need to buy only top-quality mainboards (ASUS, MSI) and a high-end CPU (Phenom II X4), and try to overclock your memory and PCIe as far as stable operation allows. PCIe can be very unstable, and therefore (thanks to the PCIe specification) your data can be sent 20 times until the PCIe host controller receives it without errors.
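Splitting a dataset into portions that fit in GPU memory might look like the sketch below. It assumes the items are independent (for example, a batch of separate 1D transforms); a single large 2D FFT also needs a transpose between passes, which this does not cover. The function name and the sizes are hypothetical.

```python
def chunk_ranges(total_items, item_bytes, gpu_budget_bytes):
    """Yield (start, stop) index ranges whose data fits in gpu_budget_bytes."""
    per_chunk = gpu_budget_bytes // item_bytes
    if per_chunk < 1:
        raise ValueError("a single item does not fit in the GPU budget")
    for start in range(0, total_items, per_chunk):
        yield start, min(start + per_chunk, total_items)


# Hypothetical sizes: 5000 rows of 32k complex64 points, 512 MiB usable on the card.
row_bytes = 32768 * 8
for start, stop in chunk_ranges(5000, row_bytes, 512 * 1024 * 1024):
    # Upload rows[start:stop], run the transforms, download the results.
    print(start, stop)
```

Each chunk still pays one upload and one download over PCIe, so the bus speed concern above applies to every iteration of the loop.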

                                            • FFT library for ATI GPU - does it exist?
                                              Raistmer
                                              Yes, different tasks --> different approaches.
                                              My dataset fits in GPU memory completely, so I want to see a GPU-accelerated FFT from the GPU chip vendor, ATI/AMD.

                                              Originally posted by: godsic

                                              PCIe can be very unstable, and therefore (thanks to the PCIe specification) your data can be sent 20 times until the PCIe host controller receives it without errors.

                                              Interesting. Can these retries be seen programmatically (i.e. through some retry counter in hardware?), or only indirectly, by measuring the real transfer speed and comparing it with the theoretical value for the same frequency and bus width?
                                                • FFT library for ATI GPU - does it exist?
                                                  godsic

                                                  Maybe they can, but we would need to ask AMD about it. For now, directly measuring PCIe speed (with an AMD utility) is the only way.

                                                  Generally, PCIe is just an ultra-fast and more functional USB.

                                                  I need to write an FFT in the near future for various reasons, so I will post the code here afterwards.
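The indirect check discussed above (measured transfer speed against the theoretical value) can be sketched as follows. The defaults assume a PCIe 2.0 link, which signals at 5 GT/s per lane with 8b/10b encoding; the function names are hypothetical, and the timing would have to come from your own transfer benchmark.

```python
def pcie_theoretical_bytes_per_s(lanes, gt_per_s=5.0, encoding=8.0 / 10.0):
    """Per-direction peak of a link; defaults correspond to PCIe 2.0."""
    return lanes * gt_per_s * 1e9 * encoding / 8.0


def link_efficiency(n_bytes, seconds, lanes):
    """Measured throughput as a fraction of the theoretical peak."""
    return (n_bytes / seconds) / pcie_theoretical_bytes_per_s(lanes)


# Hypothetical measurement: 4 GB moved in 1 second over an x16 link.
print(link_efficiency(4e9, 1.0, lanes=16))
```

A low ratio is only indirect evidence. Packet headers and other protocol overhead also keep real transfers below the peak, so this cannot by itself separate retries from normal overhead.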

                                                   

                                                    • FFT library for ATI GPU - does it exist?
                                                      Raistmer
                                                      Originally posted by: godsic

                                                      I need to write an FFT in the near future for various reasons, so I will post the code here afterwards.

                                                      Good news!
                                                      There is an FFT implementation for CUDA posted by Volkov on the nVidia forums, but unfortunately only powers of 2 up to 8192 are supported for now, AFAIK.
                                                      And I need a 32k FFT.
                                                        • FFT library for ATI GPU - does it exist?
                                                          godsic

                                                          It must be a power of 2! Otherwise you need to interpolate to a power of 2. Some use zero padding (which causes some painful artifacts). A 32k FFT will depend on CPU-NB speed because of a GPU limitation. 8192 is just a limitation of the GPU's address alignment and memory paging architecture. Maybe AMD uses 13-bit page memory registers in their TMUs.

                                                          Is that 32k in 2D or 1D? That is, a 2D or a 1D FFT?
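For reference, the zero-padding option mentioned above is easy to sketch. The helper names are hypothetical. Padding changes the frequency spacing of the result, which is one source of the artifacts being complained about.

```python
def next_pow2(n):
    """Smallest power of 2 that is greater than or equal to n (n >= 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1 << (n - 1).bit_length()


def zero_pad(samples):
    """Return a copy of samples extended with zeros to a power-of-2 length."""
    target = next_pow2(len(samples))
    return list(samples) + [0.0] * (target - len(samples))


padded = zero_pad([1.0] * 1000)
print(len(padded))
```

A length that is already a power of 2, such as 32768, comes back unchanged.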

                                                            • FFT library for ATI GPU - does it exist?
                                                              riza.guntur

                                                               

                                                              Originally posted by: godsic

                                                              It must be a power of 2! Otherwise you need to interpolate to a power of 2. Some use zero padding (which causes some painful artifacts). A 32k FFT will depend on CPU-NB speed because of a GPU limitation. 8192 is just a limitation of the GPU's address alignment and memory paging architecture. Maybe AMD uses 13-bit page memory registers in their TMUs.

                                                              Is that 32k in 2D or 1D? That is, a 2D or a 1D FFT?

                                                              The limitation comes from the maximum texture size the card can handle.

                                                              AFAIK 3D streams don't have such a limitation (though I don't see 3D stream size information anywhere in the Stream Computing User Guide; you may want to check there).

                                                              If you want to build an FFT for ATI, make sure to create different functions for different sizes. Small, medium, and large sizes must use different approaches.

                                                              • FFT library for ATI GPU - does it exist?
                                                                Raistmer
                                                                Originally posted by: godsic

                                                                It must be a power of 2! Otherwise you need to interpolate to a power of 2. Some use zero padding (which causes some painful artifacts). A 32k FFT will depend on CPU-NB speed because of a GPU limitation. 8192 is just a limitation of the GPU's address alignment and memory paging architecture. Maybe AMD uses 13-bit page memory registers in their TMUs.

                                                                Is that 32k in 2D or 1D? That is, a 2D or a 1D FFT?

                                                                1D FFT only here.
                                                                This app uses only a 32k FFT, but another one I'm interested in uses different powers of 2, down to size 8.
                                                              • FFT library for ATI GPU - does it exist?
                                                                Russian

                                                                 

                                                                Originally posted by: Raistmer

                                                                Good news! There is an FFT implementation for CUDA posted by Volkov on the nVidia forums, but unfortunately only powers of 2 up to 8192 are supported for now, AFAIK. And I need a 32k FFT.

                                                                The bad news is that it will reach only 20% of GPU performance. Memory bandwidth is not the bottleneck in this case.

                                                                  • FFT library for ATI GPU - does it exist?
                                                                    godsic

                                                                    to Russian:

                                                                    I think the number of TMUs and their performance are the dominant factors, because for each array element you need to fetch a lot of other data.

                                                                    For FFT algorithms the GPU will probably need large caches and a high TMU count.

                                                                    Another thing is the data representation. Don't forget that even a 1D array must be rearranged into 2D with, if possible, a size of M x (wavefrontsize * N), where M and N are positive integers; it is best if M and wavefrontsize * N are powers of 2. If the size differs, you will not utilize the full power of the GPU, and the on-chip cache will work inefficiently. Typically, for R6xx and R7xx, wavefrontsize = 64.

                                                                    to riza.guntur:

                                                                    Yes, but the texture size limitation comes from the memory organization and TMU architecture. The maximum dimension for a 3D texture is 8192x8192x8192!

                                                                    As far as I know, the CAL runtime can virtualize addresses and therefore accept large arrays, with some reduction in performance.
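The M x (wavefrontsize * N) rearrangement described above could be computed like this. It is only a sketch of the bookkeeping: the function name is hypothetical, the choice of N is left to the caller, and wavefrontsize = 64 is the R6xx/R7xx value quoted in the post.

```python
def layout_2d(length, n=1, wavefront=64):
    """Return (rows, width, padding) for storing a 1D array of `length`
    elements as rows x width, where width = wavefront * n."""
    width = wavefront * n
    rows = -(-length // width)          # ceiling division
    padding = rows * width - length     # unused elements in the last row
    return rows, width, padding


# A 32k-point array with N = 8 gives a 64 x 512 layout with no padding.
print(layout_2d(32768, n=8))
```

When `padding` is non-zero the last row is only partly used, which is the case the post warns will not utilize the GPU fully.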

                                                                      • FFT library for ATI GPU - does it exist?
                                                                        Raistmer
                                                                        Originally posted by: godsic

                                                                        As far as I know, the CAL runtime can virtualize addresses and therefore accept large arrays, with some reduction in performance.

                                                                        That does not work in the latest Catalyst versions.
                                                                        • FFT library for ATI GPU - does it exist?
                                                                          Russian

                                                                           

                                                                          Originally posted by: godsic

                                                                          to Russian: I think the number of TMUs and their performance are the dominant factors, because for each array element you need to fetch a lot of other data.

                                                                          For FFT algorithms the GPU will probably need large caches and a high TMU count.

                                                                          Another thing is the data representation. Don't forget that even a 1D array must be rearranged into 2D with, if possible, a size of M x (wavefrontsize * N), where M and N are positive integers; it is best if M and wavefrontsize * N are powers of 2. If the size differs, you will not utilize the full power of the GPU, and the on-chip cache will work inefficiently. Typically, for R6xx and R7xx, wavefrontsize = 64.

                                                                          No problem. If it is possible to get 97% of peak performance even for only one FFT size (16384, 1024, or 256, for example), that would be perfect. It is not a problem to make N*x FFTs.

                                                                          With CUDA, people got 20% of GPU performance in the best case. The explanation given was that too much data is transferred. IMHO, this is wrong.
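One possible reading of "N*x FFTs" is that a larger transform can be assembled from many transforms of one well-tuned smaller size. Below is a plain-Python sketch of that idea using the standard Cooley-Tukey split of a length n1*n2 transform into short transforms plus twiddle factors. It is not the poster's code, all function names are hypothetical, and it runs on the CPU purely to show the index arithmetic.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT, used only as a reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def small_fft(x):
    """Recursive radix-2 FFT; stands in for the one fast fixed-size kernel."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = small_fft(x[0::2]), small_fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def composed_fft(x, n1, n2):
    """Length n1*n2 FFT built from n2 FFTs of length n1 and n1 FFTs of length n2."""
    n = n1 * n2
    assert len(x) == n
    # Step 1: n2 transforms of length n1 over the strided columns x[c], x[c+n2], ...
    cols = [small_fft(x[c::n2]) for c in range(n2)]
    # Step 2: multiply by the twiddle factors exp(-2*pi*i*c*k1/n).
    for c in range(n2):
        for k1 in range(n1):
            cols[c][k1] *= cmath.exp(-2j * cmath.pi * c * k1 / n)
    # Step 3: n1 transforms of length n2, written out at index k1 + n1*k2.
    out = [0j] * n
    for k1 in range(n1):
        row = small_fft([cols[c][k1] for c in range(n2)])
        for k2 in range(n2):
            out[k1 + n1 * k2] = row[k2]
    return out
```

For example, `composed_fft(x, 4, 8)` on a 32-element list agrees with `dft(x)` up to rounding error. A 32k transform could be split the same way, for instance as 128 x 256.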

                                                                           

                                              • FFT library for ATI GPU - does it exist?
                                                Raistmer
                                                Any news on the project?