11 Replies Latest reply on Apr 28, 2009 4:17 PM by MicahVillmow

    Shader arithmetic performance

    Firestrider

      Can the Radeon 4870 or FireStream 9270 actually reach their theoretical FLOPS in real world applications, or is it just a bunch of hype?

      I mean, I've heard that the RSX in the PS3 has a theoretical output of 1.8 teraFLOPS, but in real-world performance it probably gets nowhere near that.

        • Shader arithmetic performance
          kos

          What's the RSX? The video chip in the PS3? PS3 = nVidia 7800GTX + Cell = 400-500 GFLOPS for both combined. Remember that CPUs can reach their theoretical peak performance, but for a GPU it's much harder to do. Still, I've seen some examples where it is possible - it depends only on your code.

            • Shader arithmetic performance
              Firestrider

              Can you link me to such examples?

              According to Wikipedia, the RSX alone (yes, the GPU in the PS3) has a theoretical floating-point arithmetic performance of 1.8 teraFLOPS, and the whole PS3 can do 2 teraFLOPS... but this could be wrong.

                • Shader arithmetic performance
                  kos

                  There is nothing there about 2 teraFLOPS of anything. A 90nm chip is just junk compared to modern GPUs. I've seen some resources in Russian about GPU testing, but I don't think they would be useful for you.

                    • Shader arithmetic performance
                      Firestrider

                      Guess Nvidia lied then.

                      Even if it is of no use to me, I would like to see it proven.

                        • Shader arithmetic performance
                          rick.weber

                          I'm using a FireStream 9170 and I have gotten peak performance (well, 498 GFLOPS out of the advertised 500) with CAL. The only problem was that the operation I had it do was complete nonsense: I think it was a 64-way unrolled MAD in a for loop, structured to force dependencies (to prevent the compiler from optimizing the work away) while still preventing pipeline stalls.

                          Basically, how close you get to peak performance is a function of how memory-bound your application is and what fraction of your instructions are actually computation. At the top end, the demo app I wrote almost never reads from memory and does MAD instructions (2 flops each) almost exclusively. Conversely, if your app is a matrix addition, you'll be lucky to get 10% of the theoretical maximum, since you're always loading data from memory (unless it's somehow cached from a previous operation, though I haven't really found anything on how caching works on these GPUs, so even then I can't say for certain). Also, if you want maximal performance, you'll need to spend a lot of time optimizing your code for the pipeline and masking dependencies.

                          So, in short, some applications can see upwards of 80% of peak performance while others will fall flat on their face, barely achieving 1%.

                  • Shader arithmetic performance
                    pbhani

                    > Can the Radeon 4870 or FireStream 9270 actually reach their theoretical

                    > FLOPS in real world applications, or is it just a bunch of hype?

                    Not sure about nVidia, but AMD GPUs can definitely deliver the performance we claim they do :-) However, it is important to understand what GPU performance really means.

                    Peak GPU performance = engine clock * number of stream processors * FLOPs per clock

                    For the RV770, that is 750 MHz * 800 * 2 (1 MAD/cycle, counted as 2 flops) = 1.2 TFLOPS!

                    Like someone already mentioned, you can actually write a kernel that will get you close to that performance number. However, the key issue is that:

                    - The above assumes that your kernel is _only_ doing arithmetic. So your peak ALU kernel could simply be doing a bunch of MADs and nothing else!

                    - Most real world applications need to read some input data and write some output data in order to do anything useful.

                    The above two factors imply that most real-world applications will not be able to use the complete ALU muscle of the GPU, for the simple reason that the kernels become memory-bound very quickly.

                    For example, we GPU folks love to talk about our matrix multiplication (MM) kernel performance, since MM has a high number of ALU operations (2n^3) compared to memory operations. Even then, the best-performing MM kernel we have is memory-limited! In this case, even with 100% cache efficiency, performance is limited by the peak cache bandwidth on the RV770, which is ~500 GB/s, giving a maximum of about 500 GFLOPS for MM. Our optimized MM kernel gets close to that number on the RV770.

                    Hope this helps a bit.

                     

                      • Shader arithmetic performance
                        vvolkov

                         

                        Originally posted by: pbhani

                        Even then, the best-performing MM kernel we have is memory-limited! In this case, even with 100% cache efficiency, performance is limited by the peak cache bandwidth on the RV770, which is ~500 GB/s, giving a maximum of about 500 GFLOPS for MM. Our optimized MM kernel gets close to that number on the RV770.


                        Does that mean you substantially outperform AMD's best MM kernel, which, as far as I can see, runs at 300 GFLOPS on the 4870?

                          • Shader arithmetic performance
                            vvolkov

                             

                            Originally posted by: vvolkov

                            Does it mean that you substantially outperform AMD's best MM kernel, which, as far as I see, runs at 300 GFlops on 4870?


                            Oops, it seems I was wrong. simple_matmult in the SDK runs at close to 500 GFLOPS on the 4870.

                            I was confused by http://developer.amd.com/gpu_assets/IUCAA_Pune_PEEP_2008.pdf, which mentions 300 GFLOPS in SGEMM and 137 GFLOPS in DGEMM on the 3870. That does not seem right, as the 3870's peak in double precision is only ~100 GFLOPS (right?). Also, compute_matmult in the SDK does indeed run at ~300 GFLOPS, but on the 4870.

                              • Shader arithmetic performance
                                MicahVillmow

                                vvolkov, 

                                 What we see on our optimized MM kernel is ~540 GFLOPS in IL. This is not optimal; I expect it is possible to get closer to 600+ GFLOPS with a hand-coded ISA implementation. 

                                compute_matmult is not even close to optimal, as there is a huge drop-off in performance once you hit certain matrix sizes. Written optimally, compute_matmult should outperform simple_matmult and hit around ~640 GFLOPS. So there is still room for improvement, but that will probably be left for someone else to figure out.

                                 

                                  • Shader arithmetic performance
                                    vvolkov

                                     

                                    Originally posted by: MicahVillmow

                                    What we see on our optimized MM kernel is ~540 GFLOPS in IL. This is not optimal; I expect it is possible to get closer to 600+ GFLOPS with a hand-coded ISA implementation.



                                    Micah, thanks for your reply. Could you also comment on the numbers given on slide 20 of this presentation: http://developer.amd.com/gpu_assets/IUCAA_Pune_PEEP_2008.pdf 

                                    It claims 137 GFLOPS in DGEMM achieved by the AMD Core Math Library on a 3870, which looks very sexy. However, and please correct me if I'm wrong, that is above the theoretical peak of the 3870 in double precision, which is around 100 GFLOPS. Do you know how that could be possible?

                                    Thanks,

                                    Vasily