33 Replies Latest reply on Oct 21, 2010 4:37 PM by rotor

    Existing/future GPUs

    fir3ball

      I did some experiments with my current GPUs and I'm becoming more and more interested in the OpenCL platform.  I work in scientific computing (matrices, geometric algorithms, etc.).

      However, I have yet to get my hands on a high-powered ATI card (like the 5870/5970).  I was intrigued by the upcoming nVidia Fermi/Tesla, but the pricing and the specs of the nVidia cards seem mismatched when compared with the ATI option.

      From stats gathered on the net, I get the following peak GFLOPS:

      • GTX280:  single=622  double=78
      • 5970: single=4600  double=928
      • (Current) tesla C1060:  single=933  double=78
      • (New) GTX480: single=1344 double=168              [EDIT]
      • (Future) tesla C2070:  single=N/A  double=630

      Approx pricing:  GTX280=450, 5970=650, tesla=1700
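
To put the figures above side by side, here is a rough Python sketch that computes the DP/SP ratio and the peak DP GFLOPS per unit of price. The numbers are exactly the ones quoted above; pairing the 1700 price with the C1060 entry is an assumption (the post only says "tesla"), and the names `CARDS`, `dp_ratio`, `dp_per_price` and `summarize` are made up for this illustration.

```python
# Peak GFLOPS and approximate prices as quoted in the post.
# Assumption: the "tesla=1700" price refers to the (current) Tesla C1060.
CARDS = {
    "GTX280":      {"sp": 622.0,  "dp": 78.0,  "price": 450.0},
    "5970":        {"sp": 4600.0, "dp": 928.0, "price": 650.0},
    "Tesla C1060": {"sp": 933.0,  "dp": 78.0,  "price": 1700.0},
}


def dp_ratio(card):
    """Fraction of the single-precision peak available in double precision."""
    return card["dp"] / card["sp"]


def dp_per_price(card):
    """Peak double-precision GFLOPS per unit of price."""
    return card["dp"] / card["price"]


def summarize(cards):
    """Return (name, DP/SP ratio, DP GFLOPS per price unit) for each card."""
    return [(name, dp_ratio(c), dp_per_price(c)) for name, c in cards.items()]


if __name__ == "__main__":
    for name, ratio, value in summarize(CARDS):
        print(f"{name:12s} DP/SP = {ratio:.3f}   DP GFLOPS per price unit = {value:.3f}")
```

This only compares theoretical peaks, so it says nothing about what real OpenCL code will reach.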

      In the past, I was kind of partial toward nVidia... but these numbers are totally ludicrous!

      I know this is only theoretical throughput and real-life OpenCL code will not reach it, but even then... I would simply say wow to the 5970 (and it's way cheaper than any Tesla).

      Am I missing anything obvious here? Did I get a stat wrong? Is double-precision performance on ATI "that good"?

      Is there a catch? Something like: on ATI, memory accesses need to be coalesced perfectly (whereas on GT200 the coalescing restrictions were relaxed compared to G80)?

        • Existing/future GPUs
          nou

          Well, if you need DP support, you should know that with nVidia you get "full" DP speed only on a Tesla card.

          The GeForce GTX 480 has 1344.96 GFLOPS in single precision but only 168.12 GFLOPS in DP according to this article; that is only 1/8 of single (for Tesla it will be 1/2).

          ATI Radeon cards have 1/5 of their single-precision performance in DP.

           

          • Existing/future GPUs
            MicahVillmow
            fir3ball,
            I believe the DP number on the 5970 is incorrect. ATI hardware peaks at, on average, 1/5th of its SP performance in DP. The reason is how the hardware is set up. More information can be found in the slides "ATI Stream Computing: ATI Radeon™ HD 3800/4800 Series GPU Hardware Overview" on our documentation page.
            So, if the SP peak is 4600, then DP would be 1/5th or 920 gflops.

              • Existing/future GPUs
                fir3ball

                 

                Originally posted by: MicahVillmow
                > So, if the SP peak is 4600, then DP would be 1/5th or 920 gflops.

                Yes Micah, thank you for the feedback, but those are already the numbers I had up there:

                5970: single=4600  double=928

                I have nothing to complain about with such numbers; they are, on paper, way higher than what nVidia offers.

                Even the single-GPU 5870 seems a very, very good offer in comparison to the pricey Tesla.

                Thank you for the hardware overview link.  So far, I have mostly read very technical nVidia documents about the CUDA/hardware architecture, so any document specific to the ATI architecture is crucial.  Both architectures have to be taken into consideration when building the algorithms.

                 

                  • Existing/future GPUs
                    davibu

                     

                    > Even the 1-GPU 5870 seems a very very good offer, in comparison to the pricy tesla.

                    This comparison is a bit unfair: you are comparing a card mainly targeted at the entertainment market with one for the HPC market. You should use a GTX285, GTX295 or one of the new 480/470 cards for the comparison with the 5870.

                     

                     

                      • Existing/future GPUs
                        fir3ball

                         

                        Originally posted by: davibu
                        > > Even the 1-GPU 5870 seems a very very good offer, in comparison to the pricy tesla.
                        >
                        > This comparison is a bit unfair, you are comparing a card mainly targeted for the entertainment market with one for HPC market. You should use a GTX285, GTX295 or a new 480/470 for the comparison with the 5870.

                        Of course.  But apparently, even the Tesla does not match the double-precision power of the "entertainment" ATI card.

                        My personal experience is that it's often more cost-effective to go with "consumer" goods than HPC/server-grade items.
                        But of course, you don't get ECC memory and cannot have as much RAM.

                          • Existing/future GPUs
                            _Big_Mac_

                            You're getting maimed by peak flops.

                            Peak flops is the theoretical maximum derived analytically from clock frequency times the number of ALUs times the number of operations each ALU can issue per clock; it does NOT indicate how well a card actually performs on real problems. You need benchmarks for that. I haven't found many yet; here are some from Anandtech:

                            http://images.anandtech.com/graphs/nvidiageforcegtx480launch_032610115215/22215.png

                            http://images.anandtech.com/graphs/nvidiageforcegtx480launch_032610115215/22216.png

                            Note that these two graphs are nowhere near an objective comparison of "compute" performance. You'd need to find more and make sure they describe roughly the same algorithms you'll be needing.

                            There's no way to judge a piece of hardware's performance based on a single number (especially peak flops). It's not even a good indication. If you were concerned with N-queens-like problems, a 1 TFLOP single-chip GTX285 would be more than twice as fast as a 4.5 TFLOP dual-chip 5970 and over 10x faster than a more comparable 4890 (1.3 TFLOPS). If you did password cracking, on the other hand, a 5970 would be five times faster than a GTX 295 (http://www.brightsideofnews.com/news/2010/3/16/ati-radeon-hd-5970-is-the-king-of-iphone2c-wi-fi-password-cracking.aspx).

                            So if you value your money, you should look hard for good benchmarks, making sure they are comparing apples to apples and are relevant to your field.
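
For reference, the "clock times ALUs times something" arithmetic can be sketched in a few lines of Python. This is only an illustration: the ALU counts and clocks below are the commonly published specs as far as I know (an assumption, they are not from this thread), the factor 2 assumes one multiply-add per ALU per clock, and `peak_gflops` and `SPECS` are names invented here. The DP fractions (1/8 and 1/5) are the ones quoted earlier in the thread.

```python
def peak_gflops(alus, clock_mhz, flops_per_alu_per_clock=2, dp_fraction=1.0):
    """Theoretical peak in GFLOPS.

    alus * clock * flops per ALU per clock, optionally scaled by the
    fraction of the single-precision rate available in double precision.
    """
    return alus * clock_mhz * flops_per_alu_per_clock / 1000.0 * dp_fraction


# Assumed published specs: (number of ALUs, shader/engine clock in MHz).
SPECS = {
    "GTX 480": (480, 1401),
    "HD 5870": (1600, 850),
    "HD 5970": (3200, 725),
}

if __name__ == "__main__":
    for name, (alus, mhz) in SPECS.items():
        print(f"{name}: SP peak = {peak_gflops(alus, mhz):.2f} GFLOPS")
    # DP fractions quoted in the thread: 1/8 for GeForce Fermi, 1/5 for ATI.
    print(f"GTX 480 DP = {peak_gflops(480, 1401, dp_fraction=1 / 8):.2f} GFLOPS")
    print(f"HD 5970 DP = {peak_gflops(3200, 725, dp_fraction=1 / 5):.2f} GFLOPS")
```

Under these assumptions the formula lands on the same 1344.96 / 168.12 (GTX 480) and roughly 4600 / 928 (5970) figures quoted above, which shows they are purely analytical numbers, not measurements.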

                              • Existing/future GPUs
                                fir3ball

                                >  There's no way to judge a piece of hardware's performance based on a single number (especially peak flops). It's not even a good indication.

                                100% with you there. Peak flops do not mean anything if you don't have the bandwidth to go along with them, for example.

                                But in practice there are not many good comparative benchmarks to be found, and a lot of the good papers on the subject only describe CUDA (nVidia) programs.  Evaluating the ATI option is quite hard.

                                It is said that the latest nVidia generation has a true L1/L2 cache hierarchy, more flexible than the "L1 local memory" + "texture L2 cache" of the previous generations.  So complex algorithms might run much faster on those, as you don't need to fine-tune the memory accesses as much.

                                So, bottom line... no real conclusion can be drawn right now; the latest nVidia generation is too new. But I might grab a high-end ATI card to test it with my use case, since on paper the numbers are really impressive.

                                Any further links to detailed ATI tech specs (related to OpenCL tuning) or up-to-date benchmarks are still relevant to the topic!

                                Thanks for the input.

                                  • Existing/future GPUs
                                    ryta1203

                                    Another problem with the OpenCL benchmarks I have seen in papers is that the authors often code for Nvidia, then take that Nvidia-optimized code, run it on ATI cards and compare. It's mostly nonsense.

                                    Most of the "GPU" papers that come out are done on Nvidia GPUs (and yet are still titled "for the GPU", which again is nonsense).

                                      • Existing/future GPUs
                                        edward_yang

                                         

                                        Originally posted by: ryta1203
                                        > Another problem with OpenCL benchmarks that I have seen from papers is that they often code for Nvidia then take that Nvidia optimized code and run it on ATI cards and compare, it's mostly nonsense.
                                        >
                                        > Most of the "GPU" papers that come out are done on Nvidia GPUs (and yet still titled "for the GPU", which again is nonsense).

                                        The question is: why aren't there more papers written with ATI's GPUs in mind? Why do they always (or at least mostly) optimize for CUDA?

                                         

                                          • Existing/future GPUs
                                            bubu

                                            GFlops != Speed. There are many other factors involved, like branching, caches, occupancy and register pressure, type and quantity of VRAM, drivers, the JIT compiler's optimizations, etc.

                                            For instance, I found NVIDIA cards good at dealing with heavily branched code and scalar operations... but ATI cards are much faster at linear operations due to the wider wavefronts. So it really depends a lot on your code.

                                            Don't get fooled by GFlops numbers; just try it yourself and decide what's best for your program and your pockets.

                                            On the other hand, you will probably need to get ALL the cards so you can properly test compatibility across cards and vendors... so every time I find a thread asking "What should I buy? NVIDIA or ATI, etc..." I laugh, because the absolute truth is "You'll need both and, yes, you'll need to optimize manually for each platform".

                                            Just my 2 cents.

                                             

                                              • Existing/future GPUs
                                                zeland

                                                 

                                                Originally posted by: bubu
                                                > GFlops != Speed. ... "What should I buy? NVIDIA or ATI, etc..." I laugh because the absolute truth is "You'll need both and, yes, you'll need to optimize manually for each platform".
                                                >
                                                > Just my 2 cents.

                                                 

                                                I couldn't agree more.

                                                • Existing/future GPUs
                                                  rotor

                                                   

                                                  Originally posted by: bubu
                                                  > GFlops != Speed. There are many other factors involved like branching, caches, occupancy and register pressure, type and quantity of VRAM, drivers, JIT compiler's optimizations, etc...
                                                  >
                                                  > For instance, I found NVIDIA cards good dealing with very branched code and scalar operations... but ATIs are much faster performing linear operations due to the wider wavefronts. So it really depends a lot on your code.
                                                  >
                                                  > Don't get fooled by GFlops numbers, just try yourself and decide what's best for your program and pockets.
                                                  >
                                                  > On the other hand, you probably will need to get ALL the cards, so you can test well the compatibility across cards and different vendors... so every time I find a thread with that "What should I buy? NVIDIA or ATI, etc..." I laugh because the absolute truth is "You'll need both and, yes, you'll need to optimize manually for each platform".
                                                  >
                                                  > Just my 2 cents.

                                                   

                                                   

                                                   

                                                   

                                                  I totally agree with bubu. Buying a GPU based only on GFLOPS is like buying a camera based only on the sensor's pixel count (manufacturers use the number of pixels to fool non-techies). The performance of a GPU depends on many aspects, and mostly on your application (algorithm).

                                                  To correct the posts about GFLOP counting on Nvidia vs. ATI, I have some pointers:

                                                  - Nvidia Fermi's double-precision rate is half of its single-precision rate, so the 1/8 figure that someone mentioned is not true. Basically, each Fermi multiprocessor (a.k.a. compute unit) has 32 single-precision ALUs; for double precision, two single-precision ALUs are combined to perform one double-precision op. For more, see http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

                                                  - The ATI 5xxx series packs 5 ALUs into a core: 4 single-precision ALUs that together execute a size-4 Very Long Instruction Word (VLIW), a.k.a. a vector operation, plus 1 special function unit. Each multiprocessor has 16 of those cores. So basically you can fully benefit from the ATI 5xxx series if you work with vectors of size 4. If you don't, you lose a lot of performance, because the 4 ALUs in a core on an ATI card cannot operate independently. You can refer to the ATI programming guide and ATI architecture docs for more information. That's also the reason why Nvidia says there is not much benefit in using vector data types on their cards: they don't pack multiple ALUs into a core to form VLIWs.

                                                  - The ATI 5870 performs a double-precision op by using the 4 ALUs in a core together. That's why you get the 1/5 ratio (4 ALUs + 1 SFU in a core).

                                                  - Nvidia Fermi has a much larger maximum work-group size (1024, compared to 256 on the ATI 5870). Fermi also has up to 48 KB of shared local memory, compared to only 16 KB on the 5870. These two things give you a lot of benefit if you want to work with a large group of data locally. Sometimes the small work-group size on ATI cards is really maddening, because it is hard to reach the maximum size of 256 (16x16) if your kernel uses a large number of registers. If you specifically have to work with square arrays whose size is a power of 2, then you have to settle for a work-group size of 64 (8x8) or, even worse, 16 (4x4). In my experience, Fermi's large work-group size gives me a lot of flexibility.

                                                  - ATI has a larger number of compute units (a.k.a. multiprocessors, processor groups...): 20 in the ATI 5870 vs. 15 in the Nvidia 480.

                                                  So from the architecture perspective, if you do not work with vector data types, Nvidia will probably win, since the number of cores available for single-element data types in a compute unit is larger, the local memory is larger, the supported work-group size is larger... If you have vector data types (size 4 is optimal), ATI may win.
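
To illustrate the work-group-size point above (256 gives 16x16, but anything lower drops you to 8x8 or 4x4), here is a small Python sketch. It only does the arithmetic; `largest_square_pow2_tile` is a made-up helper name, and the limits passed in are example values. On a real device, the per-kernel limit after register usage is taken into account is what OpenCL reports through `clGetKernelWorkGroupInfo` with `CL_KERNEL_WORK_GROUP_SIZE`.

```python
def largest_square_pow2_tile(max_work_group_size):
    """Largest power-of-two side s such that an s x s work-group fits.

    max_work_group_size is the limit for the kernel in question, which can
    be lower than the device maximum when the kernel uses many registers.
    """
    if max_work_group_size < 1:
        raise ValueError("work-group size limit must be at least 1")
    side = 1
    # Keep doubling the side while the next square still fits in the limit.
    while (side * 2) ** 2 <= max_work_group_size:
        side *= 2
    return side


if __name__ == "__main__":
    # Example limits: device maxima (256, 1024) and two register-limited cases.
    for limit in (1024, 256, 128, 32):
        side = largest_square_pow2_tile(limit)
        print(f"limit {limit:4d} -> {side}x{side} = {side * side} work-items")
```

Note how a limit of 128 already forces an 8x8 tile (64 work-items), because 16x16 needs the full 256.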

                                                    • Existing/future GPUs
                                                      nou

                                                      Just some notes.

                                                      Yes, you get 1/2 DP vs. SP FLOPS on Fermi, but only on the professional Quadro cards. On consumer-level GeForce you are limited to 1/8 of the SP FLOPS in DP.

                                                      5xxx cards have 32 kB of shared local memory. IIRC, 4xxx cards report 32 kB too, but it is emulated in global memory, so it is quite irrelevant.

                                                        • Existing/future GPUs
                                                          himanshu.gautam

                                                          I would like to add that the world's fastest GPU is the 5970 (Hemlock). So if you need a GPU for high-end graphics applications, where independent instructions are easily available, ATI GPUs definitely win the race.

                                                          In the end, the choice of the best GPU depends on:

                                                          What algorithm you need to run on it.

                                                          How much you are willing to pay for it.

                                                          • Existing/future GPUs
                                                            rotor

                                                             

                                                            Originally posted by: nou
                                                            > just some notes.
                                                            >
                                                            > yes you get 1/2 DP vs SP FLOPS on fermi. but only on professional Quadro card. on consumer level GeForce you are limited only to 1/8 of DP FLOPS.
                                                            >
                                                            > 5xxx cards have 32kB shared local memory. IIRC and 4xxx report 32kB too but it is emulated in global memory so it is quite irrelevant.

                                                             

                                                            Hi Nou,

                                                            Yeah, you are right, 5xxx cards have 32 KB of shared memory. Sorry for my confusion (with previous generations).

                                                            However, the rumor that GeForce-level Fermi cards only have double-precision performance equal to 1/8 of single precision may not be entirely true (the rumors appeared very early, when Nvidia first announced the Fermi family, and things may have changed since then). According to the official Nvidia documentation (CUDA Programming Guide 3.2 and 3.0), all devices with compute capability 2.0 have DP performance = 1/2 SP performance. The 2.0 cards include the GTX 480, 470 and 465; the Quadro 6000, 5000 and 5000M; and the Tesla C2050. I attached the images here:

                                                            http://picasaweb.google.com/lh/photo/WxmmviA-X3ntenTH7zSc4w?feat=directlink

                                                            http://picasaweb.google.com/lh/photo/NZMZRtwHIKXeFAvhFJhh7Q?feat=directlink

                                                            (You can also refer to the CUDA guide 3.2, p. 94 and p. 100.)

                                                             

                                                             

                                                              • Existing/future GPUs
                                                                mjharvey

                                                                > According to Nvidia official doc (CUDA programming guide 3.2 and 3.0), all the devices that have compute capability 2.0 have DP performance = 1/2 SP

                                                                It might say so there, but it ain't so. GTX 4[678]0 parts have DP rates 1/8 of their SP rates.

                                                                 

                                                                 

                                                                 

                                                                 

                                                                  • Existing/future GPUs
                                                                    rotor

                                                                    Yes, maybe, mjharvey. But do you know of any way to query that information from the device? Otherwise, all the statements we make here are still just our "guess".

                                                                      • Existing/future GPUs
                                                                        cjang

                                                                        I have spent a lot of time exploring performance characteristics
                                                                        of ATI GPUs using OpenCL.

                                                                        > is double-precision performance on ATI "that good" ?

                                                                        It surprised me. DGEMM on the 5870 reaches around 65% of peak.
                                                                        When including PCIe bus data transfers, even without DMA,
                                                                        performance is still over 20% utilization. So DP performance
                                                                        can be very good.

                                                                        DGEMM with host to GPU bus data transfer: http://golem5.org/gatlas/CaseStudyGATLAS_files/CaseStudyGATLAS_htm_m28ceac04.jpg

                                                                        DGEMM kernel only: http://golem5.org/gatlas/CaseStudyGATLAS_files/CaseStudyGATLAS_htm_m5306526d.jpg

                                                                        DGEMM with GPU to host data transfer: http://golem5.org/gatlas/CaseStudyGATLAS_files/CaseStudyGATLAS_htm_5d453410.jpg
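
For anyone who wants to reproduce utilization figures like the ones above, a minimal Python sketch of the usual arithmetic follows. It assumes the conventional 2*M*N*K flop count for a matrix multiply and a 5870 DP peak of 544 GFLOPS (1/5 of an assumed 2720 GFLOPS SP peak); neither number comes from my measurements, and the function names are invented for this example.

```python
# Assumption: HD 5870 DP peak = 1/5 of a 2720 GFLOPS single-precision peak.
HD5870_DP_PEAK_GFLOPS = 2720.0 / 5.0


def dgemm_flops(m, n, k):
    """Conventional flop count for C = A * B with A (m x k) and B (k x n)."""
    return 2.0 * m * n * k


def achieved_gflops(m, n, k, seconds):
    """GFLOPS reached when the multiply takes the given wall-clock time."""
    return dgemm_flops(m, n, k) / seconds / 1e9


def utilization(gflops, peak_gflops):
    """Fraction of the theoretical peak that was reached."""
    return gflops / peak_gflops


if __name__ == "__main__":
    # What 65% and 20% of the assumed peak mean in absolute terms.
    for fraction in (0.65, 0.20):
        print(f"{fraction:.0%} of peak = {fraction * HD5870_DP_PEAK_GFLOPS:.1f} GFLOPS")
    # Hypothetical timing: a 4096^3 DGEMM that took 0.4 seconds.
    g = achieved_gflops(4096, 4096, 4096, 0.4)
    print(f"{g:.1f} GFLOPS, utilization {utilization(g, HD5870_DP_PEAK_GFLOPS):.1%}")
```

Whether the time includes the PCIe transfers or only the kernel is what separates the two kinds of figures quoted above.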

                                                                        I see DP as one of the bright spots of OpenCL on ATI GPUs.
                                                                        It maximizes strengths and minimizes weaknesses. DP also
                                                                        happens to be what scientific and quantitative applications
                                                                        generally need in real-life.

                                                                        > Is there a catch?

                                                                        There are different sets of tradeoffs.

                                                                        NVIDIA has a deeper, more mature software stack and body of
                                                                        research and applications experience around their GPGPU products.
                                                                        They also have ECC memory in the Tesla cards. That is a must-have
                                                                        for some users.

                                                                        ATI has both lower prices and higher peak performance. The GPUs
                                                                        are more efficient and probably a lot cheaper to manufacture. ATI
                                                                        clearly wins for value if you look at the hardware. However, OpenCL
                                                                        is still quite new on ATI GPUs. It was first released around a year
                                                                        ago. Before that, your options were IL/ISA or stuff like Brook,
                                                                        which is no longer supported.

                                                                        I believe the difference in relative pricing is mostly due to the
                                                                        economic value the market has placed on software development and
                                                                        support costs.

                                                                        > So basically you can fully benefit from ATI 5x series if you work
                                                                        > with vector size of 4.

                                                                        My experience is that it is somewhat more complex than this. The
                                                                        optimal vector length varies so that float2 is often faster than
                                                                        float4.

                                                                        Matrix multiply with memory buffers: http://golem5.org/gatlas/CaseStudyGATLAS_files/CaseStudyGATLAS_htm_m2c6b117f.jpg

                                                                        Traditional loop based code transformations like strip mining
                                                                        (vectorizing), interchange, tiling, etc work very well on ATI GPUs.
                                                                        The result is high performance in OpenCL. That's a good story.
                                                                        However, with the current state of compiler technology, the
                                                                        developer must do all of this by hand.
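
As a plain illustration of what "tiling by hand" means, here is a Python sketch of a naive matrix multiply next to a tiled version. It is not GPU code and not what GATLAS generates; it only shows the loop transformation itself, with `matmul_naive`, `matmul_tiled` and the tile size chosen for this example.

```python
def matmul_naive(a, b):
    """Reference triple loop: C = A * B for lists of lists."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            c[i][j] = s
    return c


def matmul_tiled(a, b, tile=2):
    """Same product, with all three loops split into tile-sized blocks.

    Each (i0, j0, p0) block touches only a small part of A, B and C, which
    is the data you would keep in registers or local memory on a GPU.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        s = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            s += a[i][p] * b[p][j]
                        c[i][j] = s
    return c


if __name__ == "__main__":
    a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    b = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
    print(matmul_naive(a, b))
    print(matmul_tiled(a, b, tile=2))
```

In an OpenCL kernel the same restructuring decides which values sit in registers and which loops are unrolled or vectorized, and the best tile size has to be found by experiment.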

                                                                        • Existing/future GPUs
                                                                          moozoo

                                                                           

                                                                          Originally posted by: rotor
                                                                          > Yes may be mjharvey . But do you know any way to query that information from the device, otherwise all statements we give here are still our "guess"


                                                                          http://forums.nvidia.com/index.php?showtopic=164417

                                                                          If the particular kernel is memory-bandwidth limited, then the effect can be masked, i.e. on some problems the GTX480 can have higher DP throughput due to its higher clock -> higher memory bandwidth.
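
A simple roofline-style estimate shows how the masking works. This Python sketch is only a model: attainable throughput is taken as the smaller of the compute peak and bandwidth times arithmetic intensity. The peak and bandwidth numbers are approximate published figures (an assumption, not measurements), the comparison against a Tesla C2050 is my reading of the statement above, and `attainable_gflops` is a name made up for the example.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline-style bound: limited by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


# Assumed approximate specs: DP peak in GFLOPS, memory bandwidth in GB/s.
GTX480 = {"dp_peak": 168.0, "bandwidth": 177.4}
TESLA_C2050 = {"dp_peak": 515.0, "bandwidth": 144.0}

if __name__ == "__main__":
    # Arithmetic intensity = flops performed per byte moved from memory.
    for intensity in (0.25, 1.0, 4.0):
        g = attainable_gflops(GTX480["dp_peak"], GTX480["bandwidth"], intensity)
        t = attainable_gflops(TESLA_C2050["dp_peak"], TESLA_C2050["bandwidth"], intensity)
        print(f"{intensity:4.2f} flop/byte: GTX480 {g:6.1f}  C2050 {t:6.1f} GFLOPS")
```

With these assumed numbers, the capped DP rate only starts to matter once a kernel does roughly one flop or more per byte it reads or writes.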

                                                                          My guess is that the chips start out the same. A fuse is blown on the chips headed for the graphics card market, which cripples their DP performance.

                                                                          I worry that AMD might follow Nvidia's example and cripple DP or, worse, design for much lower DP performance in the 6xxx series.

                                                                           

                                                                          • Existing/future GPUs
                                                                            dravisher

                                                                             

                                                                            Originally posted by: rotor
                                                                            > Yes may be mjharvey . But do you know any way to query that information from the device, otherwise all statements we give here are still our "guess"

                                                                            Nvidia does confirm here that the consumer cards' DP performance is limited to 1/8 of SP, so I'd say the issue is pretty much settled. Specifically, they state:

                                                                            > Double precision is 1/2 of single precision for Tesla 20-series, whereas double precision is 1/8th of single precision for GeForce GTX 470/480

                                                                            I guess the issue is that this isn't exactly something they advertise about the consumer cards. Indeed, plenty of people aren't aware of it; I've even spoken with people in the scientific community using GTX480s who thought they had a 1/2 DP rate!

                                                                              • Existing/future GPUs
                                                                                rotor

                                                                                 

                                                                                Originally posted by: dravisher
                                                                                > Originally posted by: rotor
                                                                                > > Yes may be mjharvey . But do you know any way to query that information from the device, otherwise all statements we give here are still our "guess"
                                                                                >
                                                                                > Nvidia do confirm that the consumer card's DP performance is limited to 1/8 of SP here, so I'd say the issue is pretty much settled. Specifically they state:
                                                                                >
                                                                                > > Double precision is 1/2 of single precision for Tesla 20-series, whereas double precision is 1/8th of single precision for GeForce GTX 470/480
                                                                                >
                                                                                > I guess the issue is that this isn't exactly something that they advertise about the consumer cards. Indeed there are plenty of people who aren't aware of this, I've even spoken with people in the scientific community using GTX480s who thought it had 1/2 DP rate!

                                                                                 

                                                                                 

                                                                                Thank you all for the informative discussion. Yeah, it is kind of bad when Nvidia does this. I have known about this issue for a while but still didn't want to believe it, and what dravisher posted here confirms the problem once again. However, I feel really bad that Nvidia gives two contradictory statements in two different documents. As in the links I posted above, the CUDA Programming Guide 3.2 and 3.0 still list the GTX470/480 under compute capability 2.0 alongside the Tesla cards, which have 1/2 DP (and maybe that's why "people in the scientific community using GTX480s ... thought it had 1/2 DP rate!").

                                                                                Anyhow, back to the performance of the cards: I think it really depends on how well you optimize your code for your given hardware.

                                                            • Existing/future GPUs
                                                              ryta1203

                                                               

                                                              Originally posted by: _Big_Mac_ You're getting maimed by peak flops.

                                                              Peak flops is the theoretical maximum derived analytically from clock frequency times number of ALUs times something, it does NOT indicate how well a card actually performs in real problems. You need benchmarks for that. I haven't found many yet, here are some from Anandtech

                                                              http://images.anandtech.com/graphs/nvidiageforcegtx480launch_032610115215/22215.png

                                                              http://images.anandtech.com/graphs/nvidiageforcegtx480launch_032610115215/22216.png

                                                              Note that these two graphs are nowhere near an objective comparison of "compute" performance. You'd need to find more and make sure they describe roughly the same algorithms you'll be needing.

                                                              There's no way to judge a piece of hardware's performance based on a single number (especially peak flops). It's not even a good indication. If you were concerned with N-queens-like problems, a 1 TFLOP single-chip GTX285 would be more than twice as fast as a 4.5 TFLOP dual-chip 5970 and over 10x faster than a more comparable 4890 (1.3 TFLOPS). If you did password cracking on the other hand, 5970 would be five times faster than a GTX 295 (http://www.brightsideofnews.com/news/2010/3/16/ati-radeon-hd-5970-is-the-king-of-iphone2c-wi-fi-password-cracking.aspx).

                                                              So if you value your money, you should look hard for good benchmarks, making sure they are comparing apples to apples and are relevant to your field.
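The "clock frequency times number of ALUs times something" formula from the quote can be sketched as follows. This assumes the "something" is 2 flops per ALU per cycle (one multiply-add). The ALU counts and clocks below are commonly published specs, not figures from this thread, and `peak_gflops` is a hypothetical helper name.

```python
def peak_gflops(num_alus, clock_mhz, flops_per_cycle=2):
    """Theoretical peak in GFLOPS: ALUs x clock x flops per ALU per cycle."""
    return num_alus * clock_mhz * flops_per_cycle / 1000.0

# Assumed specs (not from the thread): (ALU count, shader/engine clock in MHz).
cards = {
    "GTX 480": (480, 1401),
    "HD 5970": (2 * 1600, 725),  # dual-GPU card, both chips counted
}

for name, (alus, mhz) in cards.items():
    print(f"{name}: {peak_gflops(alus, mhz):.2f} GFLOPS single precision (peak)")
```

Under these assumptions the GTX 480 line reproduces the 1344.96 GFLOPS figure cited earlier in the thread. The calculation knows nothing about memory bandwidth, branching, or how well a kernel maps onto the ALUs, which is exactly why it cannot stand in for a benchmark.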

                                                               

                                                              LOL, that is probably PCCHEN's N-queen solver: http://forum.beyond3d.com/showthread.php?p=1415730#post1415730

                                                               

                                                              That code is NOT optimized for ATI GPUs. I believe it's actually quite easy to get a solid performance increase from it with just a few techniques. In fact, my "Profiler Question" thread refers to that code.

                                                              This is the EXACT nonsense I was talking about in the above post, nonsense!

                                                                • Existing/future GPUs
                                                                  _Big_Mac_

                                                                  Never said either of those benchmarks were any good

                                                                  It's very true that finding apples-to-apples comparisons is tricky, especially since the GPUs' designs are different. Even with 'portable' OpenCL, you'd still write an AMD-optimized kernel using, e.g., vector instructions, in contrast to NV's scalar programming style. And when you have different algorithms, one vectorized and another scalar, someone is bound to call "oranges". How will you prove your AMD implementation isn't more fine-tuned to a Radeon than your NVIDIA implementation is to a GeForce?

                                                                  So, same code is unfair, different code is difficult to compare...

                                                                    • Existing/future GPUs
                                                                      ryta1203

                                                                      Either way, the N-queen solver Ryan Smith (anandtech) uses is now mostly outdated, since PCCHEN (whose code they used) has somewhat vectorized his code and eliminated CF from at least one kernel, getting some decent speedup.

                                                                      Also, it's not whether code is unfair, it's whether the comparisons are absurd. The fact that the "community" insists on lumping all GPUs together and "assuming" that they are "pretty much the same" is silly, that's all I'm saying.

                                                                      I've seen A TON of "GPU" papers where the optimization or algorithm was only coded on Nvidia. I'm not saying, from a review standpoint, they have to do both, but at least give the paper a proper title and stop trying to expand your work beyond its scope. I really find this so annoying.

                                                                      Particularly because if it's done with Nvidia no one cares, but if you do it with AMD then everyone asks "does this work for Nvidia? and if not, why should we care?"

                                                                      "The world outside CUDA", LOL.

                                                                        • Existing/future GPUs
                                                                          douglas125

                                                                          Sorry to bring up such an old thread, but I gotta ask: what are the texture cache sizes (read_imagef in mind) for the high-end GPUs around?

                                                                          I hear all the Radeon 5000 GPUs and GeForce 200s have 16 KB and that the GeForce 400s have 128 KB. Is that correct?

                                                                          And I'm dying of curiosity: what will the texture cache size of the 6000 series be? Please answer "256 KB" or more.

                                                          • Existing/future GPUs
                                                            jcpalmer

                                                            Pretty good points.  I only need single precision myself.  If there is a catch, it is that the 5970 is, I think, a 2-GPU card that was produced in very low quantities.  The 2-GPU part means a lot more work on the software dev side to unlock, so the total cost looks different.

                                                            Making your point with the 5870 would sidestep this reservation.

                                                            • Existing/future GPUs
                                                              MicahVillmow
                                                              nou,
                                                              We also expose double support on the 47XX-49XX series of cards (though I'm not 100% sure about the HD4770, I believe it also has double).