17 Replies Latest reply on Jan 30, 2012 11:34 AM by Meteorhead

    Recommendations for building a GPU number cruncher

    toffyrn

      We are discussing/planning on building a computer at school to experiment with parallelization within one machine, and then in particular GPGPU.

      I would appreciate someone explaining the difference between typical gaming cards and GPGPU cards (such as the NVIDIA Tesla or AMD FireStream). Looking at the specs, a 6970 would outperform the other cards in every way other than memory size. If that is correct, I don't see why anyone should buy the pro cards? :P

      We are planning to run this with Linux (probably Ubuntu); are there any difficulties with having two devices available for OpenCL?

      Our current draft looks like this: 2x 6-core AMD CPU, 2x Radeon HD 6970, 48 GB RAM.

      Feel free to comment on this layout!

        • Recommendations for building a GPU number cruncher
          Meteorhead

          There are a few differences between consumer and professional cards, but mainly on NVIDIA's side.

          NV Tesla cards offer ECC RAM, which ensures the correctness of long-running simulations. In gaming it really doesn't matter if one frame in 10,000,000,000 is messed up, without any correction or prior notice. In scientific computing, however, an error is likely to be carried forward, corrupting the entire simulation. Regular GeForce cards use everyday GDDR5 and have their double-precision capability held back artificially, in order to push customers toward the more expensive Teslas.
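
          For a rough sense of how often one might actually encounter such errors, here is a back-of-envelope sketch. The FIT (failures-in-time) rate, memory size, and run length are assumed illustrative values, not measured figures for any particular card:

```python
# Back-of-envelope soft-error estimate for non-ECC graphics RAM.
# All figures below are ASSUMED for illustration only.
FIT_PER_MBIT = 100.0   # assumed soft errors per 1e9 device-hours per Mbit
MEM_GB = 2             # e.g. a 2 GB consumer card
RUN_DAYS = 30          # a month-long simulation

bits_mbit = MEM_GB * 8 * 1024        # memory size in Mbit
hours = RUN_DAYS * 24
expected_flips = FIT_PER_MBIT * 1e-9 * bits_mbit * hours

print(f"expected bit flips over {RUN_DAYS} days: {expected_flips:.2f}")
```

          With these assumptions you expect roughly one uncorrected flip per month of continuous running, which is exactly the regime where a long simulation either needs ECC or some form of result verification.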

          FirePros, as far as I know, do not feature ECC RAM (but please correct me if I'm wrong); they are tuned for CAD applications, not necessarily even OpenCL. At the moment I really don't see any reason to buy FirePros for scientific use. Radeons feature all the capability of a given generation, and dual-GPU cards are equally good: they function as two independent cards, but inserted into the same PCI-E slot.

          Multi-GPU support is... well... unofficial. It works (most of the time), except when the drivers or the runtime are messed up (like now). In those cases you have to wait either 1-3 months for a driver fix, or 6 months for a runtime fix.

          On the NV side, OCL 1.1 production drivers were released just a month ago, but there, if something is buggy, you have to wait 0.5-1.5 years for a fix, since driver and SDK releases take an extremely long time.

          Your build seems good enough for testing and learning. If you wish to increase capacity, I would install dual-GPU solutions, but most likely wait for the HD 7xxx series coming soon, which will increase capacity and capability by a LOT.

            • Recommendations for building a GPU number cruncher
              toffyrn

              Thank you for a good answer :D


              We will be using it for computations in many-body quantum systems (quantum chemistry and nuclear modelling). Because of the size of our systems, double precision would be essential, and thus maybe ECC too? How often might one encounter errors with regular memory?


              When it comes to cards, would a 6990 be easier to work with than 2x 6970 offering similar performance? If anyone has tried multi-GPU setups in Linux I would appreciate feedback on this! :)

              As my thesis should be completed by June 2012, I would need to run tests and get some results before spring 2012. I would therefore prefer to get this machine up and running within a few months, instead of waiting for the Radeon HD 7000 series.

                • Recommendations for building a GPU number cruncher
                  Meteorhead

                  I have 3 HD 5970s in a machine running Ubuntu, and it worked fine up until SDK 2.5. After the update, the system crashes upon multi-GPU usage.

                  About ECC RAM: I have been programming GPUs ever since OpenCL came to be (more than 2 years ago), and I have never encountered such a situation, or it wasn't visible. I have heard that if you copy from global memory to registers and back, and do this for a long time, the GPU will eventually make a mistake. With ECC RAM, that "eventually" shifts out to practically never. This is not a real threat (in my opinion), but having ECC RAM does add a feel of professionalism.

                  About software support, experiences do differ. I hardly get replies on the NV forums, not to mention official support. I feel there is much better conversation with the developers here. (But please, let's not start that argument here. It's a matter of luck.)

                  • Recommendations for building a GPU number cruncher
                    eugenek

                     

                    Originally posted by: toffyrn

                    Thank you for a good answer :D

                    We will be using it for computations in many-body quantum systems (quantum chemistry and nuclear modelling). Because of the size of our systems, double precision would be essential, and thus maybe ECC too? How often might one encounter errors with regular memory?

                    When it comes to cards, would a 6990 be easier to work with than 2x 6970 offering similar performance? If anyone has tried multi-GPU setups in Linux I would appreciate feedback on this! :)

                    As my thesis should be completed by June 2012, I would need to run tests and get some results before spring 2012. I would therefore prefer to get this machine up and running within a few months, instead of waiting for the Radeon HD 7000 series.

                    A 6990 is somewhat slower than 2x 6970. The advantage of a 6990 is that you can have two of them. But that will cost you craploads of money in power and cooling (you need something like a 1000 W power supply and water cooling to drive two 6990s without jeopardizing stability).
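
                    To put rough numbers on the power question (the board-power figures below are approximate assumptions, not official specs):

```python
# Rough PSU sizing for a dual-HD 6990 build.
# TDP/board-power numbers are approximate assumptions.
parts = {
    "2x HD 6990":      2 * 375,  # ~375 W board power each
    "2x 6-core CPU":   2 * 115,
    "board/RAM/disks": 100,
}
total = sum(parts.values())      # worst-case draw in watts
psu = total / 0.8                # keep the PSU below ~80 % load

print(total, round(psu))         # → 1080 1350
```

                    So under these assumptions a 1000 W supply is really the bare minimum for such a box, and a larger unit with headroom is safer.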

                    I wouldn't bother with ECC RAM, unless it's a kind of task where you need to run computations for a week continuously and a single memory glitch could render the whole result useless.

                    I'd also strongly consider the upcoming LGA-2011 Sandy Bridge CPUs instead of AMD to build the main platform. AMD has some significant advantages in the GPU field, but CPUs are a different story.

                      • Recommendations for building a GPU number cruncher
                        settle

                         

                        Originally posted by: eugenek

                        I'd also strongly consider the upcoming LGA-2011 Sandy Bridge CPUs instead of AMD to build the main platform. AMD has some significant advantages in the GPU field, but CPUs are a different story.

                        I disagree with your recommendation. I think it comes from wanting to pair devices with the best benchmarks (which may or may not reflect one's actual use cases), but there are other things to consider. For one, Sandy Bridge CPUs don't give you access to the on-die GPU the way the AMD Fusion APUs do: Intel has made no public commitment to exposing it in their OpenCL SDK, whereas AMD Fusion APUs have this support now. Secondly, while you certainly aren't required to, pairing an AMD CPU and GPU makes developing and getting support for OpenCL code much easier, especially if you have an APU, as you can use a single OpenCL context to share all memory objects.

                    • Recommendations for building a GPU number cruncher
                      maximmoroz

                      > On the NV side, OCL 1.1 production drivers were released just a month ago, but there, if something is buggy, you have to wait 0.5-1.5 years for a fix, since driver and SDK releases take an extremely long time

                      Well, my experience is different. I have submitted 2 bugs to NVIDIA. The first one was fixed in a month; for the second I was told that they are aware of the bug and that the next release of the software (a major redesign) will fix it.

                      I submitted one bug to AMD several months ago and it is not fixed yet.

                    • Recommendations for building a GPU number cruncher
                      MicahVillmow
                      maximmoroz,
                      As far as I know, your bug should be fixed in the next release. We are working on improving our systems so hopefully some of the issues you have had in the past will be resolved.
                      • Re: Recommendations for building a GPU number cruncher
                        Meteorhead

                        Hi, let me revive this topic with a question:

                        We are applying for funding for a GPU cluster at our institute, and we are considering single-node/many-GPU machines to enable low-latency access across many GPUs for serious multi-GPU simulations. In fact I have several questions:

                        • Is there any information on ECC-enabled HD 7970/7990 devices coming in the near future? Tahiti was waving the flag of ECC support, but I don't see any products that actually have ECC. Most likely the FirePros will feature it, but some assurance would be welcome.
                        • Is there a limit to how many cards the driver can recognize in one machine? If yes, what would it take to remove the limit?
                        • Does anyone have experience with PCI-E extenders and AMD/ATI cards? We are particularly interested in the following two products (but any experience is welcome):

                                  http://www.cubix.com/content/gpu-xpander-rack-mount

                                  http://www.magma.com/Products/basic/pcie/expressbox16/ExpressBox%2016%20Basic%20DS%20%28Web%29%2070-02360-50-A.pdf

                                  The reason I ask is that all of these companies advertise themselves as being CUDA-ready, but I doubt that anything vendor-specific would be needed to get them working. My guess is they say CUDA-ready because that is what everybody cares about. Would fglrx recognize cards behind such extensions?

                        • Does anyone know of a TRULY single-slot watercooling block for any AMD card? The adverts all say they are truly and magnificently single-slot cooling solutions, but the pipes all extrude into the neighbouring slot, so there is no way of placing two cards in neighbouring slots (especially not 16 of them, as would be possible with the Magma extension box). Aquacomputer is the only company that seemed to sell such water blocks, but they seem to have ceased to exist like 8 years ago.
                        • Does anyone know a viable AMD-based motherboard for a serious GPU number-cruncher? Something like the well-known TYAN beast: www.tyan.com/product_SKU_spec.aspx?ProductType=BB&pid=412&SKU=600000188
                        • Is there a plan to create dual-GPU HPC cards with ECC?
                        • Will there ever be server processors with strong IGPs?

                         

                        I have not had a chance to try the latest Catalyst, as getting 11.12 working was messy enough (even in a single-GPU config) at our site. I still have to get multi-GPU working, but first I have to finish some things while the computer is busy. And this is where I reach the bad part. This cluster would be a joint grant by many research groups at the institute, so if there is to be an AMD number cruncher beside a Tesla-based one, it must have great reliability. And truth be told, my experience with AMD under Linux is not the best (to put it politely). I feel there is a risk to buying AMD; however, there will never be another chance when so many groups join forces to create a serious GPU cluster. I could say this is a "now or never" chance.

                         

                        Some people say that their absolute priority is DP performance and ECC. Now... where exactly is ECC on AMD cards? (And this is where I would repeat all my questions.) Any answers are welcome.

                          • Re: Recommendations for building a GPU number cruncher
                            Skysnake

                            Ok, hi,

                            there is currently no AMD card with ECC. The FirePros available now are old VLIW cards without ECC. The GCN FirePros should arrive by Q3, if I remember right.

                            Keep in mind that on a dual-GPU card the PCI-E bandwidth is split between the two GPUs. That is why I think you will NEVER see a dual-GPU FirePro card: it can be a big performance hit, because the two GPUs are just linked on the same PCB and work more or less like two cards sharing a single PCI-E interface.

                            Also, with more than 2 cards in a system, you normally get less than x16 PCI-E lanes per card physically. So you lose bandwidth there too, and not only a little.

                            So really, really think about your PCI-E bandwidth needs per card.

                            SB-E (Xeons) have 40 PCI-E 3.0 lanes per CPU. With a dual-socket system you can have 80 PCI-E 3.0 lanes in one machine, which gives you five full-speed PCI-E x16 slots. I think that will be the maximum, so please be really sure about how much transfer bandwidth you need. All the AMD systems have ONLY PCI-E 2.0, so you get just 0.5 GB/s per lane; PCI-E 3.0 gives you 1.0 GB/s per lane. That is a big difference.
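
                            The lane arithmetic behind those numbers, as a quick sketch (the per-lane rates are the approximate effective figures after encoding overhead):

```python
# Effective PCI-E bandwidth per lane, after encoding overhead:
# gen2 uses 8b/10b coding (~0.5 GB/s), gen3 uses 128b/130b (~1.0 GB/s).
GEN2_PER_LANE = 0.5   # GB/s, approximate
GEN3_PER_LANE = 1.0   # GB/s, approximate

x16_gen2 = 16 * GEN2_PER_LANE        # one full x16 slot, gen2
x16_gen3 = 16 * GEN3_PER_LANE        # one full x16 slot, gen3
dual_socket_lanes = 2 * 40           # two SB-E CPUs, 40 lanes each
full_x16_slots = dual_socket_lanes // 16

print(x16_gen2, x16_gen3, full_x16_slots)  # → 8.0 16.0 5
```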

                             

                            Sorry AMD, but you have missed the PCI-E 3.0 train.

                            And there should be some single-slot water coolers out there. Sorry, I do not know any of them by name, but I have seen pictures of single-slot coolers. I think the tubes are more of a problem than the cooler itself; a metal angle fitting should solve that. But remember that you have to cool more than just the GPU chip: the voltage regulators need cooling too! So build one machine first and really stress it, otherwise you will get really big problems when you fill your rack.

                            I hope this helps you.

                            If you need more answers, just ask.

                              • Re: Recommendations for building a GPU number cruncher
                                dmeiser

                                Dear Skysnake,

                                 

                                Just a comment: there are plenty of very important use cases in the HPC arena where PCI bandwidth is completely irrelevant. You often launch a kernel that takes minutes or hours to complete. It doesn't matter if it takes a few seconds to set up your data on the GPU (in a few seconds you can fill the entire global memory of the GPU even at 1 GB/s).
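
                                A quick sketch of that arithmetic (the memory size, transfer rate, and kernel runtime are illustrative assumptions):

```python
# Host-to-device transfer time vs. kernel runtime (illustrative figures).
mem_gb = 2.0       # fill the whole device memory
pcie_gbs = 1.0     # pessimistic 1 GB/s effective PCI-E rate
kernel_s = 3600.0  # a kernel that runs for an hour

transfer_s = mem_gb / pcie_gbs
overhead = transfer_s / (transfer_s + kernel_s)

print(f"transfer {transfer_s:.0f} s, overhead {overhead:.4%}")
```

                                Even at a pessimistic 1 GB/s, the transfer is a fraction of a percent of the total runtime.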

                                 

                                Cheers

                                  • Re: Recommendations for building a GPU number cruncher
                                    Skysnake

                                    I know this very well.

                                     

                                    But you should keep in mind that there are also plenty of problems with heavy communication, and the cluster will be used by more than one group. So you should always know whether this matters for you or not; after you have bought something, it is too late.

                                     

                                    Also, checkpointing etc. can make PCI-E bandwidth important.

                                     

                                    @Meteorhead:

                                    There should be no problem with the extension chassis. They should follow the PCI-E specification, so the card should not "see" that there is an extender/riser card in between.

                                     

                                    I think the driver will be your only real problem. I don't know if there are still problems when you want to use more than 4 cards in a single machine.

                                      • Re: Recommendations for building a GPU number cruncher
                                        Meteorhead

                                        I very much doubt that the limit is 4 now, as I managed to get 6 working nearly two years ago with standard procedures. The question is whether that limit was raised to 8 (or something like that), or to an arbitrary size, which I proposed a long time ago. If I'm not mistaken, the issue is that PCI-E addresses are not infinite; in fact they are restricted to a small number of bits, which limits how many GPUs can be addressed independently.

                                         

                                        Does anyone know of a reason why 16, 32, or 64 should not work? (A comment from an internal developer would be nice here, just as reassurance.)

                                • Re: Recommendations for building a GPU number cruncher
                                  Meteorhead

                                  There are several groups working on lattice simulations (of various sorts) where communication is kept to a minimum (several KB-MB moved between GPUs), but this communication has to happen every few iterations. Bandwidth is therefore naturally important, but latency is more important. Firing up LAN to do the communication also works, but it is highly sub-optimal. InfiniBand might work as well, but that has to go through two host systems' memory, which again adds significant latency. Single-node multi-GPU would be the best solution, and my aim would be for communication to hit x4-x8 among the cards, even if all of them decided to fire up traffic at the same time. A dual-Xeon configuration (as I would expect) should reach 50-60 GB/s between RAM and the PCI-E bus in both directions.
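
                                  A toy model of why latency can dominate for these small exchanges; every latency and bandwidth figure here is an assumption for illustration, not a measurement:

```python
# Per-message transfer time = latency + size / bandwidth (illustrative).
def xfer_time(size_mb, latency_us, bw_gbs):
    return latency_us * 1e-6 + (size_mb / 1024) / bw_gbs

msg_mb = 1.0  # a boundary exchange in the KB-MB range mentioned above
links = {
    "PCI-E x8, same node": (10, 4.0),    # assumed ~10 us, ~4 GB/s
    "InfiniBand via hosts": (20, 3.0),   # assumed, two host-memory hops
    "Gigabit LAN": (50, 0.125),
}
for name, (lat_us, bw) in links.items():
    print(f"{name}: {xfer_time(msg_mb, lat_us, bw) * 1e3:.3f} ms")
```

                                  Under these assumptions the LAN path is over an order of magnitude slower than same-node PCI-E for the same message.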

                                   

                                  We also have applications where some kernels might run for several days (yes, it's true), but there the communication is absolutely minuscule and a normal MPI cluster would suffice. As soon as I have the time, I would like to extend this surface growth simulation so that a single simulated surface can span multiple GPUs. In this case it takes roughly 0.01 seconds per iteration, and in the most naive approach I would have to move roughly 50-80 MB among GPUs every iteration. Most of the work consists of finding a way to delay communication so that it can happen roughly every 0.5 seconds, or to make it asynchronous. In this use case, MPI-type communication practically kills the simulation, not to mention that implementing this with an arbitrary number of GPUs covering a simulated surface... well, I would not attempt it unless it's possible to avoid MPI. (Avoiding OpenMP would also rock, so making single-thread, single-context multi-GPU possible would really be awesome.)
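
                                  Plugging in the numbers above (the effective inter-GPU transfer rate is an assumed value):

```python
# Communication-to-compute ratio for the surface-growth case.
iter_s = 0.01         # one iteration takes ~0.01 s
per_iter_mb = 65.0    # midpoint of the 50-80 MB per-iteration range
bw_gbs = 4.0          # assumed effective inter-GPU transfer rate

naive_comm_s = (per_iter_mb / 1024) / bw_gbs
naive_ratio = naive_comm_s / iter_s      # comm time per unit compute

batched_iters = round(0.5 / iter_s)      # exchange every 0.5 s instead

print(f"naive: comm takes {naive_ratio:.0%} of an iteration; "
      f"batched: one exchange per {batched_iters} iterations")
```

                                  Under these assumptions a naive per-iteration exchange costs more than the compute itself, which is why delaying or overlapping the communication is the main work.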

                                   

                                  Still waiting for answers or comments on the questions.

                                   

                                  Cheers,

                                  Máté

                                    • Re: Recommendations for building a GPU number cruncher
                                      nou

                                      Currently you need one context per GPU, otherwise execution becomes serialized on AMD. And you can become bottlenecked by the CPU, so IMHO you should consider using multithreading.
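
                                      The one-thread, one-context-per-device pattern described here can be sketched like this; the OpenCL calls are stubbed out with a placeholder computation (a real worker would create its own context, queue, and buffers for its device):

```python
# One worker thread per GPU; OpenCL calls replaced by a stand-in "kernel".
import threading

results = {}

def worker(device_id, data):
    # placeholder for: create a context for this device, build the
    # program, enqueue the kernel, read back the result
    results[device_id] = sum(x * x for x in data)

chunks = {0: [1, 2, 3], 1: [4, 5, 6]}   # one data chunk per device
threads = [threading.Thread(target=worker, args=(d, c))
           for d, c in chunks.items()]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)   # each device's partial result, keyed by device id
```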

                                        • Re: Recommendations for building a GPU number cruncher
                                          Meteorhead

                                          I understand that right now I need one thread and one context per GPU used, but for example, don't buffers need to be in the same context when I want to copy their contents to one another? Doesn't this sort of defeat the objective (namely separating the memory objects of all devices so they can only be copied through read/write operations)? If it were a single-node machine, I would use OpenMP, not MPI, but it would be a great relief if I didn't have to.

                                           

                                          Is there nobody who has put together a system with PCI-E extension cards and used AMD?