16 Replies Latest reply on Jun 21, 2011 1:27 PM by GPGPU_enthusiast

    8 GPUs work, 10 GPUs fail -> What's wrong?

    mrbpix

As I have previously documented, 8 GPUs (4 dual-GPU HD 5970 cards) work perfectly fine under Linux with CAL IL. I decided to try 10 GPUs, but the X11 fglrx driver segfaults in xdl_x750_atiddxDisplayScreenDestroy():

      http://blog.zorinaq.com/?e=46

      AMD: please fix your drivers :-)

        • 10 GPUs: fails; 8 GPUs: works
          moozoo

          This might be relevant

          http://fastra2.ua.ac.be/?page_id=214

          No doubt a similar problem would occur with AMD cards.

           

            • 10 GPUs: fails; 8 GPUs: works
              mrbpix
The Fastra guys encountered problems with PCI memory mapping allocation which prevented their machine from POSTing. But in my case the BIOS POSTs successfully, the kernel sees all 10 GPUs, and fglrx.ko loads without errors and also sees all 10 GPUs, so I don't think I am hitting the same problem. I theorize that the X11 SIGSEGV I see is caused by the fglrx driver using data structures hardcoded to handle a maximum of only 8 GPUs... Comments?
            • 10 GPUs: fails; 8 GPUs: works
              Meteorhead

              Hi!

              I do not know why that problem arises, and forgive me for asking something not quite related, but what is your machine layout?

I have been looking into possibilities for building dense GPU clusters similar to the NV Tesla 1U GPU racks. I created a topic under general discussions a while back, but nobody answered or commented. The reason I ask is because we have 3 5970s in a 3U machine and they fry each other.

              I saw that the reference 6990 will use a different cooling system more suited to 3U installations, but it is too costly for my group to buy 3-4 of them only to test out cooling issues.

If anyone would reflect on the topic, it would be really nice.

                • 10 GPUs: fails; 8 GPUs: works
                  mrbpix

I use flexible PCIe extenders to space the cards out by about 2 cm, which helps a lot with cooling. They don't overheat despite operating 24/7 at 100% load at an ambient temperature of 25 C:

                  http://blog.zorinaq.com/?e=42

I also have another 4 x 5970 machine with the cards in regular double-width slots (no extenders). It is possible to operate them without overheating by keeping the ambient temperature lower (20 C), by using a layout where the motherboard is horizontal and the cards are not screwed to a chassis (to give them some flexibility, given the mechanical imprecision of PCIe slots), and by inserting small plastic spacers between the cards to force a ~5 mm gap between the top edges of each card.

                  I will certainly try 4 x 6990 myself when they come out:

                  http://www.fudzilla.com/graphics/item/21635-amds-antilles-card-pixellized

                    • 10 GPUs: fails; 8 GPUs: works
                      Meteorhead

I was hesitant (and still am) to leave the cards unscrewed from the chassis, because they are so heavy I believe the PCIe slots would not hold them, especially if they are flexed apart. (The motherboard is quite expensive, so I really wouldn't want to break it.)

                      • 10 GPUs: fails; 8 GPUs: works
                        empty_knapsack

                         

                        Originally posted by: mrbpix

                        I will certainly try 4 x 6990 myself when they come out: http://www.fudzilla.com/graphics/item/21635-amds-antilles-card-pixellized

                         

My tests with the 6970 ended very disappointingly -- the 6970 is slower than the 5870 for MD4/MD5/SHA1. If the 6990 is designed the same way the 5970 was (i.e. the 5970 was a downclocked 5870x2, so the 6990 will be a downclocked 6970x2), it'll just be a waste of money. Perhaps it'll be possible to tune code for the 69xx family a bit, so the 6990 may be faster after all, but there definitely won't be the 2.5x speed-up we saw going from the 4870x2 to the 5970. Not to mention the 2x 8-pin PCIe connectors on the 6990, which mean higher power consumption than the 5970.

                          • 10 GPUs: fails; 8 GPUs: works
                            Meteorhead

I do not know how a 6970 can be slower than a 5870. It must be some issue with the 4-way VLIW packing, or some unfortunate constellation that makes these codes run slower on 4-way VLIW. The 6970 has more raw power in every single aspect.

If I had the money, I'd definitely try the 6990, but most likely the 7xxx cards will dominate, since they will be on a smaller 32 nm process. I definitely like the cooling system of the 6990; it is much better suited to multi-GPU installations.

                              • 10 GPUs: fails; 8 GPUs: works
                                mrbpix

Meteorhead: you misunderstood me; unscrewed cards are not a problem precisely because the motherboard lies horizontal, so the cards are vertical and put no stress on the slots.

diepchess: actually, an 890FXA-GD7 is designed to accommodate up to 4 double-width cards. But, as others said, I use flexible PCIe extenders, which allow me to connect up to 6 cards (I have only tried 5, though).

                                  • 10 GPUs: fails; 8 GPUs: works
                                    Meteorhead

I do understand that the cards are upright; I have a 3U housing with standing cards in it too. It is just that the cards seem awfully heavy (not quite as heavy as the 4870X2, which is almost like a brick), and I do not feel comfortable letting the PCIe slots hold them, because they extend far beyond the slot (not to mention that the support under the card is not in the middle).

Are you saying it is safe in the long run to leave them unscrewed and put something between them to spread them apart a little?

                                      • 10 GPUs: fails; 8 GPUs: works
                                        mrbpix

They cannot tilt sideways because they make physical contact with each other (or, more precisely, with the plastic spacers I use to create gaps between them).

And they cannot tilt along the other axis (even though they extend far beyond the end of the PCIe slot), because an x16 slot applies enough mechanical force to prevent tilting in that direction.

So yes, it is safe not to screw them in.

                                      • 10 GPUs: fails; 8 GPUs: works
                                        diepchess

                                         

                                        Originally posted by: mrbpix Meteorhead: you misunderstood me, unscrewed cards are not a problem precisely because the motherboard lays horizontal, so the cards are vertical and put no stress on the slots. diepchess: actually a 890FXA-GD7 is designed to accommodate up to 4 double-width cards. But, like others said, I use flexible PCIe extenders which allow me to connect up to 6 cards (I only tried 5 though).


Thanks for answering! The riser cards are indeed a clever idea, assuming your software doesn't need much PCIe bandwidth.

Mind sharing a few photos of how it looks with so much weight on the mainboard? Or did you come up with a solution to take the weight off the mainboard, or does the case you use provide one?

                                         

                                        Regards,

                                        Vincent

                                    • 10 GPUs: fails; 8 GPUs: works
                                      diepchess

                                       

                                      Originally posted by: empty_knapsack
                                      Originally posted by: mrbpix

                                       

                                      I will certainly try 4 x 6990 myself when they come out: http://www.fudzilla.com/graphics/item/21635-amds-antilles-card-pixellized


                                      My tests with 6970 ends very disappointing -- 6970 being slower than 5870 for MD4/MD5/SHA1. If 6990 will be designed in the same way as 5970 was (i.e. was downclocked 5870x2 and now will be downclocked 6970x2) it'll be just money waste. Perhaps it'll be possible to tune code for 69xx family a bit and so 6990 will be faster after all but there definitely won't be 2.5x speed-up as it happens with 4870x2 -> 5970. Not talking about 2x8-pin PCI-E for 6990 == higher power consumption than 5970.


It would be most interesting to know the reason why, and when it gets fixed.

Now of course MD* and SHA* are somewhat dated hashing algorithms, since you cannot easily speed up the time a single hash takes.

I have already proposed writing down requirements for a new cryptographic hashing algorithm that can more easily exploit wider vectors and/or many-core parallelism.

The cryptographers can get to work!


                                • 8 GPUs work, 10 GPUs fail -> What's wrong?
                                  diepchess

                                   

Originally posted by: mrbpix As I have previously documented, 8 GPUs (4 dual-GPU HD 5970 cards) work perfectly fine under Linux with CAL IL. I decided to try 10 GPUs, but the X11 fglrx driver segfaults in xdl_x750_atiddxDisplayScreenDestroy():

                                   

                                  http://blog.zorinaq.com/?e=46

                                   

                                  AMD: please fix your drivers :-)


Hi, the motherboard you mention on that blog is the MSI 890FXA-GD7.

It has 5 PCIe slots. I don't see how you can fit more than 3 dual-slot GPUs in that board; actually, looking closely, even more than 1 seems like a squeeze. The 5970 is about 33 cm long, and very few boards can handle that.

                                   

                                  How did you end up with 10?

                                  Mind sharing a photo?

                                  Regards,

                                  Vincent