27 Replies Latest reply on Aug 28, 2018 9:39 PM by proper

    Threadripper 1950X Max number of GPUs

    proper

      I purchased a Threadripper hoping to use its massive computing power to run Docker-driven AI applications, with each container having its own GPU.

      My plan was to run 10 or more GPUs at x1 using PCIe splitters. Unfortunately, none of the X399 motherboards I tried runs more than 4 in a stable manner.

      I tested the following motherboards. All had the same issue: when a 5th GPU is added, they do not POST.

      MSI

      GIGABYTE

      ASUS

       

      Just to clarify, all hardware (PSUs, GPUs, splitters, RAM) has been tested outside of this build, and I am able to use all 10 GPUs on another machine.

       

      Is there a limit on the number of GPUs on the X399 platform? Can it be reconfigured?

        • Re: Threadripper 1950X Max number of GPUs
          misterj

          proper, please take a look at this post on the ASRock forum of a user trying to get 8 GPUs to work but only getting 7.  My conclusion is that a BIOS/UEFI update is required to allow a larger memory space.  Unfortunately including a link seems to significantly lengthen the time to get my response posted while it is "moderated".  Hope this helps.  ASRock X370 Pro - Can't POST with 8 GPUs - ASRock Forums - Page 1

          Enjoy, John.  

            • Re: Threadripper 1950X Max number of GPUs
              proper

              So far I have found that there is an address space allocation limit for PCIe imposed by 32-bit addressing: it is limited to 4 GB, and exceeding that causes problems.

              There should be a setting for Above 4G decoding, which raises the 4 GB limit. I see this setting on other boards but am not sure if the ASUS X399 has it. Once the system is done for the day I will check and post an update.

               

              It would be great to get someone from AMD to clarify if 10+ GPUs is something that is possible with their architecture, so I am not beating my head on the wall.
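A rough sketch of the arithmetic behind the 4 GB limit discussed above, in Python. It packs memory BARs into a fixed address window, largest first and naturally aligned, and counts how many identical devices fit. The window bounds and the per-GPU BAR sizes (`GPU_BARS`) are assumptions chosen for illustration, not values measured on this system, and real firmware allocators are more involved.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

# Hypothetical memory BAR sizes requested by one GPU (illustrative only).
GPU_BARS = [16 * MIB, 256 * MIB, 32 * MIB]


def align_up(addr, size):
    """Round addr up to a multiple of size (size must be a power of two)."""
    return (addr + size - 1) & ~(size - 1)


def fits(window_start, window_end, bar_sizes):
    """Pack BARs largest-first, each naturally aligned; True if all fit."""
    cursor = window_start
    for size in sorted(bar_sizes, reverse=True):
        cursor = align_up(cursor, size)
        if cursor + size > window_end:
            return False
        cursor += size
    return True


def max_devices(window_start, window_end, bars_per_device, limit=64):
    """Largest device count (up to limit) whose BARs all fit in the window."""
    count = 0
    while count < limit and fits(
        window_start, window_end, bars_per_device * (count + 1)
    ):
        count += 1
    return count


# Assumed 2 GiB window below the 4 GiB boundary vs. an assumed 64 GiB window above it.
print("below 4 GiB:", max_devices(2 * GIB, 4 * GIB, GPU_BARS))
print("above 4 GiB:", max_devices(4 * GIB, 68 * GIB, GPU_BARS))
```

The point is only that a window capped below 4 GiB runs out after a handful of devices with large BARs, while a 64-bit window does not. Onboard devices that also claim space in the low window would reduce the count further.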

                • Re: Threadripper 1950X Max number of GPUs
                  misterj

                  proper, I cannot answer your AMD question, but I strongly suspect that if you can get the needed UEFI update, the X399 will support many more than 10 GPUs.  The real advantage with X399 is the 64 PCIe lanes.  I suggest you spend some time looking at the forum named bitcoin; you may find a mining rig based on X399.  Also, please look at Newegg: there is a surprising number of MBs specifically for mining (mostly Intel, sadly).  I found one MB via Google (Intel, mining only, ASUS H370) that supports 20 GPUs.  Mining MB names tend to start with H.  I suggest you open a Support Ticket with AMD and all your MB vendors asking your questions.  Be sure to ask AMD what AGESA version (latest I have seen is AGESA!V9.ThreadRipperPI-SP3r2-1.0.0.6) you need to support a >4 GB PCIe memory map.  Ask all the MB vendors if or when they will support a large PCIe memory map.  Please let us hear what you learn.  Thanks and enjoy, John.

                    • Re: Threadripper 1950X Max number of GPUs
                      proper

                      This is most likely a BIOS issue, and hopefully an update can resolve it.

                       

                      As I have mentioned many times before, I am not mining. My use case is completely different and has to do with AI inference with complex models.

                      Mining boards do not support 100 GB of RAM and 32-thread CPUs, and that's what I need. On paper, Threadripper on X399 is the perfect platform to do what I want, but as you can see it does not deliver, not yet at least.

                       

                      I have an AMD ticket open, and someone from AMD reached out to me on Twitter and offered to help, something I really appreciate.

                       

                      I will post an update once I have more info.

                        • Re: Threadripper 1950X Max number of GPUs
                          misterj

                          Thanks, proper.  I knew you were not mining.  I just knew that miners were in need of many GPUs.  I also suspect that they do not need as much CPU power as you do, so maybe there is not much interest in TR.  The first link I posted was an X370 running 7 GPUs.  The user was limited by the UEFI (4 GB) and waiting for an update.  I would again suggest you ask your MB vendors about the UEFI update.  It would surprise me at this date if AMD did not support it, but the vendors must include the AMD update (AGESA).  It is also possible that AMD does not get involved and only the MB vendors need to release the update.  I would suspect your system should support at least 7 GPUs.  Please describe your system in more detail and maybe we can get 7 to work.  Be sure to tell me how your whole system is powered and the size of the PSUs.  What POST code do you see when it fails?  Is 4 GPUs stable?  Thanks and enjoy, John.

                           

                          EDIT: UEFI setting from the first link I posted:

                          Above 4GB.jpg

                          I do not have this setting in my TR UEFI and this user's (X370) does not seem to work, but we can at least see what it should look like.

                            • Re: Threadripper 1950X Max number of GPUs
                              proper

                              I have a ticket open with Asus and it is slowly getting escalated.

                               

                              To get to 7 GPUs I plug in three GTX 690s (two GPUs each) and one 980 Ti. There is roughly a 25% chance it does not boot, but it will generally boot with 7 GPUs. It also takes forever to POST.

                               

                              Requests to one GPU will usually throw an error in the OS (Ubuntu Server 16.04): "PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=400b"

                               

                              This runs off a 1600W Platinum-rated PSU. Idle power is 280W and the max I have seen is 450W. Again, we are talking about getting it to boot, not even running it.  When I run 7 GPUs on other systems, power consumption is below 800W under load with my application.

                              RAM is 32 GB (4x8 GB, 2400 MHz), new and tested in another system. All components (GPUs, PCIe risers, RAM) have been tested in the system that is currently running.
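The corrected-error line quoted above can be decoded a little further. The sketch below is a minimal Python parser for that message format; it assumes the `id` field packs the PCI bus number in the high byte and device/function in the low byte, as a PCI requester ID does. The function name `parse_aer_line` is made up for illustration.

```python
import re

# Matches the part of the kernel message quoted in the post.
AER_LINE = re.compile(
    r"PCIe Bus Error: severity=(?P<severity>[^,]+), "
    r"type=(?P<layer>[^,]+), id=(?P<id>[0-9a-fA-F]{4})"
)


def parse_aer_line(line):
    """Return severity, layer and bus:device.function, or None if no match."""
    match = AER_LINE.search(line)
    if match is None:
        return None
    ident = int(match.group("id"), 16)
    bus = ident >> 8               # assumed: high byte is the bus number
    device = (ident >> 3) & 0x1F   # assumed: next 5 bits are the device
    function = ident & 0x7         # assumed: low 3 bits are the function
    return {
        "severity": match.group("severity"),
        "layer": match.group("layer"),
        "bdf": f"{bus:02x}:{device:02x}.{function:x}",
    }


line = "PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=400b"
print(parse_aer_line(line))
```

Under that assumption, the resulting bus:device.function can be compared against `lspci` output to see which port or riser the link errors come from.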

                                • Re: Threadripper 1950X Max number of GPUs
                                  misterj

                                  Thanks, proper.  I assume you are aware that TR is "officially supported" only under W10 versions higher than 1703.  Has AMD talked about using Linux?  Is there any way you can experiment under W10?  I assume you are using a 64 bit OS?  What POST code do you get when POST fails?  Hard to believe your power consumption is so low.  How are you measuring it?  Do any of your MBs have the "Above 4GB MMIO" settings?  Thanks and enjoy, John.

                                    • Re: Threadripper 1950X Max number of GPUs
                                      proper

                                      I am using 64-bit Ubuntu Server 16.04, updated to the latest packages. The main reason is that the stable CUDA drivers I need are not yet available for 17 or 18.

                                       

                                      My main issue is getting this thing to POST with the GPUs. Once I get the GPUs to actually work on the board, I will look at the performance and consider what options I have to get it to run better.

                                       

                                      When this board fails it tends to cycle, meaning it keeps rebooting: it runs through codes, reboots, and starts over again. Sometimes it does stop, and the codes I have seen are 27, 94, and 95.

                                       

                                      I currently have the Asus X399 and it does not have the setting, but as you pointed out, people have posted that it does not help increase the number of GPUs a board can run. There are threads about similar issues on Ryzen chips.

                                       

                                      I am not running ongoing compute calculations. GPUs get datasets they need to process via a specific model, and once they are done they wait for the next dataset. I have a system that distributes incoming requests across the entire array of available units, but individual GPUs are not sitting 100% loaded. Power consumption is measured at the outlet where the PSU is connected.

                                        • Re: Threadripper 1950X Max number of GPUs
                                          misterj

                                          Thanks, proper.  All your codes seem to be associated with the PCIe.  Here is a DL link to the Aptio 5.x Status Codes PDF:  https://ami.com/ami_downloads/Aptio_V_Status_Codes.pdf

                                          I know little about Linux, but would really like to hear how you make out, so please update the thread.  Thanks and enjoy, John.

                                            • Re: Threadripper 1950X Max number of GPUs
                                              proper

                                              Thanks for looking into this with me, Misterj. I am sure I will make it out one way or another; I just hope that the TR platform makes it out with me =)

                                               

                                              I am still genuinely excited about this CPU and what it can do, and hope that both motherboard manufacturers and AMD realize features like this are important to consumers who buy their expensive chips and boards.

                                               

                                              So far everyone is telling me to drop AMD and get an Intel workstation board with dual CPUs, which would do the job, but I am going to give it some time. I am waiting on responses from AMD and ASUS.

                                               

                                              ASUS has been very poor at handling this. Over the course of a week I exchanged around 10 emails and had 3 calls with them. They said nothing useful, repeated the same things, and asked me to fill out the same form twice. They promised escalation to the technical team multiple times and then sent more emails asking for the same information or suggesting I rotate the graphics cards. The support staff is in the Philippines, and they seem to be doing everything they can to stop this issue from reaching people in the US who would be able to give a solid response.

                                                • Re: Threadripper 1950X Max number of GPUs
                                                  misterj

                                                  You are very welcome, proper.  I am very interested in this but can think of no excuse to make the investment in the HW to play.  I think it is good that you will follow this out with AMD.  Do you still have the other boards or just the ASUS?  Do you know how much MMIO memory is required by your video cards?  I will keep an eye out.  Thanks and enjoy, John.

                                                    • Re: Threadripper 1950X Max number of GPUs
                                                      proper

                                                      I did find some interesting info about PCIe and how boards use it. It seems that the main issue with the number of GPUs supported is how motherboards assign addresses to devices. There is limited space available for device addresses, and most motherboards seem to let onboard devices (sound controller, network controller, USB controller, among others) get their addresses assigned before PCIe devices get their turn. This means that PCIe devices get addresses in whatever space is left in the "buffer", and it is clearly not enough for more than 7 devices.

                                                      To test this, I disabled the network controller and sound controller and was able to POST with 9 GPUs. I think this confirms that it's an address allocation problem.

                                                       

                                                      Quality boards manage address allocation for onboard devices differently to make more address space available for user connected devices.

                                                        • Re: Threadripper 1950X Max number of GPUs
                                                          misterj

                                                          Thanks, proper.  Here is the screenshot from the 8 GPU thread I pointed to above.  W10 enumerates the exact addresses used so you can see all the junk given space.  Over half the space is not GPUs.  I suspect Linux has similar information.  The bottom line is all 4GB is used in the system with only 7 GPUs.  This is why the "Above 4GB" UEFI setting needs to exist, be enabled and WORK!  This, of course, requires a UEFI change.  Have you heard from AMD on the "Above 4GB" setting?  I could be wrong, but I suspect the MMIO area allocation is an OS function.  Perhaps you know a Linux programmer that can get rid of more of the junk for you.  Thanks for posting, please continue so I can keep abreast - love learning.  Enjoy, John.

                                                          PCIeMMIO.jpg
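On the Linux side, similar information is available in `/proc/iomem`. The Python sketch below sums the top-level "PCI Bus" windows in that format and splits them at the 4 GiB boundary. It runs on a small made-up sample (`SAMPLE` is not taken from this system); on a real machine the text would come from reading `/proc/iomem`, and on recent kernels the addresses may read as zeros without root.

```python
import re

FOUR_GIB = 1 << 32
IOMEM_LINE = re.compile(
    r"^(?P<indent>\s*)(?P<start>[0-9a-f]+)-(?P<end>[0-9a-f]+) : (?P<name>.*)$"
)


def pci_window_bytes(iomem_text):
    """Return (bytes below 4 GiB, bytes above 4 GiB) of top-level PCI windows."""
    below = above = 0
    for raw in iomem_text.splitlines():
        match = IOMEM_LINE.match(raw)
        # Skip nested (indented) entries so child ranges are not counted twice.
        if not match or match.group("indent"):
            continue
        if not match.group("name").startswith("PCI Bus"):
            continue
        start = int(match.group("start"), 16)
        end = int(match.group("end"), 16) + 1  # listed end address is inclusive
        below += min(end, FOUR_GIB) - min(start, FOUR_GIB)
        above += max(end, FOUR_GIB) - max(start, FOUR_GIB)
    return below, above


# Made-up sample in /proc/iomem format, for illustration only.
SAMPLE = "\n".join([
    "00100000-7fffffff : System RAM",
    "e0000000-fec2ffff : PCI Bus 0000:00",
    "  e0000000-efffffff : PCI Bus 0000:01",
    "100000000-87fffffff : System RAM",
    "10000000000-103ffffffff : PCI Bus 0000:40",
])

low, high = pci_window_bytes(SAMPLE)
print(f"PCI windows below 4 GiB: {low / 2**20:.0f} MiB, above: {high / 2**30:.0f} GiB")
```

If nothing shows up above 4 GiB, the firmware has placed every PCI window in the 32-bit range, which is the situation an "Above 4GB" setting is meant to change.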

                                                            • Re: Threadripper 1950X Max number of GPUs
                                                              proper

                                                              Keep in mind that this error happens before the BIOS even gets to preparing to load the OS. You may as well unplug the hard drive (with the OS); it won't make a difference, because the system won't POST and is not functional.

                                                               

                                                              I do not believe this is related to "above 4g decoding".

                                                              Above 4G decoding, the way I understand it, concerns how VGA devices are mapped in RAM.

                                                              The OS needs to allocate memory for where graphics card data will be stored, and it allocates a maximum of 4 GB under the 32-bit limit.

                                                              Enabling above 4G decoding allows the OS to use a 64-bit address map to allocate more RAM for PCIe devices.

                                                              Problems caused by the 32-bit address map would show up as PCIe devices not being detected in the OS or disappearing during use, because the system would run out of the 4 GB that was allocated for VGA. This means the system would actually work.

                                                               

                                                              However, the issue I have seems not to be related to RAM allocation limits and presents itself before the BIOS finishes loading its systems.

                                                              When the motherboard starts, it begins creating a map of the devices it has onboard so it can communicate with them. These addresses are stored in a special address "buffer" which is very tiny and is designed to hold just device address data. Many onboard devices get addresses allocated in the "buffer", with the remaining space left for peripherals, i.e. PCIe devices. That space is limited, and in the default configuration there is enough space left for only 7 addressable devices. It really does not matter what those devices are; they could be M.2 drives, network cards, or GPUs. The BIOS needs to put each device's address in the "buffer" so it can communicate with that device, and it runs out of space.

                                                              When I disable the network adapter and sound card on the motherboard, they are no longer initialized, the address space they used to occupy becomes available, and I can plug in two more GPUs.

                                                              This is my layman's explanation of what I believe the issue is. Considering the results of my tests, I think I am close to the truth. I must say a big thanks to the kind people at amfeltec for shining some light on this.

                                                               

                                                              The main problem is still that Threadripper can support a maximum of 7 PCIe devices (although there are issues at 7). It has nothing to do with GPUs: you can connect 7 network adapters and the 8th won't work.

                                                               

                                                              Motherboards that support more PCIe devices go around this issue by using larger address space or moving onboard devices to different address space, so peripherals have more address space.

                                                               

                                                              This can only be solved by a BIOS update that provides that capability. I hope the motherboard manufacturers will solve this.

                                                              ASUS support has been complete garbage. I spoke with a lot of people working on my case; they keep telling me the same thing and have not escalated this to a department that can actually respond. I am going to have to go after their marketing people on social media and send a letter to the corporate office to get this moving.

                                                                • Re: Threadripper 1950X Max number of GPUs
                                                                  misterj

                                                                  Thanks, proper.  You are absolutely correct about not posting versus not seeing the GPUs.  I forgot that aspect.  Now I suspect two different things are going on.  One concerning only the UEFI and one the OS and UEFI.  Perhaps the bitcoin forum can help.  I have a sign-on and will post a question there and see if we can get some answers.  Perhaps this explains why the "Above 4GB" does not appear to work.  It would allow more MMIO memory but not permit the UEFI to process all the cards, so the OS does not get loaded.  I'll let you hear.  Enjoy, John.

                                      • Re: Threadripper 1950X Max number of GPUs
                                        misterj

                                        proper, I stumbled on to this:

                                        BusResourcesPCI.jpg

                                        It is a utility from National Instruments, probably used to see if a customer's system will support their MXI-Express product.  This is what my system reports.  I did not see a Linux version, but you may want to take a look anyway.  If you Google MXIeBusDetect you can find a page with an explanation and DL.  I suspect these two areas are NUMA nodes 0 and 1, one for each processor chip.  Enjoy, John.

                                          • Re: Threadripper 1950X Max number of GPUs
                                            proper

                                            I have made some progress with this and got 12 GPUs working. The system also stopped crashing: if more are added, it simply does not detect them at all, which probably means the VGA buffer is out of space and I need "above 4G decoding" to keep going.

                                             

                                            I was able to get there by disabling everything I could, including the USB controller. I also found that the last PCIe port at the bottom of the board could not be used and always results in errors. I think it has issues creating a group for devices when that port is in use.

                                             

                                            There was also another setting on the board that I found clever, called "Enumerate all IOMMU in IVRS". My understanding is that this setting allows both chips on the CPU to participate in PCI address allocation; without it, only one of the two manages PCIe.

                                            This may be the reason for this issue, but I am too tired to do another reset and test that setting alone. I will run this test later.

                                             

                                            In conclusion, I think that if "above 4G decoding" becomes available I should be able to map more GPUs and get to 16 or more, but 12 is already a good number, so I will keep the system.

                                             

                                            Support from both ASUS and AMD has not been very supportive. I am a bit disappointed: two weeks went by and they could not suggest anything but buying a mining board.
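One way to look at the device groups mentioned above from Linux is to list the IOMMU groups the kernel has created. This is a minimal Python sketch, assuming the kernel exposes them under `/sys/kernel/iommu_groups` (it does when an IOMMU is active); the function name `iommu_groups` is made up for illustration. Comparing the output with the "Enumerate all IOMMU in IVRS" setting on and off would show whether it changes the grouping.

```python
from pathlib import Path


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map IOMMU group number -> sorted list of PCI addresses in that group."""
    base = Path(root)
    groups = {}
    if not base.is_dir():
        return groups  # no IOMMU active, or not a Linux system
    for group_dir in base.iterdir():
        devices_dir = group_dir / "devices"
        if group_dir.name.isdigit() and devices_dir.is_dir():
            groups[int(group_dir.name)] = sorted(
                entry.name for entry in devices_dir.iterdir()
            )
    return groups


for number, devices in sorted(iommu_groups().items()):
    print(f"group {number}: {', '.join(devices)}")
```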

                                              • Re: Threadripper 1950X Max number of GPUs
                                                misterj

                                                Thanks, Proper.  The only useful response on Bitcoin:

                                                ASUS B250 Mining Expert m/b will do 13 GPU's and another 6 P series cards for a total of 19.  They can be had for around $100 new and $50-$70 used on Ebay.

                                                I do not know what P series cards are.  I will ask the responder some questions.

                                                You are making good progress.  Enjoy, John.

                                                 

                                                EDIT: Here is the BIOS update video for the responder's Mining board - Interesting.

                                                  • Re: Threadripper 1950X Max number of GPUs
                                                    proper

                                                    For some reason, I was not able to post in this thread for a few days. It said the thread was blocked; the moderators said they never blocked it. Bugs, I guess.

                                                    Nvidia has the Quadro lineup for workstations that mostly do CAD and need specialized GPUs. Those are called Quadro P; you can look up the Quadro P4000 on Amazon.

                                                     

                                                    At this point, I am convinced that motherboard address allocation is not an issue. What happened was that all USB ports gain an address in the buffer, and if you count the expansion options for USB headers, there are over 20 address spaces.  Once those are loaded there is not much address space left. So you disable the USB controller and gain over 20 ports, but then you can't connect a keyboard, which is a bummer.

                                                    So you need to use a USB expansion card, or I will try to get it to work with only the USB 3.1 controller enabled, which is separate and probably has 6 ports total.

                                                     

                                                    Bottom line: I can get it to POST and boot with 16 GPUs, but the OS will not see past 12, and I hope that is an above 4G decoding issue. Asus has not implemented 'above 4g decoding' on this board yet, but Gigabyte and other manufacturers have. I opened another ticket with Asus to get status on that, but their support is complete garbage. When I speak with the reps they employ in the Philippines, it's clear they have layman-level experience with an actual motherboard and simply rely on guides written for them to go through the steps.  I have had a ticket open with them for 3 weeks and nothing was done. I requested a call from a manager 3 times and never heard from them. For the last week they have just been ignoring my requests. To sum up, their support is useless, and at this point I strongly suggest you stay away from Asus if your system is in any way out of the norm; they just won't help you.
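To put a number on "the OS will not see past 12", the Python sketch below counts the display-class PCI functions Linux has enumerated. It assumes the standard sysfs layout (`/sys/bus/pci/devices/<address>/class`) and treats any device whose PCI base class is 0x03 (display controller) as a GPU; the function name `display_devices` is made up for illustration.

```python
from pathlib import Path


def display_devices(root="/sys/bus/pci/devices"):
    """Return PCI addresses whose class code has base class 0x03 (display)."""
    base = Path(root)
    found = []
    if not base.is_dir():
        return found  # not a Linux system, or sysfs is not mounted
    for device in sorted(base.iterdir()):
        class_file = device / "class"
        if not class_file.is_file():
            continue
        class_code = int(class_file.read_text().strip(), 16)  # e.g. 0x030000
        if class_code >> 16 == 0x03:
            found.append(device.name)
    return found


gpus = display_devices()
print(f"{len(gpus)} display-class PCI functions visible to the OS")
for address in gpus:
    print(" ", address)
```

Running this after each added card shows whether a new GPU was enumerated by the kernel at all, separately from whether the driver can use it.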

                                                      • Re: Threadripper 1950X Max number of GPUs
                                                        misterj

                                                        Thanks, Proper.  You are making great progress!  I saw your post on the ASRock forum.  Sorry about your not being able to post - wonder if others are having the same problem.  Does your board have a PS-2 KB/Mouse port - mine does?  You could also consider an RDC (Remote Desktop Connection).  I have seen several complaining about ASUS support.  I had some support trouble with one company and wrote the CEO.  It was fixed really quickly!  The post on ASRock's forum about AGESAs above 1.0.0.2 supporting Above 4GB has no meaning because the various groups of processors have different number schemes and, of course, just having a supporting AGESA means nothing unless the MB vendor supports it.  I appreciate your keeping us up to date and I will check out the P series.  Thanks and enjoy, John.

                                                         

                                                        EDIT:  Please take a look at the Bitcoin thread I pointed to above.  It has a couple more entries.

                                                        • Re: Threadripper 1950X Max number of GPUs
                                                          misterj

                                                          Proper, please take a look at my Bitcoin post.  There are some new posts.  Enjoy, John.

                                                          • Re: Threadripper 1950X Max number of GPUs
                                                            manylines

                                                            I think 12-13 GPUs is a Windows limit. Which ASUS board are you using?

                                                              • Re: Threadripper 1950X Max number of GPUs
                                                                misterj

                                                                manylines, I think that limit is caused by the memory allocated (32 bit, I think) to MMIO (Memory Mapped IO).  Newer boards now have a UEFI setting (something like 'Above 4GB') allowing MMIO to go as far as 64 bits.  I think some block chain miners have rigs with more than 20 GPUs.  The Bitcoin forum discusses these machines.  Enjoy, John.

                                                                • Re: Threadripper 1950X Max number of GPUs
                                                                  proper

                                                                  As I posted, I am using Ubuntu, not Windows.

                                                                   

                                                                  I was able to get 14 and even 15 GPUs working on the Threadripper platform, but it is very unstable: rebooting the machine can cause problems and you end up spending an hour playing with it to get it to work again, which is not acceptable. The best behavior I had was with 12 GPUs, and even then it was a bit shady; sometimes it would reboot multiple times before it POSTed and booted up. The issues come from the BIOS, and none of the motherboard manufacturers cared about my issues when I reached out to find solutions.

                                                                  Threadripper is a workstation CPU, yet there are no workstation-grade motherboards for this platform; all we have are gaming boards with a bunch of lights and a lack of any support. As enthusiastic as I was, I decided to skip the Threadripper platform, and the current CPU and board will be used for hosting some VMs. I will update on how that works out.

                                                                   

                                                                  I went with quality Intel workstation boards and comparable 12-core, 24-thread CPUs. They were more expensive and they handle the same 12 GPUs each, but they are just rock solid: you turn them on and they just work, every time, with no fiddling with settings, no disabling USB controllers, and no rolling the dice.