That's why I usually wait 2 years before jumping on new hardware designs - that is usually enough time for hardware errata to be patched.
I have this issue as well. Due to it and other issues, my Threadripper build has been a giant unmitigated disaster unfortunately. I was extremely excited to do this build. I had hyped up the possibilities that all the cores offered specifically with VMs, Pass-through, Compilation, etc to everyone I knew, and they'd all be anticipating seeing how it turns out. Unfortunately, I had to report to them that my build has been a complete failure due to this.
For a high-end part @ $999 and affiliated motherboard (Zenith) @ $549, this is not acceptable. If I had bought the Core i9 instead, I would not be having this issue. If I go back to my older i7, I don't have this issue. I want to give Team AMD a chance as I appreciate what you guys are doing to up the available core count, but the parts have to work.
Several things greatly concern me right now after I've looked into it further that I am looking for AMD to address:
- The Ryzen had PCI passthrough broken as well at launch, and you had to fix it with AGESA 1006. Since the Threadripper is two joined 8-core Ryzens, why didn't you learn from this? It is extremely disappointing to see this issue repeated on the higher-end more expensive part, where this functionality is most valuable. For a high end processor, this functionality should've been tested and working day 1.
- I've been reading about the Nested Page Tables (NPT) bug in kvm_amd that has been around for...... 9 years? Why? This is an absolutely critical bug; AMD engineering should be assisting with fixing this. Having to choose between terrible GPU performance and terrible CPU performance in a VM is not acceptable, and it botches the entire potential of high core count parts for these applications. See: 196409 – kvm_amd nested pagetable gpu passthrough performance oddities
- Ryzen had a critical issue with high core loads related to compilation that ended up being a hardware fault. AMD says this is a "performance marginality problem" (what?). Affected users have to get their physical processor exchanged through support. This makes me *really* nervous. One of the first things I thought when I saw Threadripper's core count was, man, that's going to be great for compiling AOSP. While the article claims Threadripper is not affected, the fact that something like this was just discovered on Ryzen makes me really nervous about anything being broken on my expensive Threadripper. See: AMD Confirms Linux 'Performance Marginality Problem' On Ryzen - Slashdot
- My expensive motherboard won't even boot with certain common PCIe cards (USB controllers, for example). The Torx screws on the socket were too short, making threading nearly impossible. The memory slots are so sensitive that it took multiple re-seats of every DIMM to get it to boot with 128GB. The boot order of my HDDs is randomly lost with a SATA card installed (bugs reported: ROG Zenith Extreme (X399, socket TR4) - info, experience, updates - Page 8 ). This has been the hardest and most frustrating build I've done in the last 20 years.
None of these issues are problems on Team Intel. I want to give Team AMD a chance again... I really do. The Threadripper has amazing potential if the issues are fixed, and makes the price much more affordable. But again, the features required to utilize them have to work. Like the original poster, if I don't see both recognition and traction from AMD on these issues soon, then I will be RMA'ing as well at the end of the window and going with the Core i9. I hope that AMD takes these issues seriously and starts to get fixes in place. Especially because, since you're using the same architecture for Epyc, those processors are going to have a hard time in the server market with things like this broken.
With that said, on to information about this issue. Here is what I've collected from my motherboard. Please pass on to engineering:
- OS: Linux Mint 18.2 Cinnamon Edition (Ubuntu 16.04)
- CPU: AMD Ryzen Threadripper 1950X
- MB: Asus Zenith Extreme (BIOS 0503)
- MEM: 8x16GB (128GB) Crucial Ballistix BLS4K16G4D240FSC
- GPU: 2x EVGA 1080Ti FTW3 Hybrid
- Other: Inateck USB PCIe (KTU3FR-5O2U, disabled due to BIOS bugs), Supermicro AOC-SAS2LP-MV8 8-port SATA, Asus 10GEth PCIe (Bundled w/ Zenith)
Pass-through GPU stuck in D3 state, no output. VM hangs. Let me know if there is any more information I can provide.
Versions tried (same result on all of them):
- Kernel 4.8, 4.10, 4.12, 4.13rc6
- QEMU 2.5, Libvirt 2.5
- QEMU 2.9, Libvirt 2.5
- QEMU 2.10rc4, Libvirt 3.6.0
- lspci -tv: https://pastebin.com/u1N46P5G
- lspci -vv GPUs: https://pastebin.com/qvpbDV4p
- lspci -vvv (Full): https://pastebin.com/hkZXzeAY
- IOMMU Groups: https://pastebin.com/RhXM9uPt
- KVM/Libvirtd Config: https://pastebin.com/EvJGe5BM
- KVM Log: https://pastebin.com/sb21r4EH
- VFIO DMESG (on VM start): https://pastebin.com/HqUyAT1y
- /proc/interrupts: https://pastebin.com/G8pzScxZ
You can also add USB controllers with Fresco Logic chipsets to the "not compatible" list. System will not POST with it installed (chipset init failed).
I have the exact same problem.
If I use pci=nommconf in grub, the iommu groups are all bunched up together.
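For anyone who wants to compare their IOMMU grouping before and after a kernel parameter change such as pci=nommconf, here is a minimal Python sketch. It assumes the standard Linux sysfs layout under /sys/kernel/iommu_groups; the function name iommu_groups is just a hypothetical helper and not part of any tool mentioned in this thread.

```python
import os


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to the sorted PCI addresses it contains.

    Assumes the usual sysfs layout: <root>/<group>/devices/<pci-address>.
    Returns an empty dict when the directory is missing (IOMMU disabled).
    """
    groups = {}
    if not os.path.isdir(root):
        return groups
    for name in os.listdir(root):
        devices_dir = os.path.join(root, name, "devices")
        # Group directories are named with plain integers.
        if name.isdigit() and os.path.isdir(devices_dir):
            groups[int(name)] = sorted(os.listdir(devices_dir))
    return groups


if __name__ == "__main__":
    # Print one line per group, lowest group number first.
    for group, devices in sorted(iommu_groups().items()):
        print(f"group {group}: {' '.join(devices)}")
```

Run it once per boot configuration and diff the two outputs. If the GPU ends up sharing a group with unrelated devices, the groups have been bunched together.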
Motherboard is an ASRock X399 Fatal1ty Professional Gaming (weird name for a workstation board!).
Yep, that's what my Inateck USB PCIe is. I reported that to Asus on their forums and another user confirmed the issue there as well.
I reported the issue to Asrock last week, but have not heard back from them yet.
So far am in the same boat as you, I need IOMMU and passthrough to work otherwise I have to RMA the whole build :-/
Edit: Asrock replied, usual test all the slots, test in another machine etc.
I find the lack of official replies overwhelmingly disappointing. To be honest, time is running out on my 30 day open return thing.
Agreed. I filed a support ticket pointing to this (and the other) threads - hopefully that will draw some attention. I want to like the Threadripper and really appreciate what AMD is trying to do by getting higher core counts more affordable, but the parts gotta work.
Got any reply to your support ticket?
Nope, complete silence.... *sigh*. Same from Asus. No reply or acknowledgement to my official support ticket, and they've ignored my forum posts as well. It's sad to see that neither of them seems to know the meaning of customer service.
wait for a BIOS update or 5, bound to be a few zillion problems
Problem is, on the consumer side this is fairly niche functionality that has historically been a buggy, ignored mess by manufacturers. A TR/X399 based build is commanding at least $1500, and likely closer to $2,000-2,500+ unless you already have RAM, case, PSU, GPU etc. This isn't "works with quirks" or "known issues". This is totally flat out broken with no known workaround and not a single confirmed success.
The RMA window is only so long, and we need to know that AMD will take fixing this seriously. The track record so far doesn't look great. The Intel side of the fence, while more expensive, doesn't have issues with this. If they are going to piddle around, we'd be better off getting a refund and buying the Core i9. However, there are appealing aspects of the Threadripper - higher clocks, more PCIe lanes, lower price. Ideally, we'd all rather just see this fixed.
AMD could build a lot of goodwill for themselves just by engaging better with the community. Even if they have other issues they have to fix first, acknowledging the issue and confirming it's in the queue for a fix would help. But, instead, not only has this thread gone ignored, but so has my actual support ticket. Worse, my motherboard manufacturer has done the same on both counts. I didn't pay $1500 for the platform to be unusable AND get ignored by support.
If the way you use your computer depends on this functionality, then the entire machine is worthless right now. In my case, it's a gaming, server, workstation powerhouse. I game, I run a media server, I do multi-platform software development, I host my own email and a lot of dev tools, I have home automation run through it..... As a result, I have a handful of VMs that have to run all the time using different physical resources. I came from X99 for more cores, as I was pretty CPU bottlenecked, and I tried Threadripper for more PCIe lanes / higher clocks. Half of my services can't run on the TR right now with this broken. My old build is better, as it worked. So basically, until this gets fixed I completely wasted $1500, and if it's not going to get fixed I need to know so I can start the RMA/refund before it's too late. We all know there are always issues with new chipsets. The thing that makes a big difference is how the vendor handles it, and in AMD's case it's been handled poorly - being ignored at this price point is not acceptable.
The high-end is a thin market but they also provide a whack of useful information for the mainstream market as prices rot
Almost every machine I have owned has had several BIOS updates over the life of the machine. Sometimes it takes a year or two to get caught up or to otherwise figure it out
Thing is though, the competition got it figured out. Out the gate.
My main beef right now is not that there are bugs, but that I can't get any response from AMD or ASUS, official support ticket, forum, or otherwise. Customer service makes a big difference in these situations, and too many companies have crappy support these days.
I've been building custom rigs for some 20 years now and I've never had a disaster of *this* magnitude. But I have had my share of buggy builds. I built the rather ill-fated eVGA Z77 FTW in 2012 on launch day. But - here's a key difference: I, as an early adopter, discovered a critical bug in one of the on-board SATA controllers where it would lock up and detach the drives under steady load. I reported it on the forums with replication instructions, and within a few hours, Jacob @ eVGA acknowledged the issue. About a day or so later he posted an update saying that engineering was able to replicate the problem with my instructions and they were working on a fix. Several days after that, a BIOS update was released that fixed the issue. The key thing here was, they were very communicative throughout the whole process, and kept the affected customers in the loop.
*That* is doing the best you can with customer service. What's fueling the fire for me here is:
1) Official support tickets filed to Asus about PCI issues. No response in almost 2 weeks
2) Forum post bug reports to Asus in their thread tracking them. Asus posts before and after my posts responding about other issues, completely ignores mine. I post again asking if they could confirm they received the information, and they continued to post about other issues still ignoring it.
3) Filed official support ticket with AMD. No response thus far.
4) No acknowledgement on this post, when AMD staff acknowledge other threads in these forums regularly.
5) Tweeted @AMDRyzen asking if they could get someone to look into this, no response.
Clearly there are customer service issues here, and it's all the more worrisome when it's for lesser used functionality. What's it take to get a response? Negative reviews? Bad press? Heck, I glance at negative reviews on my motherboard right now and Asus is posting manufacturer responses to them on both Newegg & Amazon. But they can't respond to my support ticket? It shouldn't have to be that way. Treat your customers right, and they will be understanding when issues arise. Unfortunately, that's not what's happening here.
It's easy to defend AMD on this when you're not the one with a $1500+ machine that is unusable for its intended purpose, but if you were us you'd be upset too, especially if you've experienced first hand how hard it historically is to get vendors to fix issues in this area. If you don't make noise about niche functionality, it unfortunately tends to remain forever broken; it doesn't fix itself. I don't understand the fanboy mentality sometimes. This thread is full of legitimate complaints. It's not made just to beat up on AMD.
Right, well, I may know someone who knows someone at Asus. I will see if I can get something going at least. I am sorry for your troubles; I feel your pain, trust me.
I like EVGA, they gave me a no BS UEFI BIOS upgrade for some video cards that allowed me to use secure boot etc.
So I continue to use their video cards, which have been rock solid in my experience.
EVGA even provides thermal material when a card overheats; however, only one old GTX 260 ever fried its TIM.
I use AMD processors and video cards, but Asus has been a jerk for only allowing a UEFI VBIOS to be installed on certain models of their motherboards, which is unethical and illegal.
lol you're obviously not from America. All BIOSes are allowed on Asus, and they're usually first to deliver. As for the other garbage you said, it makes no sense... troll, go away.
Nvidia, and you speak of thermals. Yup, they are hot and burn up... AMD likes heat and runs better hot; Nvidia fails when hot. Have a nice day.
I have an Asus HD 7870 and it has not been a problem with heat. The Asus cooler seems to be able to handle the load with even demanding games.
The blower fans seem to have the worst record for survival, my EVGA GTX 260 had one of them
This right here, hits the nail on the head!
It's new, so this all sounds normal here. Mmm yeah, well, hang in there; I'm sure they are working very ******* a fix during the Labor Day weekend... Sadly, I learned my lesson on Ryzen to wait 1 or 2 months before getting it. Shitty answer, but hang tough.
Thanks for your information and all of your hard work on this.
I was wanting to do a Threadripper build, I definitely want to move to Linux on that build.
I might just go Intel instead now, although I really do not want to.
AMD have completely dropped support for any more Windows 8.1 64bit GPU drivers since the AMD Crimson 17.7.2 release, 5 years before the end of Windows 8.1 64bit Extended Support, although I am not sure how well known this is. This is what AMD Support are telling me. They have not made any press releases about it that I can find.
Windows 7 64bit will be end of Extended Support soon.
That will force me to move to Windows 10 64Bit, which I consider totally unacceptable to me.
My existing Intel Motherboards do not have full driver and software support for Windows 10 64bit, for example.
All of the above, plus the lack of AMD feedback explaining AMD RX Vega / FE gaming performance versus power draw after the launch reviews, is not great.
I forgot to mention that Nvidia continue to support Windows 8.1 64bit as of yesterday.
I find Windows 8.1 64bit with the Classic Start Menu addon for the desktop is o.k. until I get on to Linux.
In Windows 8.1 64 bit I can at least have some control on what updates are installed up front (provided I can find out what they are doing) and prevent automatic installation of updates completely.
Windows 8.1 64bit has not so far turned off Norton Internet Security at start up or constantly changed my primary browser or privacy settings after updates. Adblockers seem to work still.
I would purchase the "Enterprise" version of Windows 10 64bit, but that does not seem to be an option for ordinary consumers, and anyhow I do not trust Microsoft any more for my primary OS because of the forced GWX program and hidden update to Windows 10. My parents' Windows 8.1 laptop has remained bricked since this time last year, for example, with no way back to Windows 8.1 64bit.
ASRock has released a new bios. Still no joy.
I'm not surprised. I am fairly certain the problem is in AGESA code, which means the fix would have to come from AMD. I think that means the soonest we could see a fix is 9/25, when they roll out the update that includes NVMe RAID. AMD_Robert has acknowledged this issue on Reddit and confirmed someone is actively working to fix it, but said that they don't have anything to share as far as a timeline.
My RMA window is up next Friday, and 9/25 is also when the 7980XE comes out. I am debating what to do. If I were to switch to Intel, I'd rather get the 16 or 18 core model than the 10, meaning I'd need to wait until 9/25.... but if I do that, I'm past my RMA window as well. If I RMA and then switch, I'm not sure I want to put everything back on my old board for a week and then do it all again either. But if I wait longer than the window and it's not fixed, then I'd have to deal with reselling this and the Zenith, which is a pain.
I'm in the exact same boat:
1) Wait it out and hope for the best, with a significant probability of the issues not being fixed and of being stuck with an unusable CPU/platform (for my use cases) and a financial loss.
2) RMA and go for the competition's 16-core and pay the "Intel tax".
Option 2 is less of a financial risk in my case, even including the extra labour and upfront costs it incurs.
Too bad, I was excited about the reduced cost of 16 cores.
Just set up a new Gentoo box with a 1950X and a Gigabyte Designare EX motherboard, and I think I'm hitting the same issue as the one mentioned. Just want to check, what's the progress from AMD?
Any fix or update?
As KVM uses VFIO-PCI to ask the kernel to flip the bit that resets the secondary bus on the PCIe bridge controller responsible for the GPU, an unintended side effect occurs in firmware that leaves some of the hardware bridge IP block registers out of sync with the values exposed in the 4k PCI config space for the bridge. This affects at the very least registers 0x19 and 0x1a (controlling secondary and subordinate bus IDs), but is not limited to those registers. This reset is used during boot of the VM and, although it can be skipped, is important for the correct functioning of the VM during shutdown/reboot.
When using lspci -v this will cause the card to show up with "rev FF" and "unknown header type 7f".
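To check for the same symptom without lspci, the config space header can be read directly. The following is only a rough Python sketch: it assumes the standard Linux sysfs file /sys/bus/pci/devices/<address>/config and the standard PCI header offsets (revision at 0x08, header type at 0x0e, secondary and subordinate bus numbers at 0x19 and 0x1a on a bridge). The helper names read_config and describe_header are hypothetical. Note that it only shows what config space reports; it cannot see the internal bridge registers that have gone out of sync.

```python
def read_config(address, sysfs_root="/sys/bus/pci/devices"):
    """Read the first 64 bytes of PCI config space for e.g. '0000:0a:00.0'."""
    with open(f"{sysfs_root}/{address}/config", "rb") as handle:
        return handle.read(64)


def describe_header(config):
    """Summarise the header fields relevant to the 'rev ff' symptom."""
    if len(config) < 0x1B:
        raise ValueError("need at least the first 0x1b bytes of config space")
    header_type = config[0x0E] & 0x7F  # top bit is the multi-function flag
    info = {
        "revision": config[0x08],
        "header_type": header_type,
        # A device that has dropped off the bus reads back as all 0xff,
        # which is what lspci reports as "rev ff" / "header type 7f".
        "all_ones": all(byte == 0xFF for byte in config[:0x40]),
    }
    if header_type == 0x01:  # type 1 header: PCI-to-PCI bridge
        info["secondary_bus"] = config[0x19]
        info["subordinate_bus"] = config[0x1A]
    return info


if __name__ == "__main__":
    # Synthetic example: what a device in the broken state would look like.
    print(describe_header(bytes([0xFF]) * 64))
```

On a real system, call describe_header(read_config(addr)) for both the GPU and its upstream bridge, before and after starting the VM; reading config space may require root for anything beyond the first 64 bytes.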
The issue seems like it should be fixable in firmware but there are also several temporary software workarounds available, please see this thread on Reddit for more info:
If you have any way to pass the first paragraph and the link on to the AMD Threadripper firmware team, please do so, as it has been very difficult to get any feedback on the specifics of this issue.