I just bought a Threadripper 3970X together with an ASRock TRX40 Taichi motherboard. This system is my upgrade from a Threadripper 1950X, which has served me very well. I run Fedora 31 on both systems.
While trying to get GPU passthrough working on the new system, I received the following Linux kernel messages: (image attached)
feb. 15 22:26:48 oddstr kernel: mce: [Hardware Error]: Machine check events logged
feb. 15 22:26:48 oddstr kernel: [Hardware Error]: Deferred error, no action required.
feb. 15 22:26:48 oddstr kernel: [Hardware Error]: CPU:2 (17:31:0) MC22_STATUS[-|-|MiscV|-|-|-|SyndV|Deferred|-|-]: 0x982010000001010b
feb. 15 22:26:48 oddstr kernel: [Hardware Error]: IPID: 0x0000001813d17000, Syndrome: 0x000000004b00000c
feb. 15 22:26:48 oddstr kernel: [Hardware Error]: Northbridge IO Unit Ext. Error Code: 1, PCIE error.
feb. 15 22:26:48 oddstr kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: GEN
I seem unable to find information on what this actually means. Does anyone have enough insight into the MCA_XX registers to tell me whether my 3970X is broken? If not, suggestions on where I can ask are also welcome.
Best regards,
Odd Skancke
Evidently the kernel has not been updated for your hardware yet. It might take a month or two.
Windows 10 will run on it fine.
Thanks for replying
I should also mention that passthrough does not work for me on the new system, while it works perfectly on my 1950X/Fatal1ty machine. I also built my own kernel, v5.5.0 rc3, to see if that made any difference. I know that other people have made passthrough work, but those people were using a different motherboard.
My main concern was that this hardware error might indicate a more serious issue than missing software support, but I suppose not.
Do you, or anyone else here, know what that hardware error is saying?
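As a rough aid for reading the MC22_STATUS value in the log above, here is a minimal Python sketch that splits the raw register value into its flag bits and code fields. The bit positions are taken from the Linux kernel's MCI_STATUS_* definitions for AMD processors and should be treated as an assumption to verify against your kernel sources and AMD's documentation; the function name decode_mca_status is made up for this example. It only restates what the kernel already printed and does not say anything about the root cause.

```python
# Bit positions assumed from the Linux kernel's MCI_STATUS_* definitions
# for AMD (verify against arch/x86/include/asm/mce.h for your kernel).
STATUS_BITS = {
    "Val": 63, "Over": 62, "UC": 61, "En": 60,
    "MiscV": 59, "AddrV": 58, "PCC": 57,
    "TCC": 55, "SyndV": 53,
    "Deferred": 44, "Poison": 43, "Scrub": 40,
}


def decode_mca_status(status):
    """Split a raw MCA status value into set flags and code fields."""
    flags = [name for name, bit in STATUS_BITS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        # Extended error code: bits 21:16 (assumed 6-bit mask, as on SMCA systems).
        "ext_error_code": (status >> 16) & 0x3F,
        # Architectural error code: low 16 bits.
        "error_code": status & 0xFFFF,
    }


# Value copied from the MC22_STATUS line in the log.
decoded = decode_mca_status(0x982010000001010B)
print(decoded["flags"])
print("ext_error_code:", decoded["ext_error_code"])
print("error_code: 0x%04x" % decoded["error_code"])
```

For the value in the log, the set flags include MiscV, SyndV and Deferred, with UC and PCC clear and an extended error code of 1, which agrees with the kernel's own "Deferred error, no action required" and "Ext. Error Code: 1" lines.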
If you need to use Linux, I suggest running it in a VM. I have an instance in a VM which works for development.
Linux kernel version 5.5.0 should probably have the MCE fix already, which is why your system manages to boot at least. So that shouldn't be the culprit.
In case this is AGESA (UEFI) + PCIe-related, you could try to pass the pci=noaer and/or pci=nomsi parameters to your kernel to disable some advanced PCIe feature sets and see whether that helps. This can be done either in the boot loader configuration file or interactively in the boot loader prompt by editing the kernel line. Just add the parameters at the end of that line.
At least one person has reported that pci=noaer makes it work on Zen 2 hardware, see [here]. So that's probably your best bet.
If it doesn't, you could also try to pass the parameter mce=off to your kernel to completely disable machine-check exceptions, just in case there is still some trouble with MCEs on Zen 2 when using Linux. I don't think that it would help though.
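The "add the parameters at the end of that line" step can be sketched as a small string operation. This is only an illustration: the sample kernel line "ro rhgb quiet" is hypothetical, the helper name add_kernel_params is made up, and the function just builds the new line without touching any boot loader file.

```python
def add_kernel_params(cmdline, params):
    """Return cmdline with each parameter appended once, at the end."""
    existing = cmdline.split()
    for param in params:
        if param not in existing:  # skip parameters that are already present
            existing.append(param)
    return " ".join(existing)


# Hypothetical kernel line; use the one from your own boot loader entry.
current = "ro rhgb quiet"
print(add_kernel_params(current, ["pci=noaer"]))
print(add_kernel_params(current, ["pci=noaer", "pci=nomsi"]))
```

Trying one parameter at a time makes it easier to tell which of them, if any, changes the behaviour.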
I am running Fedora 31 on a Gigabyte motherboard with a Threadripper 3970X without any problems. For installation, pass mce=off to the kernel. Everything else runs fine.