[KERN] Sep 10 14:00:38 bahamut kernel: mce: [Hardware Error]: Machine check events logged
[KERN] Sep 10 14:00:38 bahamut kernel: [Hardware Error]: Corrected error, no action required.
[KERN] Sep 10 14:00:38 bahamut kernel: [Hardware Error]: CPU:13 (17:1:1) MC1_STATUS[-|CE|MiscV|-|-|-|-|
[KERN] Sep 10 14:00:38 bahamut kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000004a000000
[KERN] Sep 10 14:00:38 bahamut kernel: [Hardware Error]: Instruction Fetch Unit Extended Error Code: 11
[KERN] Sep 10 14:00:38 bahamut kernel: [Hardware Error]: Instruction Fetch Unit Error: L2 BTB multi-match error.
[KERN] Sep 10 14:00:38 bahamut kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
I had a somewhat similar one (maybe not really; both just involve caches and were corrected errors). In my correspondence with AMD support, they said it may be due to my overclocked memory. That could also be your case, since you indicate that you are running at near-unstable values for your system. I would suggest seeing if you can reproduce it at lower (e.g. 2800) or stock RAM speeds.
In my case I have seen my MCE only once, during a test period of over 7 days running ryzen-kill.sh 8 (modified with -j 2 and with the apt-get lines removed; I installed those dependencies manually as I'm running Arch like you are) and prime95 at the same time, with my RAM at 3066. This was after receiving an RMA processor, having had the segfault problem with my first one. My MCE appears to be unrelated to the segfault issue: I didn't see it on the CPU that had the segfault problem and did see it on this otherwise problem-free one. Have you tried running the ryzen-kill.sh test script? Is your system otherwise stable at this RAM speed (e.g. passes the prime95 blend test for at least several days)?
That was with the X370 K7's F6 BIOS. I am seeing better stability with the F7a BIOS (brings AGESA 1.0.0.6b) and am now running the same test set, but at a (so far) stable 3200 MHz. I have not yet seen the MCE appear again.
If you don't see the error again then I think nothing needs to be done. These kinds of errors *shouldn't* happen, but of course they do on real hardware (hence why we have ECC RAM etc.). It was detected and corrected, so unless it is a semi-regular occurrence it is probably fine. It can't hurt to contact AMD support directly, but I think they will ask you to reproduce it at non-overclocked RAM speeds.
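If you want to check whether it really is a one-off or a semi-regular occurrence, here is a rough Python sketch that counts MCE reports per day in syslog-style kernel output like the log above. The function name `count_mce_events_per_day` and the regex are my own, and it assumes each report starts with a "Machine check events logged" line in the same format as the one quoted here; adjust it if your log looks different.

```python
import re
from collections import Counter

# Matches the first line of one MCE report, e.g.
# "Sep 10 14:00:38 bahamut kernel: mce: [Hardware Error]: Machine check events logged"
# Assumption: syslog-style "Mon DD HH:MM:SS host kernel:" prefix, as in the log above.
EVENT_RE = re.compile(
    r"(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+\S+\s+kernel:"
    r".*\[Hardware Error\]: Machine check events logged"
)


def count_mce_events_per_day(lines):
    """Return a Counter mapping 'Mon DD' to the number of MCE reports seen."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            counts[f"{match['month']} {int(match['day']):02d}"] += 1
    return counts


# Two lines taken from the log above; only the first one starts a report.
sample = [
    "[KERN] Sep 10 14:00:38 bahamut kernel: mce: [Hardware Error]: Machine check events logged",
    "[KERN] Sep 10 14:00:38 bahamut kernel: [Hardware Error]: Corrected error, no action required.",
]

for day, n in count_mce_events_per_day(sample).items():
    print(f"{day}: {n}")
```

The function takes any iterable of lines, so you can pass it an open log file or `sys.stdin` instead of the sample list, for example with the output of `journalctl -k` piped in.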