1 Reply Latest reply on Sep 16, 2017 7:13 PM by ttbek

    Meaning of L2 Cache Error on processor AMD Ryzen 7?

    asmgeek

      Hi,

      Processor speed: 3 GHz, no overclocking.

      I got an error message on Arch Linux (kernel 4.12.4-1-ARCH):

      Aug 07 08:22:24 muon kernel: [Hardware Error]: Corrected error, no action required.

      Aug 07 08:22:24 muon kernel: [Hardware Error]: CPU:8 (17:1:1) MC2_STATUS[Over|CE|MiscV|PCC|-|-|Poison|SyndV|-|CECC]: 0xdb30480b001f102c

      Aug 07 08:22:24 muon kernel: [Hardware Error]: IPID: 0x000200b000000000, Syndrome: 0x0000000000000000

      Aug 07 08:22:24 muon kernel: [Hardware Error]: L2 Cache Extended Error Code: 31

      Aug 07 08:22:24 muon kernel: [Hardware Error]: cache level: RESV, tx: RESV

       

      My RAM is DDR4-3200 running at 2933; I am unable to get the actual timings without an unstable kernel.

      I tried increasing the RAM speed to ~3000, and that is when the problem occurred.

      I know there may be a problem with AMD Ryzen, but I don't know whether this is related to the segfault problem: gcc segmentation faults on Ryzen / Linux

       

      What is the meaning of this code?

      Is there anything I should do?

        • Re: Meaning of L2 Cache Error on processor AMD Ryzen 7?
          ttbek

          [KERN] Sep 10 14:00:38 bahamut kernel: mce: [Hardware Error]: Machine check events logged

          [KERN] Sep 10 14:00:38 bahamut kernel: [Hardware Error]: Corrected error, no action required.

          [KERN] Sep 10 14:00:38 bahamut kernel: [Hardware Error]: CPU:13 (17:1:1) MC1_STATUS[-|CE|MiscV|-|-|-|-|SyndV|-]: 0x98200000000b0151
          [KERN] Sep 10 14:00:38 bahamut kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000004a000000
          [KERN] Sep 10 14:00:38 bahamut kernel: [Hardware Error]: Instruction Fetch Unit Extended Error Code: 11
          [KERN] Sep 10 14:00:38 bahamut kernel: [Hardware Error]: Instruction Fetch Unit Error: L2 BTB multi-match error.
          [KERN] Sep 10 14:00:38 bahamut kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD

           

          I had this one, which is somewhat similar (maybe not really; both just involve caches and were corrected errors). In my correspondence with AMD support, they said it may be due to my overclocked memory. That could also be your case, since you indicate that you are running at near-unstable values for your system. I would suggest seeing if you can reproduce it at lower (e.g. 2800) or stock RAM speeds.

          In my case I have seen my MCE only once during a test period of over 7 days, running ryzen-kill.sh 8 (modified with -j 2 and with the apt-get lines removed; I installed those dependencies manually, as I'm running Arch like you are) and prime95 at the same time, with my RAM at 3066. This was on the processor I received by RMA after having the segfault problem with my first one. My MCE appears to be unrelated to the segfault issue: I didn't see it on the CPU that had the segfault problem, and I did see it on this otherwise problem-free one. Have you tried running the ryzen-kill.sh test script? Is your system otherwise stable at this RAM speed (e.g. does it pass the prime95 blend test for at least several days)?

          That was with the X370 K7's F6 BIOS. I am seeing better stability with the F7a BIOS (which brings AGESA 1.0.0.6b) and am now running the same test set at a (so far) stable 3200 MHz. I have not yet seen the MCE appear again.
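
As for what the raw status value encodes: the bracketed flags and the "Extended Error Code" in both logs can be recovered from the 64-bit number itself. Below is a rough Python sketch of that decoding. The bit positions are an assumption on my part, taken from how the Linux kernel names the MCi_STATUS bits on these CPUs, and `decode_mca_status` is just a name I made up, so please check it against AMD's documentation before relying on it.

```python
# Assumed MCi_STATUS bit positions (as named in the Linux kernel for AMD
# family 17h). Treat these as an assumption, not an official reference.
FLAG_BITS = {
    "Val": 63, "Over": 62, "UC": 61, "En": 60, "MiscV": 59, "AddrV": 58,
    "PCC": 57, "TCC": 55, "SyndV": 53, "CECC": 46, "UECC": 45,
    "Deferred": 44, "Poison": 43,
}


def decode_mca_status(status):
    """Split a raw 64-bit MCA status value into flags and error codes."""
    flags = [name for name, bit in FLAG_BITS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        # The kernel prints "CE" when neither UC nor Deferred is set.
        "corrected": "UC" not in flags and "Deferred" not in flags,
        # Bits 21:16 hold the extended error code on these CPUs (assumption).
        "extended_error_code": (status >> 16) & 0x3F,
        # Bits 15:0 hold the generic error code.
        "error_code": status & 0xFFFF,
    }


# The two status values quoted in this thread.
for raw in (0xdb30480b001f102c, 0x98200000000b0151):
    info = decode_mca_status(raw)
    print(hex(raw), info["flags"], info["extended_error_code"])
```

For the two values in this thread this gives extended error codes 31 and 11, which matches what the kernel printed, and neither value has the UC bit set, consistent with "Corrected error".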

           

          If you don't see the code again, then I think nothing needs to be done. These kinds of errors *shouldn't* happen, but of course they do on real hardware (which is why we have ECC RAM, etc.). It was detected and corrected, so unless it is a semi-regular occurrence it is probably fine. It can't hurt to contact AMD support directly, but I think they will ask you to reproduce it at non-overclocked RAM speeds.
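
To judge whether it is a semi-regular occurrence, one option is to count the "Corrected error" lines per day in a saved kernel log. Here is a minimal Python sketch, assuming the log lines look like the ones quoted in this thread (syslog-style timestamp, hostname, then `kernel:`); the function name `corrected_errors_per_day` and the regular expression are my own and may need adjusting for other log formats.

```python
import re
from collections import Counter

# Matches e.g. "Aug 07 08:22:24 muon kernel: [Hardware Error]: Corrected error".
# The pattern is an assumption based on the log lines quoted in this thread.
EVENT = re.compile(
    r"([A-Z][a-z]{2} +\d{1,2}) \d{2}:\d{2}:\d{2} \S+ kernel: "
    r"(?:mce: )?\[Hardware Error\]: Corrected error"
)


def corrected_errors_per_day(lines):
    """Count corrected machine-check events per calendar day."""
    counts = Counter()
    for line in lines:
        match = EVENT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "Aug 07 08:22:24 muon kernel: [Hardware Error]: Corrected error, no action required.",
    "Aug 07 08:22:24 muon kernel: [Hardware Error]: L2 Cache Extended Error Code: 31",
    "[KERN] Sep 10 14:00:38 bahamut kernel: mce: [Hardware Error]: Machine check events logged",
    "[KERN] Sep 10 14:00:38 bahamut kernel: [Hardware Error]: Corrected error, no action required.",
]
for day, count in corrected_errors_per_day(sample).items():
    print(day, count)
```

You could feed it the lines of a file holding your kernel log (for example, saved output of `journalctl -k`). A count that stays at one over many days of stress testing is a very different picture from several per day.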