riveryeti
Adept I

Bad memory channel - how to test if mobo or CPU IMC (TR 3960x)?

My build:

ASRock TRX40 Creator | AMD TR 3960x | CORSAIR Vengeance LPX 32GB RAM (CMK64GX4M2D3000C16) | 2x EVGA RTX 2080 Super Hybrid | 2x Intel 660p NVMe | 2x Toshiba SATA HDD | Win10x64

My problem:

TLDR: I can't get memory recognized on slots A1 and A2 of the motherboard and I don't know how to tell if I have a bad mobo or a bad IMC on the CPU. Initially all slots reported RAM but system wasn't stable until it threw a Memory PMU training error after I went from default 2133MHz to 3000MHz (XMP 1) then back to default again.

I have tried multiple sticks of RAM in these slots. All other slots work (and all RAM works in other slots), but with configurations of 2 to 8 DIMMs (all the same RAM from the same batch) A1 and A2 give me "Memory PMU Training error at Socket 0 Channel 2 DIMM 0 & DIMM 1" (when both are occupied) or "Memory PMU Training error at Socket 0 Channel 2 DIMM 1" (when only using slots A2 and B2, per the Memory Configuration page of the motherboard manual for 2 sticks of RAM).
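For anyone logging these messages across many slot configurations, here is a minimal sketch that pulls the socket, channel and DIMM numbers out of an error string. It assumes the message has exactly the wording quoted above; `parse_pmu_error` is a made-up helper name, and the mapping from a channel number to a physical slot letter is board-specific, so the code does not attempt it.

```python
import re

# Matches the wording quoted in this thread, e.g.
# "Memory PMU Training error at Socket 0 Channel 2 DIMM 0 & DIMM 1".
# Other BIOS versions may word the message differently (assumption).
_HEADER = re.compile(r"PMU Training error at Socket (\d+) Channel (\d+)", re.IGNORECASE)
_DIMM = re.compile(r"DIMM (\d+)")


def parse_pmu_error(message):
    """Return (socket, channel, [dimm indices]), or None if the text does not match."""
    header = _HEADER.search(message)
    if header is None:
        return None
    # Only look for DIMM numbers after the socket/channel part of the message.
    dimms = [int(n) for n in _DIMM.findall(message[header.end():])]
    return int(header.group(1)), int(header.group(2)), dimms


if __name__ == "__main__":
    print(parse_pmu_error("Memory PMU Training error at Socket 0 Channel 2 DIMM 0 & DIMM 1"))
```

Keeping one parsed tuple per tested configuration makes it easy to see whether the same channel is reported every time.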

Initially I populated all 8 slots with RAM and benchmarked at 2133MHz. Then, when trying to run a SfM benchmark (the intended use of this machine), I got an unexpected reboot partway through. I tested the RAM overnight with WMD and came back to a frozen system in Windows. I rebooted, and Event Viewer said all the RAM was fine. I loaded XMP profile 1 (3000MHz) and it benchmarked great with Passmark (99th percentile, 6778 total, 43468 CPU, 2908 Memory). I tried the SfM benchmark again and got an unexpected reboot partway through again. I reloaded defaults and the BIOS finally threw the Memory PMU training error. The system was never stable at 2133MHz or 3000MHz until I got the error and A1 and A2 were disabled. Since they became disabled, I see the Memory PMU training error any time a stick is in A1 or A2, and I have never seen any stick of RAM work in them again.

Since BIOS threw the PMU error I haven't had any system freezes or reboots. I can populate all six other slots of the motherboard and run at 3000MHz (XMP profile 1) for days without an issue. Any time I put RAM in A1 or A2, XMP won't stick, BIOS cycles several times, and memory drops to 2133MHz with PMU error (even if only 2 sticks - in A2 and B2). After giving up on this channel (channel 2 apparently?) I gradually filled RAM and tested at 2133MHz and 3000MHz for C2/D2, C1/D1, and finally B1&B2 and with all configurations I am successfully running at XMP profile 1.

Is it possible to test if it's a bad mobo or IMC without swapping out another one of either (or both)?

65 Replies
mantisman13
Adept II

Not much useful in the event logs. The best I get is from looking at minidumps, but that doesn't give much help either. Sorry about the mail comment, it's just that you were missing things that another poster and I had already posted. I have tried to run Ryzen Master, but it needs to have VBS disabled and I haven't found an obvious setting in my BIOS for that yet. It may be something that gets set if you agree to the overclocking statement in the BIOS, which would then void my CPU warranty, and I'm not willing to do that, especially now. I agree that RM would yield measurements that are more pure, as they do not have to go through the system layer, but I'm not sure that will really matter much for the issue I'm having. I have no reason to distrust the voltage readings I'm getting out of CPU-Z or HWiNFO64.

mantisman13, SVM (Secure Virtual Machine) must be Disabled to run RM.  I only enable it when I run Hyper-V (W10 virtual machine):

pastedImage_1.bmp

Screenshot from my system.  SVM should not get set/reset for any reason except User action.  I do not need to agree to any OC stuff to set SVM on/off.  RM is also the best way to mess with most BIOS items.  Please notice that I have 'Power Supply Idle Current' set to 'Typical Current Idle'.  This seemed to help when this first popped up.    If you do not mind, please compress your Minidump folder and attach it here.  Enjoy, John.

Thanks MisterJ and @drdocumentum

I have located the setting in the ASRock BIOS, but haven't yet tried it out with RM. I did a bunch of testing with the DIMM timing and fabric timing over the long weekend. Setting the fabric clock to 1600 with 3200 for the DIMMs would work, but was unstable: I would end up freezing and have to power cycle, with no BSOD. I believe I had this same issue when I was first getting the system set up and found some thread suggesting that underclocking the fabric would help. I think that is why I was running at 1500. At 1500, I'm rock solid now with the 3-way memory running at these settings.

pastedImage_1.png

My voltage is set to 1.35 but HWiNFO64 reports it a bit higher, at 1.366. So maybe once I get RM going I'll see a different reading. At any rate, I sprang for another ASRock board (a Creator is all I could get my hands on) and a 3960X for a new build. I should have it all in hand in a few days, and then I can test my 3970X with the new board and see if I can get 8 slots working with the same memory settings.

I really don't want to tear it all down. I have a RAID 10 of 6 SATA drives, and the last time I had an issue with one of the SATA cables, the drive IDs got messed up (this might have been due to hot swapping SATA drives on a hot-swap port around the same time) and it took a week for the RAID to rebuild, and that was before I even had any real data on it. I'm honestly not sure how well transplanting them to a new board will go, and think it might be easier if it's the CPU. Do either of you have any experience with how AMD RAID recovers itself when moved between boards? Same question for my NVMe drives, which are in RAID 0. But at least with those I think I can pull one and let it just rebuild onto a blank, as either will have all the data. With RAID 10, the data is set up as 3 sets of mirrors. I'm afraid that if the drive IDs somehow get shifted or mixed up, the RAID won't see its members and won't come back up. I feel like it should work so long as I make sure that each drive goes back to the same SATA port ID that it's currently on, but I have to wonder, as I've seen the IDs shift before.

mantisman13, I think I have told you that I do not believe HWiNFO64. I have moved RAIDs between MBs of different vendors with no problems. You should not even need to stick to the same ports. As always, have a fresh backup available. All my RAIDs were created by the same AMD RAID system. A RAID0 has no redundancy, so if you pull one drive, it is marked FAILED and all the data is lost. I am anxious to hear the results of a new MB and the old processor. Thanks and enjoy, John.
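To make the redundancy point concrete, here is a rough sketch of which drive losses each layout tolerates. It is not how AMD's RAID software works internally; the drive size, the drive numbering and the way drives are paired into mirrors are all assumptions made for illustration.

```python
def raid10_usable_tb(drive_tb, n_drives):
    """Usable capacity of RAID 10: half the raw total, since every drive is mirrored."""
    if n_drives < 4 or n_drives % 2:
        raise ValueError("RAID 10 needs an even number of drives, at least 4")
    return drive_tb * n_drives / 2


def raid10_survives(failed, mirror_pairs):
    """RAID 10 keeps its data unless both members of some mirror pair are lost."""
    failed = set(failed)
    return not any(a in failed and b in failed for a, b in mirror_pairs)


def raid0_survives(failed):
    """RAID 0 stripes without redundancy, so losing any member loses the array."""
    return len(set(failed)) == 0


if __name__ == "__main__":
    # Hypothetical pairing of 6 drives into 3 mirrors, as described in the thread.
    pairs = [(0, 1), (2, 3), (4, 5)]
    print(raid10_usable_tb(8.0, 6))          # 8 TB drives are an assumed size
    print(raid10_survives({0, 2, 4}, pairs))  # one drive from each mirror
    print(raid10_survives({0, 1}, pairs))     # both halves of one mirror
    print(raid0_survives({0}))
```

A RAID 0 pair cannot be rebuilt from one surviving member, because neither drive holds a complete copy of the data.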

Back to original subject.  I have RMAed my 3970X for a bad memory controller, but still can only run in triple channel mode.  Something else is broken on my system.  Enjoy, John.

Hi MisterJ, that sucks that it didn't solve the issue. I've had a very busy month but have been meaning to do an update. Look for it on the main thread shortly.

0 Likes

Unfortunately, I can see that running the IF at its correct frequency is defective on ASRock motherboards. I had the same lack of stability on my ASRock Creator when configuring the IF frequency. Now I can confirm it is a design defect, as yours has this problem too.

I suppose ASRock knows about this, and that is the reason why their BIOS doesn't configure it correctly on the "Auto" setting. On my Gigabyte MB, the IF frequency works fine. Make no mistake, this IS a defect, as AMD's own recommendation is to set the IF frequency in sync with the memory clock. This is supported up to 1866 MHz IF / DDR4-3733 memory:

AMD 2019 Ryzen memory selection.JPG

You should return your Asrock.
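For anyone checking their own settings, the "in sync" rule above is simple arithmetic: DDR memory transfers twice per clock, so the memory clock is half the DDR rating, and a 1:1 configuration sets the IF clock (FCLK) to that value. The sketch below is only an illustration; the function names and the 1 MHz tolerance are made up here and do not come from any AMD or ASRock tool.

```python
def memclk_mhz(ddr_rate):
    """Memory clock in MHz for a DDR rating (DDR4-3200 -> 1600 MHz)."""
    return ddr_rate / 2


def is_synced(ddr_rate, fclk_mhz, tolerance_mhz=1.0):
    """True if the fabric clock matches the memory clock (1:1 mode).

    The tolerance absorbs rounding such as DDR4-3733 vs. a 1866 MHz FCLK.
    """
    return abs(memclk_mhz(ddr_rate) - fclk_mhz) <= tolerance_mhz


if __name__ == "__main__":
    for ddr, fclk in [(3200, 1600), (3733, 1866), (3000, 1500), (3200, 1500)]:
        state = "in sync" if is_synced(ddr, fclk) else "not in sync"
        print(f"DDR4-{ddr} with FCLK {fclk}: {state}")
```

Note that a 1500 MHz fabric clock is in sync with DDR4-3000 but not with DDR4-3200.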

Something is up for sure, but it's not just the ASRock IF auto setting. It's definitely unstable when I manually set it to 3200 and use the XMP profile for that. Downclocking the memory from the advertised stable profile setting has been what has worked for me all along, even before I lost A. I do find it interesting that Corsair no longer seems to offer the kit I bought as an 8x32; they stop at 8x16 https://www.corsair.com/us/en/Categories/Products/Memory/Vengeance-PRO-RGB-White/p/CMW128GX4M8C3200C... . To get an 8x32 you have to move to the 3000MHz kits https://www.corsair.com/us/en/Categories/Products/Memory/Vengeance-PRO-RGB-Black/p/CMW256GX4M8D3000C... . And interestingly enough, that is where I seem to be stable.

pastedImage_3.png

So the packages came today. No case or UPS yet, but I have enough to do a bench build to test the CPU in a similar board (ASRock Creator) with a clean Windows install. I'll get it up and going with 64GB and a 3960X, then bring over the rest of the RAM, then swap in the 3970X to see if it starts to fail like it is now. If it does, I'll look to AMD for an RMA; otherwise ASRock, which has already said to start the RMA process. I guess if the Creator fails to do full 8-way it may yet require a call to Corsair before both boards go back and I try out another maker. But I have otherwise been very pleased with the Taichi, so I'm willing to give it all the benefit of investigation I can manage.

Btw, I came across this guide from Corsair in my searching. It might be helpful to anyone else looking at these issues.

https://www.corsair.com/corsairmedia/sys_master/productcontent/Ryzen3000_MemoryOverclockingGuide.pdf 

0 Likes

It seems that memory faster than DDR4-2400 all has XMP profiles; JEDEC reaches DDR4-3200, which is what I use with my R5 3600.

DDR5 is probably a good ways off.

0 Likes
mantisman13
Adept II

So it's been about a month and I am still stuck with a MOBO that will not work with A channel memory. I have, however, satisfied myself that all of the memory sticks are fine, as is the 3970X CPU. I dumped some more cash and built up a nice 3960X box with an ASRock Creator board. If I could have gotten another Taichi I would have, for apples to apples, but this was close enough. This new system was a bit simpler: air cooled, just a RAID 1 set of 1TB NVMe drives for system, no massive storage RAID. I used the same brand and type of graphics card, so while stripped down, it has basically the same drivers as the original system. There are some differences with Windows: my problem computer is activated Win10 Pro for Workstations, the new one just plain Win10 Pro and not activated. Aside from that, they were at the same major build and feature levels for my tests. Both systems had up-to-date BIOS and drivers.

I built the new system up and ran it for a while with the 3960X and tuned it for thermal performance. I could run it with the automatic BIOS settings, but it would get above 90C fast with Folding@home. I found that undervolting and running at 4000MHz, the same as I'm doing with the 3970X, works well to keep temps around 70C at 90% load and still avoid too much fan noise. With the 3960X, I validated that I can run ALL 256GB at 4x at 3200MHz using the profile provided without any issue. I could do this with my static clock speed or letting the BIOS dynamically adjust (just with higher temps). I then swapped in the 3970X CPU from my original system and repeated all my tests. Everything worked fine using 256GB in the new system with the older CPU, and I ran many stress tests for many hours as well as a full 24 hr of folding loads. Put things back and did one more full mem test with the old system to end up with the same results. The fact that I get the memory training error before the bios even loads I think is a key tell. So I conclude the issue is with the MOBO.

I have asked ASRock if they can send me a replacement so I can swap with minimal down time and it's crickets. They have indicated I have to fill out a web form to do an RMA, and that form says no cross-ship. I just can't have the system down and wait for them to look at it and then tell me it was damaged in some way that gets them off the hook. I'm not pleased with their support at all. I'd be happy to pay for the board and get a credit on return, but it doesn't seem like they will do that. I'd like to get a new board to swap, but those still seem to be unavailable due to all of the supply chain issues. So I live with this for now. I'm running somewhat more stable after a few Windows system repairs, which were likely needed because of all the hard crashes. I just went 9 days without a freeze-up, but that is still unacceptable, and at some point I will want all the memory. The new system is rock solid, like this one used to be, and running a bit faster due to the higher speed RAM settings that I can use with the 64GB set.
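The swap test described in this post is an elimination argument, and it can be written down as a small sketch. The idea: any part that took part in a fully working combination is cleared, and whatever appears only in failing combinations remains suspect. This assumes a single faulty part and no faults that only show up with one specific pairing; the part labels and the `suspects` helper are illustrative names, not output from any diagnostic tool.

```python
def suspects(trials):
    """Return the parts seen in failing builds that never appeared in a passing build.

    trials: list of (parts, passed), where parts is a set of part labels and
    passed is True when all memory channels trained and the build ran stably.
    """
    cleared = set()
    implicated = set()
    for parts, passed in trials:
        if passed:
            cleared |= set(parts)
        else:
            implicated |= set(parts)
    return implicated - cleared


# The three combinations reported in this post.
trials = [
    ({"3960X", "Creator board", "256GB RAM"}, True),
    ({"3970X", "Creator board", "256GB RAM"}, True),
    ({"3970X", "Taichi board", "256GB RAM"}, False),
]

if __name__ == "__main__":
    print(suspects(trials))
```

With these three results, the only part never seen in a working combination is the Taichi board, which matches the conclusion above.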

0 Likes

Last MB to crap out I had it replaced no problem under warranty. 

I use an Intel 665p 2TB SSD. I criticised it because it is single-sided, so more chips on the bottom could have made a larger capacity model. The 665p is not the fastest, but it is still easily 5x faster than a SATA SSD.

I have developed several stress tests for my own use in my studio. Computer chess is more demanding than folding or digital coins. Chess enthusiasts often use 128GB of RAM or more. Chess does not use float() at all, but then comes Leela, which uses the Turing logic.

0 Likes

When you RMA'd your MOBO, did they send you a new one that you could swap in before returning the bad board, or did you have to tear down your computer, send it in, and wait? That is my issue. Based on my description over email, they pointed me to an RMA from the start, so I can't fault them there. But I didn't want to tear down the computer and have it turn out to be the CPU, so I went deeper, and I had enough other reasons to want to build a second box that it was a good way to be sure. So now it's mostly about preventing down time, and when I asked if they could just send me a new board they never responded. They may not have any to send right now; I can't even buy one.

0 Likes

MSI has a local office. I had to tear down the machine to send it in, but I have other machines I can use.

The motherboard was bricked so I had no choice but to pull the machine apart.

The replacement came 7 days later.

0 Likes

mantisman13,  I do not understand.  Please see my post from yesterday above.  I too get the Training error after 3970X RMA.  Please be very explicit about what systems you are running and what this means:

Put things back and did one more full mem test with the old system to end up with the same results. The fact that I get the memory training error before the bios even loads I think is a key tell.

 ... minimal down time and it's crickets.

 

Do you have two or one running systems? Why do you think the BIOS is not loaded when the Training error appears? I assume it is the BIOS that is posting the error message; where else would it come from? You do know you can run, and run well, in triple channel mode. Please do not forget to install a fresh copy of W10 after changing from 3960X to 3970X or vice versa. Enjoy, John.

0 Likes

I built a completely new system (new MOBO, system storage, CPU) so I could test the memory and CPU from my original system that was having the memory issues, to better isolate whether the issue was with the MOBO, the CPU, or some memory timing issue. The training error I have on the Taichi board shows up (when it does; it doesn't always, but mostly does) before it gets to the F2/Del screen, so it's definitely before it goes into loading the boot disk. I guess the BIOS could have loaded at that point, but I really can't be sure if it has or hasn't. At any rate, the error comes up before you can get into the BIOS to make changes, and if you do, you will see no memory detected in the A slots. I'm thinking this is part of the board's built-in memory check, as you can see it going through the error codes with DR MOS. Maybe a new BIOS would fix it, but I kind of doubt it at this point.

0 Likes

Thanks, mantisman13. I was thinking that your new MB worked with both the 3960X and the 3970X but the old one, the Taichi(?), did not. Did the 3960X work with the old board? I have a copy of 'Aptio_V_Status_Codes.pdf', which has all the boot codes. Obviously your MB is running the training code, as mine does, and failing, so it does not get through POST to show F2/Del or start loading W10. The stepping for both my 3970Xs (old and replacement) was/is SSP-B0; yours? Do both MBs fail now with one or both processors? What is the old MB? My 3970X scores almost 17,000 on Cinebench R20, and the old 3970X exceeded 17,000. I am running in NUMA mode, which helps memory performance. Thanks and enjoy, John.

0 Likes

MisterJ, I only tested both CPUs on the newer Creator board. Once I validated that I could run the full set of 256GB on the Creator board with either CPU at default settings, I was convinced the issue was with the Taichi board and not the 3970X, and since the Windows system was activated on the Taichi, I didn't want to mess with changing the CPU for it. I don't think the test would have been of any value at that point. Thanks for the link for the AM codes. I haven't seen that and it may come in handy.

Cheers

John

0 Likes

PS, I finally found a Taichi board on offer from Amazon, so I'll be swapping the old one out for the new one in the next couple of weeks. If the new board returns me to 4x and proves stable, I guess I'll then try the RMA process to get a replacement board. The question then would be whether I build another box or keep it as a spare. These Threadripper builds are not cheap.

0 Likes

Thanks, mantisman13.  There is no reason you cannot change the processor on an activated copy of W10.  There is some limit as to how much HW you can change but one processor will be fine.  I do recommend that you install a fresh copy of W10 when the processor version changes.  Enjoy, John.

0 Likes

misterj, sure, a fresh copy is always nice for getting rid of the weird corruption that Windows systems always get plagued with, especially if you are making significant changes in the underlying hardware. However, it takes hundreds of hours to reinstall and configure everything I ask this system to handle. Hence the redundancy for storage, replication and backup schemes. Once I commit to a system install, short of total disaster, I'm not going back through that for another 10 years. I guess this is still my burn-in period, but I was getting a bit deep into it when I started having issues.

Windows will of course automatically detect hardware changes and re-apply the license for minor changes like a CPU swap, but it can cause issues. For example, I had to reactivate once I returned my CPU, as Windows had migrated my unactivated install to the license for my 3970X when I tested on the Creator board. The CPU, I believe, is the primary system identifier for the license. Windows will do this a few times, but just how many, and what their algorithm is, they don't tell you. At some point you might end up trying to get their support to reset the activation for you. So I'm just not going to run up the strikes without good cause. Also, all of the TR 39xxX versions use the same controller drivers, so no issues there, but yeah, if I were trying to migrate a Windows install to a new MOBO/CPU combo, the driver transplants can get very dicey, though doable.

I know we talked about reconnecting the SATA drives for the AMD expert RAID above. I've done more reading on that since then and found many others with issues where the drives become unavailable and the array has to rebuild. I strongly believe I will have to maintain the exact same port connection per disk to keep the RAID from getting confused as to which physical disk is which. For instance, I had a cable that was just a bit tight to the side panel and put stress on the connector at the disk. The drive went down. All I did was replace the cable, but since the RAID already thought the drive had died, when the drive came back up it was seen as a replacement, and the whole 22T RAID had to rebuild, which took days. It's been a few AMD updates since then, but it's just not something I'm going to chance. So each drive will go to the exact same port, and all arrays will be in a normal state before I start the hardware swap. If it all goes well, I just need to start up, go into the BIOS, configure the SATA and NVMe to use RAID, and it should all hook up. If it doesn't, I'm f'd.

0 Likes

Thanks, mantisman13.   As you wish.  I continue to recommend it because I have seen many unhappy users wondering where all their cores went.  They see their new 16 core Ryzens with 4 cores after upgrade.  In my opinion, I think W10 needs to see the processor.  Different processors need different NUMA support and other scheduling support.  I have fixed a few RAID configurations and maybe can help there.  I almost always install a fresh copy of W10 when most anything changes including a major W10 release.  I have had to call MS a couple times over the years.  Enjoy, John.

0 Likes
misifu
Journeyman III

Hi, I have the same problem.

Gigabyte TRX40 DESIGNARE + 3970x + 128Gb (8x16Gb).

All of a sudden it started giving the same "PMU Training error at socket..." and I found out that it is channel D that does not work.

I'm in the same position: I don't know if it's the motherboard, the CPU, or something in the BIOS.
The RAM works perfectly as long as you don't put modules in D1 / D2.

In the end, how did you solve it?
Thank you so much.

0 Likes

I had the same problem with the same motherboard. It turned out that the 4 screws that hold the socket had been badly tightened at the factory.

Att,

Giancarlo Bergamo Cecilio

0 Likes
SmithAsani
Journeyman III

Hi... I have also seen this same sort of error and issue, though it was a while back on a much less powerful machine (ASRock Z77 Extreme6 + 3770K). I ended up discovering I had a bent pin in the motherboard socket, which was causing an issue with DIMM #3. When I put any RAM in that slot, the motherboard wouldn't boot, but it would OC the other sticks if you only populated three of the slots!

Two other very skilled technicians I know and I all tried a few times to fix the 5-6 bent pins. No one could actually get the system to work properly with all 4 DIMMs populated. I think it's worth checking the CPU's pins, but aren't the pins hidden in the motherboard's socket now? That is how my X570-E is with my 3950X.

You can find videos of people like LTT trying (and eventually succeeding at) fixing bent pins on CPUs. LTT even adds donor pins from a spare CPU, something I actually found impressive (that doesn't happen often with him). Never let anybody tell you that it's easy to do, though; with CPU/mobo pins it nearly comes down to luck.

0 Likes
matsuokah
Journeyman III

Here I am 3 years later with a TR 3960x with what appears to be the same flavor of problem as the ones described here.

The system was working fine for almost 2 years, then one morning it was suddenly off and wouldn't POST, sitting in what looks like a reboot loop ending in code 0d on the MB (Gigabyte TRX40 Aorus Master). I pulled the 256GB of RAM and tested it in another AMD system (not a TR); all 8 sticks are working. I moved the CPU to a brand new MB (ASRock TRX40 Taichi): no change in behavior. Tried RAM from the other working system: no change in behavior. Removed the old RX480 graphics card that had been running for the past 2 years and replaced it with a much lighter Radeon HD5570: same behavior. Tried just getting to POST with no HDD: same behavior. No matter what I do, it goes straight to the 0d MB code.

Next step I will try based on reading this page: avoiding certain RAM slots, trying more combinations of fewer sticks. Will report results.

But 2 questions, if anyone happens back upon this: has anyone ever heard of a CPU suddenly going bad out of the blue with no change in operating conditions/no tear-down/etc; and is there any chance at all that this could be PSU-related? I don't have another PSU with 2 cpu cables and really don't feel like springing for $200+ for something that to me seems like a very very long shot to be the cause of the problem.

0 Likes

Updating... no combination of DIMMs made any difference; every try ran straight to the 0d error code with no POST. It seems it's down to the CPU or the PSU, and as much as I'd prefer it to be the latter, it's hard for me to believe that could be it.

0 Likes