My build:
ASRock TRX40 Creator | AMD TR 3960x | CORSAIR Vengeance LPX 32GB RAM (CMK64GX4M2D3000C16) | 2x EVGA RTX 2080 Super Hybrid | 2x Intel 660p NVMe | 2x Toshiba SATA HDD | Win10x64
My problem:
TLDR: I can't get memory recognized on slots A1 and A2 of the motherboard and I don't know how to tell if I have a bad mobo or a bad IMC on the CPU. Initially all slots reported RAM but system wasn't stable until it threw a Memory PMU training error after I went from default 2133MHz to 3000MHz (XMP 1) then back to default again.
I have tried multiple sticks of RAM in these slots. All other slots work (and all RAM works in other slots), but with configurations of 2 to 8 DIMMs (all the same RAM from the same batch) A1 and A2 give me "Memory PMU Training error at Socket 0 Channel 2 DIMM 0 & DIMM 1" (when both are occupied) or "Memory PMU Training error at Socket 0 Channel 2 DIMM 1" (when only using slots A2 and B2, per the Memory Configuration page of the motherboard manual for 2 sticks of RAM).
Initially I populated all 8 slots with RAM and benchmarked at 2133MHz. Then when trying to run a SfM benchmark (intended use of this machine) I got an unexpected reboot partway through. Tested the RAM overnight with Windows Memory Diagnostic and came back to a frozen system in Windows. Rebooted and Event Viewer said all the RAM was fine. Loaded XMP profile 1 (3000MHz) and benchmarked great with Passmark (99th percentile, 6778 total, 43468 CPU, 2908 Memory). Tried the SfM benchmark again and got an unexpected reboot partway through again. Reloaded defaults and BIOS finally threw the Memory PMU training error. The system was never stable at 2133MHz or 3000MHz until I got the error and A1 and A2 were disabled. Since they became disabled, I see the Memory PMU training error any time a stick is in A1 or A2, and I have never seen any stick of RAM work in them again.
Since BIOS threw the PMU error I haven't had any system freezes or reboots. I can populate all six other slots of the motherboard and run at 3000MHz (XMP profile 1) for days without an issue. Any time I put RAM in A1 or A2, XMP won't stick, BIOS cycles several times, and memory drops to 2133MHz with PMU error (even if only 2 sticks - in A2 and B2). After giving up on this channel (channel 2 apparently?) I gradually filled RAM and tested at 2133MHz and 3000MHz for C2/D2, C1/D1, and finally B1&B2 and with all configurations I am successfully running at XMP profile 1.
Is it possible to test if it's a bad mobo or IMC without swapping out another one of either (or both)?
From googling the error, it seems it might indicate faulty RAM.
1- http://forum.gigabyte.us/thread/8389/memory-training-error-socket-channel
3- This link indicates a bad Motherboard since the User installed 3 different RAM Modules on the two non working DIMM Slots and one other User said a bent pin on the CPU caused the issue- https://www.reddit.com/r/buildapc/comments/e9jks6/2_of_my_ram_slots_wont_work/
Of course it could still be defective motherboard DIMM slots, controller, or CPU.
Try running MemTest86 and see if any errors show up with all RAM modules installed. Make sure the BIOS/UEFI is reset to factory default settings. That should help eliminate defective RAM modules.
I ruled out defective RAM because I can put 8 different sticks of RAM in A1 or A2 and none of them work, ever. But all of them work, always, if I put them in B1,B2,C1,C2,D1, or D2.
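The swap-test reasoning above (every stick fails in A1/A2, every stick works elsewhere) can be written down as a small check. This is only an illustrative sketch: the function name `isolate_fault` and the `(stick, slot, passed)` record layout are made up for this example, and it can only say whether failures follow the slot or the stick, not whether the slot problem is in the motherboard or the CPU's memory controller.

```python
from collections import defaultdict

def isolate_fault(results):
    """Classify swap-test results as following the slot or the stick.

    results: iterable of (stick_id, slot, passed) tuples (hypothetical format).
    Returns (verdict, suspects) where verdict is 'slot', 'stick',
    'none' (no failures at all) or 'inconclusive'.
    """
    results = list(results)
    by_slot = defaultdict(list)
    by_stick = defaultdict(list)
    for stick, slot, passed in results:
        by_slot[slot].append(passed)
        by_stick[stick].append(passed)

    # A slot is suspect if every stick tried in it failed.
    bad_slots = {slot for slot, outcomes in by_slot.items() if not any(outcomes)}
    # A stick is suspect if it failed in every slot it was tried in.
    bad_sticks = {stick for stick, outcomes in by_stick.items() if not any(outcomes)}

    failures = [(stick, slot) for stick, slot, passed in results if not passed]
    if not failures:
        return "none", set()
    if not bad_sticks and all(slot in bad_slots for _, slot in failures):
        return "slot", bad_slots
    if not bad_slots and all(stick in bad_sticks for stick, _ in failures):
        return "stick", bad_sticks
    return "inconclusive", set()

# Example mirroring the post: 8 sticks, each fails in A1 and works in B1.
tests = [(f"stick{i}", "A1", False) for i in range(8)]
tests += [(f"stick{i}", "B1", True) for i in range(8)]
print(isolate_fault(tests))
```

With the results described in the post, the verdict is "slot", which is why the RAM itself was ruled out.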
This is precisely what I was going to add to his question. I too have seen this exact kind of error and issue, but it was a little ways back on a much less powerful machine (ASRock Z77 Extreme6 + 3770K). I ended up finding out I had a bent pin on the motherboard socket, this was causing an issue with DIMM #3. When I put any RAM in that slot, the motherboard wouldn't boot, but it would OC the other sticks if you only filled three of them!
Two other highly efficient technicians I know and I all tried a few times to fix the 5-6 bent pins. Nobody could ever get the system to work properly with all 4 DIMMs filled. I think it's worth checking the CPU's pins, but aren't the pins hidden on the motherboard's socket now? That is how my X570-E is with my 3950X.
You can find videos of people like LTT trying (and eventually succeeding at) fixing bent pins on CPUs. LTT even adds donor pins from a spare CPU, something I actually found impressive (doesn't happen often with him). Never let anyone tell you that it's easy to do, though; it almost comes down to luck with CPU/mobo pins.
Companies don't intend anyone to fix anything by hand (for the most part) when we're talking about surface mount components. I deal with companies like Burson Audio, Orange Amplifications and I know the owner of Sparkos Labs (all make op-amps). Anytime something has gone bad, they tell me not to even try fixing them, haha. That is surface mount parts, dealing with micro anything is purely machine. The pins aren't really part of that, but they still are the physical I/O ports for the entire CPU.
All I can say is good luck my friend! If you can get one replaced on warranty, do that.
riveryeti, I have this problem and assumed it was the MB, but do not know how to suggest you differentiate. This is a user forum, so I suggest you contact AMD Online Support. They should also be able to tell you if they see many memory controller problems. It is a major pain to swap the MB, but it would be my first try. I have a 3970X with slot C2 problems. I have not tried C1 yet, but will soon. I assume you are running three channel mode. It should run well there - 75% data rate. I will try that if A1 -> D1 don't work. I am too tired to swap my board - it would be the fourth time (3 on the 2990WX). My 3970X ran fine for several weeks, then threw a fit - crash and memory errors 026 decimal, Severe memory management error. I have a favor to ask: The message I get is close to "PMU Memory Training Error Socket 0, Channel 3, Dimm 1". Can you post your equivalent error and tell me how you tied it to A1/A2? After some research I decided that PMU is "Power Management Unit" and I think it is code in the BIOS similar to SMU (System Management Unit). I think AMD releases it and MB vendors integrate it into the BIOS. SMU version is revealed by the AIDA64 application; I do not know about PMU. Thanks and enjoy, John.
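The "75% data rate" figure for three-channel mode follows from the usual theoretical peak bandwidth formula (transfers per second times bus width times number of channels). A rough sketch, assuming a standard 64-bit (8-byte) DDR4 channel; the function name is made up for this example, and real-world throughput will be lower than these peak numbers.

```python
def peak_bandwidth_gbs(data_rate_mts, channels, bus_bytes=8):
    """Theoretical peak DRAM bandwidth in GB/s.

    data_rate_mts: module data rate in MT/s (e.g. 3000 for DDR4-3000)
    channels:      number of populated memory channels
    bus_bytes:     bytes per transfer per channel (assumed 8 for a 64-bit channel)
    """
    # MT/s * bytes per transfer = MB/s; divide by 1000 for GB/s.
    return data_rate_mts * bus_bytes * channels / 1000

quad = peak_bandwidth_gbs(3000, 4)
triple = peak_bandwidth_gbs(3000, 3)
print(f"quad: {quad} GB/s, triple: {triple} GB/s, ratio: {triple / quad:.0%}")
```

Dropping one of four channels removes a quarter of the theoretical peak regardless of the data rate.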
Thanks for that link, John. I couldn't find it, so posted on the user forum. I submitted a ticket to ASRock basically asking the same thing (how can I tell if mobo or IMC)?
My error was in post, and showed on a black screen. If I had the A1 and A2 dimms filled, the error was "Memory PMU Training error at Socket 0 Channel 2 DIMM 0 & DIMM 1", while if I had only memory in A2, I saw "Memory PMU Training error at Socket 0 Channel 2 DIMM 1".
I was able to tie them to A1 and A2 because after post, when I got into BIOS config, both of those channels showed 0 MB DDR4 installed, while the others would all show 32768 MB where I had modules. (BIOS main screen under Total Memory for my ASRock board)
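For anyone comparing error messages across boards, here is a small sketch that pulls the socket, channel, and DIMM numbers out of the POST message text. The regular expression and the slot map are assumptions made for this example: the only mapping actually observed in this thread is Channel 2 to slots A1/A2 on the ASRock TRX40 Creator, and DIMM 0 = A1 is inferred from A2 alone reporting DIMM 1. Other boards may number channels differently.

```python
import re

# Observed on one ASRock TRX40 Creator only; DIMM 0 -> A1 is an inference.
OBSERVED_SLOT_MAP = {(2, 0): "A1", (2, 1): "A2"}

# Tolerates optional commas, as in "Socket 0, Channel 3, Dimm 1".
PATTERN = re.compile(r"Socket\s+(\d+),?\s+Channel\s+(\d+),?\s+(.*)", re.IGNORECASE)

def parse_pmu_error(message):
    """Return socket, channel, DIMM numbers and (if known) slot names."""
    match = PATTERN.search(message)
    if match is None:
        return None
    socket, channel = int(match.group(1)), int(match.group(2))
    dimms = [int(d) for d in re.findall(r"DIMM\s+(\d+)", match.group(3), re.IGNORECASE)]
    slots = [OBSERVED_SLOT_MAP.get((channel, d), "unknown") for d in dimms]
    return {"socket": socket, "channel": channel, "dimms": dimms, "slots": slots}

print(parse_pmu_error("Memory PMU Training error at Socket 0 Channel 2 DIMM 0 & DIMM 1"))
print(parse_pmu_error("PMU Memory Training Error Socket 0, Channel 3, Dimm 1"))
```

Any channel/DIMM pair not in the map comes back as "unknown", so the physical slot still has to be confirmed in the BIOS memory screen as described above.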
riveryeti, thanks much. Not much correlation there. I was thinking Channel 3 would be C. But if Channel 2 is A, then what the heck? I will look in the BIOS the next time. I'm doing some cleanup work before I start serious debug. I have a Gigabyte TRX40 DESIGNARE and four sticks of 8GB G.Skill B-Dies.
Do you mean the link I supplied did not work? It works fine for me then and now. If it fails for you then please search for "AMD online support". I think it is important for you to talk to AMD. Thanks and enjoy, John.
riveryeti, I have seen this now on three different MBs and three different processors but all TR. RMAing the MB did not help me. I am going to open a support ticket with AMD. When I get running again, I will try a very slight increase in SOC voltage to see if that helps. I will increase by 10 or 20 millivolts. I just scanned my screenshots and found a 2990WX at 1.0 volts and a 3970X at 1.1 volts. Some other processors were as low as 0.825 volts. If you are interested in trying this, then check your current value (Ryzen Master) and up it a little. We may need to do this in BIOS to get through boot. If you do try, please let me know the results and I will do the same. Thanks and enjoy, John.
Good idea to open an ASRock Tech Support ticket. That way they can decide if you need to RMA your motherboard to be checked and tested for being defective.
You can check the CPU by installing it on another compatible motherboard and see if the same thing occurs. If it doesn't then it is a good indication the motherboard you have is defective.
If you can't test your CPU on another compatible motherboard, then I suggest you open an Online AMD Warranty Request Ticket. That way you can explain the symptoms you are having and AMD can determine if the CPU needs to be RMAed. They may suggest you run certain tests before determining if you need to RMA the CPU.
You can open an Online AMD Warranty Request from here: https://www.amd.com/en/support/kb/warranty-information/rma-form
Got the response below from AMD this AM... then called ASRock to see if they have a loaner mobo I could try and they suggested re-seating the CPU after checking for bent pins (which I did before install and installed very carefully), and they said if that still didn't work they'd replace the mobo, but if it still doesn't work, then it's the CPU...
I'm in this weird place where I can't tell if it's the mobo or CPU unless I have another of A or B, which I don't <sigh>.
------------AMD Customer Support email:
Thank you for the email
Seeing the issue and troubleshooting performed, it indicates an issue with Memory controller on the CPU. I request you to try the CPU on a different computer and check the status.
If the issue remains same, please claim warranty for the CPU using below link
<snip>
-------------------------------end of email-----------------
riveryeti, it will be a couple days till I can test the SOC voltage boost. Are you willing to try before you dismantle your system? Remember, just 10 or 20 millivolts. Thanks and enjoy, John.
riveryeti, I have received a couple of responses from AMD (ticket to Support). The first simply asked if I had tested my 3970X in another MB (have not and not really feasible). I asked for a response to my question about upping SOC voltage. They responded that this was overclocking and would avoid (void?) my warranty. I responded by asking for an answer to my last question: "Are you seeing lots of these problems?" Hope to get an answer. Certainly will not change my SOC voltage. Enjoy, John.
Are you running the latest BIOS for your board? Many boards have gotten much better with updates. For instance I could not run 4 sticks in my B450 board and finally a BIOS update fixed this. Sometimes the latest BIOS can be the issue too and regressing one might help. If you regress, however, make sure that BIOS supports the CPU you have. Worth trying if you have not. Saves you from swapping out CPUs with an RMA if it helps.
My apologies if you already did this, I didn't see it mentioned above if you did.
Thanks, pokester. I can only speak for me and my system ran for several weeks with no issues both SPD and XMP (3200MHz). Then this hit the fan. My 2990WX ran for months and even ran for two weeks after an MB RMA before showing this error. Since 39xxXs are new with a new Chip Set and socket, there is no back BIOS as far as I know. Gigabyte did release a new BIOS that truly killed my system so I went back but will go forward to the latest soon. Thanks and enjoy, John.
Sorry to hear that. That really sucks. Being an early adopter of new stuff runs those risks. Not that it is a reasonable risk and certainly you should not have to suffer because of it. If they can't offer help request they send you maybe a different board. I hate to pick on Gigabyte but I had my fill of their boards not working in recent years. Issues I never have with Asus and MSi.
Anyway I hope you get it resolved without too much more aggravation and spending more money. This stuff should work right when released IMHO.
I mostly agree, pokester. This is my first experience with Gigabyte. MSI, Asrock and a few older ones have all angered me. I have done 3 MB replacements on this system and will probably not do another one. Almost all my past RMAs have been MBs and I am tired. I have opened a support ticket with AMD. I'll see what happens. Thanks and enjoy, John.
No doubt you get lemons and at every generation who makes the good boards changes. That is why I usually don't adopt quickly and choose my purchase by reading reviews and picking the parts that say they work best with what I am buying. However at this point bios fixes should have things working unless they made a bad board to begin with or you just have a lemon. I feel your pain.
I have stuck with MSI as they at least have some staff on their forum who can get your RMA approved if needed.
I have looked at TR memory for quite a while as I have also had memory issues.
I am aware that 4 layer motherboards perform poorly compared to more expensive 6 layer boards. Probably the reason my X570 was more expensive than the X470.
Hi, did you finally solve your problem? I have exactly the same one. Mobo / CPU and memory slots A1 & A2 failing PMU training. I have been two times to the local dealer from whom I purchased and no luck so far getting the MB or CPU exchanged. I actually left them at their lab today to see if they will honor the warranty.
My memory is fine; I tested it with MemTest86 with zero errors. And those modules were installed on a first gen TR + Asus board, also working fine.
Dealer said it is due to dirt/thermal paste on the CPU "pins"; however, he cleaned it and it worked with one memory module. I got it back into the case, installed all the memory again and it worked for a few days but started to fail again, always on the same A1 and A2 memory slots. I purchased a second MB from the Gigabyte Aorus brand online and am waiting for delivery to rule out CPU vs MB defect. After that I will have to fight with the local dealer to get a refund on the defective component, I think.
I have been assembling my own PCs since year 1995. And I believe this is the first time I see something like this happen.
Have basically the exact same issues.
I've been having major issues that seem to be related to the A1 and A2 slots on my TRX40 Taichi with a AMD 3970x and 256G kit of CORSAIR Vengeance RGB Pro CMW256GX4M8E3200C16. Microsoft Windows 10 (10.0) Pro for Workstations 64-bit (Build 18363)
Issues started after running hard for 2 weeks running folding@home with 2 CPU clients (32 thread and 24 thread) and a GPU client using an AMD RX 5700 XT, slightly overclocked. No overclocking on the CPU or memory, and thermals were all well handled by case cooling and the AIO. I first noticed the issue with F@H when it crashed overnight, and after that it would BSOD after trying to start folding again. This was a day after a Microsoft update and also installing node js and vuejs development packages. I originally suspected software or driver conflicts, so I made sure to update the AMD chipset drivers to amd_software_2.04.04.111. This didn't help. I then also discovered running Cinebench R20 would cause a BSOD, as would CPU-Z bench or stress. The BSODs were a variety of messages: MEMORY_MANAGEMENT, IRQL_NOT_LESS_OR_EQUAL, PAGE_FAULT_IN_NONPAGED_AREA etc., but all pointed to a crash address of ntoskrnl.exe+1c2390 when the minidumps were viewed with BlueScreenView.
I started to suspect memory when I noticed I was running 32G low. Going into BIOS I found that the A2 slot was not showing up. I also was having issues getting in and out of BIOS, as the USB wireless keyboard was not working most of the time when the system would come back up to the POST screen. I had to clear the CMOS and do a full power down and back up a couple of times to get back into BIOS and set things up again. Sometimes all 8 slots would show fine and I could get it going again and would get back into Windows. At one of these points I found, going into the iCue software, that there was a firmware update for the RAM and I ran that. After that I had an XMP profile that I could choose from in the BIOS that I don't think had been there before, and setting that initially seemed to help. But it would still crash and I would get back into BIOS and see empty slots A1, A2.
I then created a USB boot drive for MemTest86 and started running tests with isolated RAM. I tested only B2,A2 without issue. B1,A1 no issue.
All memory was testing fine and I booted up into Windows with B2,A2, but I think I had a crash and started to test the memory again. I spent almost 2 days running memory tests and found no memory errors. My last test was back to the full 8 chips loaded and all seemed fine with the memory test. I then tried to boot up into Windows, but had issues with the BIOS freezing up on me a few times and also switching from English into kanji and freezing. I then flashed it to BIOS 1.6 and brought it back up and reconfigured. RAID options were now showing again (they had been missing in 1.1). I have 2 RAIDed NVMe 1T drives and 8 SATA drives for 22T RAID10 using the AMD RAID drivers. It seems to take a couple of cycles of going in and out of the BIOS to get the RAID to hook back up so that Windows Boot Manager can boot. But along with that I started seeing a warning flash "Memory PMU Training Error at Socket 0 Channel 2 Dimm 0", and if I would go into the BIOS both A1 and A2 would not show up. After removing both from the system, so that I have 2 banks of tri-channel memory on B2,C2,D2 and B1,C1,D1, I seem to be error free. All benchmarks and stress tests are running without issue and I'm folding with the CPU and GPU all at 90% as I write this.
I should also mention I had uninstalled F@H, vuejs and other recent installs to no avail. I have not reinstalled node or vue yet.
I did another test today where I swapped out the memory that has been running fine in the B1 and B2 slots for the memory I removed from the A1, A2 slots, and those worked just fine and I've been folding on them all day. I also then tried putting the memory from the B slots back into the A slots to try to get back to 256 quad memory, but I could not even get to the POST screen on 2 restarts. Both times I got a 0d error on the board. I did a full power down on the PSU and tried to boot, and this time got to the ASRock screen, but it would not respond to the keyboard and did not try to load Windows Boot Manager. Powered down, removed the 2 chips in the A slots, and it started up and booted right up without issue. So what I really need to know is: is this a problem with the board, with the CPU, or still some driver/BIOS thing? How to test?
Did yet another test where I tried D1,D2,C1,C2,A1,A2. This booted up, but I got a BSOD after just a few mins with no real load.
So as described in this thread by others, all of my memory chips work fine so long as they are not in the A bank.
One thing that the talk about CPU pins makes me consider is that the Threadripper chips do not have pins. They have little contact spots and there are spring pins on the socket. The CPU and the cooler get torqued down. I seem to recall watching an LTT video where he had a problem getting a system with dual Xeon CPUs to POST, and the fix came down to getting the right mounting pressure. I wonder, with running at the hotter range for a couple of weeks nonstop, if my cooler/CPU torque pressure has changed and is giving me this odd issue. I'm going to try backing off the torque and then retorquing with the CPU tool and test again when I get to my next shutdown opportunity. If that doesn't work, then pulling it out, cleaning and remounting. I wonder if it would be so simple. I know when I do hot laps on the track I need to retorque my wheel lugs. Thermal expansion is a real thing.
That's all I've got at this point aside from bad MOBO or CPU and like every one else, I'll have to get either another one of each to test things out. It would be great if there were some test kit we could boot off a USB drive that would be able to give a direct answer.
John Glassman
MantisMan LLC
I ended up replacing the Asrock TRX40 Creator with a Gigabyte TRX40 Aorus Master at my expense (no warranty). And my 3960x has been rock solid for about a month now.
This Gigabyte board is a lot better than the Asrock. Better components, better layout, better connectors. Not a single memory error.
I have gone through the same problem with lower end boards. Thanks for posting your results.
For the most part I've been extremely pleased with the Taichi. The BIOS could be a bit easier to understand, but for the most part it's clear. I think it's targeted at people who have a lot more overclocking experience than I have. As far as components go, it's a step up from the Creator and should be able to take the heat. I'm seeing 90c avg now on my CCD packs but I was seeing under 80c before, which makes me think I may also have an issue with my NZXT Kraken X62 that I didn't mention above. It doesn't seem to be adjusting fan or pump rate, and the Cam software loses connection shortly after boot, leaving it running at whatever the rate was at that point. I'm still keeping in range, but not boosting over 4k constantly like I was before. In my last C1,C2,A1,A2,B1,B2 test that gave me a BSOD, the temp never got over 50c, so that's not what is causing the memory issues. Since the mobo is kicking that 0d error without even getting to the POST screen, I suspect the issue is with the board/CPU, so I still have to try out my mount pressure idea. I'll post back with those findings.
Worked for a moment.... I pulled the Kraken X62 and loosened the CPU mount torques. Cleaned, re-torqued, new thermal paste on the AIO plate only, remounted the AIO and populated the A slots. System came up and booted into Windows. Showed the full 256. Ran CPU-Z stress for several mins. Did a few Cinebench runs, started folding and let it run for an hr. All seemed right. All except my NZXT Cam software, which has been giving me an issue where it is losing the connection to monitor the liquid temp, fans and pump rpm and stops dynamically adjusting based on the CPU temperature. This is evidently causing me issues where my avg CPU temps push the upper range of 95c and spike higher. I've been testing this out and find that if I remove the USB connection from the pump head, the fans run on high and the pump must still run, as my CPU avg is back to 83c after 12hr of folding.
But back to running with full RAM... So while all seemed to work fine, I did notice some oddities. In CPU-Z, in the SPD tab, memory slots normally report as slot # 1-8. What I saw here was that slots 1 - 2 (which are normally empty when I do not have RAM in the A slots) were showing the normal settings for the RAM, yet slots 3 - 8 were blank. I closed CPU-Z and opened it again. This time I had info in slots 1 - 4 and then the numbering skipped to 11-14. DIMM reporting in HWiNFO64 showed the correct mobo locations for all RAM. So I figured it had something to do with Windows caching info from the BIOS and reasoned a reboot would clear that up. On reboot, I decided to go into the BIOS and look at the RAM report. That looked right, and while I was in there I enabled the PS/2 wake up and also changed the hardware monitor CPU2 fan from fan mode to pump mode. Saved settings and tried to reboot. This returned me straight back to the 0d BIOS error where it would not even get to the POST screen. I power cycled and it got past POST and tried to go into Windows but immediately hit a BSOD. Powered down, retried, same result.
Both were IRQL_NOT_LESS_OR_EQUAL errors. Powered down again. Removed the A DIMMs. Rebooted, got back into the BIOS and reverted my changes, rebooted, hit the Windows repair, restarted and all was good again. I sort of doubt my BIOS changes were the issue but I haven't confirmed that yet.
Swapped out 2 8G memory for the 2 32G in another system and they have been running fine for 14 hr of folding but at a different timing setting. The MSI gaming pro board doesn't seem to recognize the XMP profile and I had to set the closest 3200 settings I could get. They are running at Dual, DRAM Freq 1599.1 Mhz, FSB:DRAM 1:16, CL 18 (not 16), RCD 20, RP 20, tRAS 38, tRC 75 , CR 1T. So the main difference is the CAS Latency is a bit slower. I had tried to adjust that to 16 in the bios, but it didn't take I guess.
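To put the CL 18 vs CL 16 difference in absolute terms, CAS latency in nanoseconds is the cycle count divided by the DRAM clock. A quick sketch using the roughly 1600 MHz clock reported above; the function name is made up for this example.

```python
def cas_latency_ns(cl_cycles, dram_clock_mhz):
    """Convert CAS latency from clock cycles to nanoseconds.

    dram_clock_mhz is the real DRAM clock (half the DDR data rate),
    e.g. about 1600 MHz for DDR4-3200.
    """
    # One cycle lasts 1000 / MHz nanoseconds.
    return cl_cycles * 1000 / dram_clock_mhz

print(cas_latency_ns(18, 1600))  # CL 18 at a 1600 MHz DRAM clock
print(cas_latency_ns(16, 1600))  # CL 16 at a 1600 MHz DRAM clock
```

At this clock the two settings differ by 1.25 ns of first-word latency.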
But this all gets me thinking back to when I first set up the Threadripper. I don't think I had XMP profiles to select from, and originally I was using the BIOS default, which may have been 2132. The DIMMs report a max bandwidth of DDR-2132 (1066 MHz), but are picked and rated for 3200 MHz by Corsair. During my original benchmarking I went in and started picking higher memory settings and letting the BIOS auto fill timings. I had crashed at higher levels like 3600 and I forget where I got things stable. It may have been 3200, but then again it may have been a bit lower. I do recall that where I landed greatly improved all my benchmarks, hitting well over 17k on Cinebench R20. I'm only able to get to mid 16k with the 3 way memory. Interestingly, while I did have the 256 running in quad briefly, I wasn't breaking into 16k, just upper 15k, so it was slower and my CPU cores that normally boost up above 4K were stuck at 3.7.
So, could I still be just dealing with a memory timing/voltage issue here?
mantisman13, I think I have the same problem as riveryeti, the OP, and maybe you. Based on his advice, I removed the RAM stick from the offending channel and have had NO problems since. It sounds like you have identified the offending channel, so I advise removing the sticks from this channel and see if it will run.
What version of CPU-Z are you using? I have seen the strange results you describe on older version that do not support our MB/CPU. Latest right now it is 1.92.0. iCue and the other software you are using are not to be trusted. I suggest to uninstall all those applications and run CPU-Z, AIDA64 (paid, but trial available) and Ryzen Master (RM). I have no other applications to trust. We still do not know what riveryeti did, but I suspect he RMAed his processor. I have not, simply because I am tired of changing things in my system. Please try a three channel setup in your system, and let us hear how it goes. I would suggest a Clear CMOS, RAID setup, then see how it goes. BTW I get a Cinebench R20 over 17,000 (all cores), with my three channel system and no other tuning or OCing. I have set NUMA mode active (see here) which seems to help my memory performance. Enjoy, John.
PS: This is a user forum and seldom does an AMD employee read/post here - support.
I have seen this memory issues on another forum with the TRX40 boards from Asrock. I believe they might have a design problem or a nasty BIOS bug.
I remember from my tests (which I did every day for about a month before giving up and purchasing a Gigabyte TRX40 Aorus Master) that the board worked randomly; sometimes rebooting into BIOS and reloading defaults worked fine for the day. But the next day the "BIOS charge capacitor" was depleted and the issue reappeared.
I also tried a lot of DIMM combinations. And they seemed to work for a period and then started failing again. When the quarantine was lifted in my sector I went to the dealer and he said that I must conform to the QVL. Well, I explained to him that I have sets from Crucial, HyperX and Corsair (I bought memory two times, also suspecting there was a memory defect) and all had the same problem. So, he kept the mobo and CPU for three days of testing and confirmed with his own lab memory that it was the motherboard after all. He tried my CPU with a new Gigabyte TRX40 Aorus Pro Wifi that he had in stock and the issues went away. He finally ended up exchanging the Asrock for the Pro Wifi. Since I had already purchased a new, more advanced model board online, the exchanged one ended up sitting in my home storage and I am planning to sell it online.
The lesson I learned from this issue is that no matter what combination or test you perform, the board fails randomly at different times with sometimes different BSOD error messages.
I also believe that Asrock knows about this problem because my board got a little physical damage due to all the assembly/disassembly tests that I did, and they agreed to RMA it anyway to the local dealer after he told them what the issue was (he also sent them pictures of the damage). My local dealer was very open and understood that the damage was a consequence of the board randomly failing and I thank him for that.
So, if you have the chance, just return your Asrock and buy a board from another brand (Asus, MSI or Gigabyte). You will save a lot of time and headaches.
drdocumentum, I do not know what MBs or processors you are/have been running, but in my case this is a Processor problem. I have seen it on MSI (2990WX) and Gigabyte (3970X) but not on my ASRock (1950X). I strongly believe, with the OP, this is a memory controller problem. I suspect if you depopulate the offending channel, your system will run OK, just with a slower memory bandwidth. Enjoy, John.
So you haven't read my posts. I have a 3960x. The CPU is working fine on the Gigabyte TRX40 Pro Wifi and Aorus Master (I have both boards) but fails on the Asrock TRX40. Dealer confirmed it was a board problem as he tested both in their lab.
Thanks, drdocumentum. I hope your system continues to run well. My current MB is Gigabyte TRX40 DESIGNARE and runs fine on triple channel setup with NUMA enabled. Enjoy, John.
Hi misterj (also john). I'm guessing you're using email. There are a bunch of posts you may have missed. You are right, I'm completely stable using 3 way memory, so long as I stay away from the A channel (see above "Did yet another test where I tried D1,D2,C1,C2,A1,A2. This booted up, but I got a BSOD after just a few mins with no real load."). This does point to a board issue, but as I was running fine for 6 months and I'm having some issues with my AIO software, I know things got a bit heated towards the max. This may have damaged something, or in the software updates I did, I may have gotten the BIOS into an unstable setup that I just need to find my way back from.
Thanks. The primary reason I bought a threadripper was the 8 slot memory support. I use my setup as a virtual machine server for my work's development/consulting projects and I need to use about 8 big VMs so I need lots of memory and all the channels working. I don't really care much about having a lot of cores as my workload is memory limited not CPU's so I purchased the 3960x instead of the 3970x or 3990x CPUs. Regarding the Designare, I read that you need to put the thunderbolt card on slot 4 or the system won't work fine giving memory errors. I also read a user that reported that XMP profiles were not working when the card was installed. So maybe that is your problem.
Interesting idea. The only PCIe slot I'm using is the top one for the GPU. I used the 2 onboard M.2 sockets for my system mirror. But perhaps it's more of an issue when using RAID vs SATA and memory. At any rate, the "only" software that was updated before everything went SNAFU was the NZXT Cam AIO software, plus installing nodejs, vuejs and bootstrap vuejs; Docker for Windows had an update and of course Windows itself had a big update... After I started having issues I updated the AMD chipset, which only updated the I2C Controller. The iCue firmware updates to the memory came after the issues began but before the system was so unstable it wouldn't run at all. That also didn't happen until I started using the XMP profile. I think the XMP profile showed up after the memory firmware but before I updated the BIOS from 1.1 to 1.6 and that actually makes sense. Now, you mentioned setting your voltage to 1.35 as your memory is specified for. Mine is the same at 3200 and it looks like the board is trying to overvolt it a tad to 3.65. I've read that adding an extra .1v can help with memory timing. But 1.35 is the XMP profile that the manufacturer is saying they should be stable at. The JEDEC timings all use 1.2 V, so the .15 up to 1.35 is already an overclocked voltage. So I think that is my next move.
Like you, I want all the RAM and more. I'm not a gamer, so I'm less concerned with memory latency. I do a lot of database development and web services and am getting into building for Docker deployments, so I need both memory and cores. I may talk myself into springing for a 3990x, and I plan to build one more box, but it's been a while since a good payday, so I need to try to keep thoughts like that under control.
I have my graphics card in slot 3 because I can't fit it in the first slot, as I use a Cooler Master Wraith Ripper, which is a very big air cooler. The Gigabyte BIOS has a setting to select the primary graphics card slot, so it is not really important.
BTW: I am using an air cooler because my local dealer told me he has received many reports of pumps exploding inside the case due to Threadripper heat stress (I had an AIO water cooler installed first). Since people use TR machines for work, they keep them on 24x7, and water pumps are not very reliable for that kind of continuous workload.
A comment regarding NUMA. On a first- or second-gen TR CPU it does help, as the OS can see which cores are near to RAM and which are far. But on 3rd gen, NUMA is not the right choice. Current Threadrippers have a separate memory controller die that feeds all the CCXs via Infinity Fabric, and there are no NUMA nodes anymore. The important parameter for memory performance on a 3rd-gen TR is keeping the Infinity Fabric clock in sync with the memory clock. For 3200 MHz modules, that means 1600 MHz for both the IF and memory clocks. In my tests on the ASRock I could see that the BIOS on that board sometimes configured the IF as low as 800 MHz while the memory was at 1600 MHz. That setting can be observed in CPU-Z as NB Frequency: (this is my current setup; it shows a difference of a few MHz, but that is because CPU-Z reads NB and DRAM one after the other, and the value is not a perfect 1600 MHz as the MB clock generator isn't always fixed)
So are you suggesting I should set my infinity fabric timing up to 1600?
Yes, if you want to get the best performance memory-wise. AMD states that the IF can run in sync up to 1800 MHz, for 3600 MHz RAM. After that, you can go higher on the RAM but not on the Infinity Fabric, so you will be running them in an unlinked config. That kind of configuration is not recommended because it introduces latency, as the IF needs to wait for synced clocks to talk to the memory bus.
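To put numbers on the sync rule above, here is a rough Python sketch. It assumes the memory clock is half the DDR rating and takes the 1800 MHz ceiling from the figure quoted in this post rather than from any measurement. The function names and the 5 MHz tolerance are made up for illustration.

```python
# Sketch: can a given DDR4 rating run with the Infinity Fabric clock (FCLK) in sync?
# Assumption: the 1800 MHz ceiling is the figure quoted in the post, not a measured limit.

MAX_SYNCED_FCLK_MHZ = 1800.0


def memory_clock_mhz(ddr_rating: float) -> float:
    """DDR transfers data twice per clock, so a 3200 rating means a 1600 MHz clock."""
    return ddr_rating / 2


def fclk_plan(ddr_rating: float) -> dict:
    """Return the FCLK to aim for and whether it can stay in sync with memory."""
    mclk = memory_clock_mhz(ddr_rating)
    synced = mclk <= MAX_SYNCED_FCLK_MHZ
    return {
        "ddr_rating": ddr_rating,
        "memory_clock_mhz": mclk,
        # Above the ceiling the fabric stays at its cap and runs unlinked.
        "target_fclk_mhz": mclk if synced else MAX_SYNCED_FCLK_MHZ,
        "synced": synced,
    }


def is_in_sync(nb_mhz: float, dram_mhz: float, tolerance_mhz: float = 5.0) -> bool:
    """Compare CPU-Z style readings, allowing a few MHz of reporting jitter.

    The 5 MHz tolerance is an arbitrary illustrative value.
    """
    return abs(nb_mhz - dram_mhz) <= tolerance_mhz


if __name__ == "__main__":
    for rating in (2133, 3000, 3200, 3600, 4000):
        print(fclk_plan(rating))
    # The mismatch described earlier: fabric at 800 MHz, memory at 1600 MHz.
    print(is_in_sync(800, 1600))
```

If CPU-Z shows NB Frequency and DRAM Frequency within a few MHz of each other, is_in_sync treats them as linked; the 800 vs 1600 case is flagged as not in sync.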
One more thing I forgot to mention: memory voltage. My memory requires 1.35 V to reach the rated 3200 MHz. After setting the correct value and rebooting on the ASRock, the system monitor section of the BIOS reported the voltage as 1.376 V, and the value was not stable; sometimes it changed to 1.365 V. I tried lowering the setting so the reading would match 1.35 V, and it was impossible. I showed that to the dealer as an argument against the board's quality, and he agreed that it was not normal after comparing it with the Gigabyte, where on a 1.35 V setting the monitor reported 1.356 V. I also found pictures online of the same issue on the ASRock. Because of that, I suspect that the memory voltage regulation subsystem on the ASRock might be flawed by design.
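For anyone who wants to compare their own readings against these, here is a small Python sketch that expresses each monitor reading as an offset from the 1.35 V setting, in millivolts and in percent. The readings are the ones quoted in this post; the helper name and the dictionary labels are made up for illustration.

```python
# Sketch: how far is each reported DRAM voltage from the configured 1.35 V?
# The readings below are the ones quoted in the post; labels are illustrative.

SETPOINT_V = 1.35


def deviation(setpoint_v: float, measured_v: float) -> tuple:
    """Return (offset in mV, offset in percent of the setpoint)."""
    delta_v = measured_v - setpoint_v
    return round(delta_v * 1000, 1), round(delta_v / setpoint_v * 100, 2)


readings_v = {
    "ASRock, higher reading": 1.376,
    "ASRock, lower reading": 1.365,
    "Gigabyte": 1.356,
}

if __name__ == "__main__":
    for label, volts in readings_v.items():
        offset_mv, offset_pct = deviation(SETPOINT_V, volts)
        print(f"{label}: {volts:.3f} V is {offset_mv:+.1f} mV ({offset_pct:+.2f}%) vs setpoint")
```

Swapping in your own BIOS or monitoring readings gives a quick like-for-like comparison between boards.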
Good idea, I'll get back into the BIOS tonight and see what the voltage is set to. HWiNFO64 is reporting my DRAM voltage as 1.366 V, just as you noted.
mantisman13, what??? "I'm guess your using email." I post just like others do, just have not posted here in a long while, but when I saw your posts thought I would comment.
HWiNFO and many others give poor results. The ONLY application to use is Ryzen Master (RM). Please post a screenshot - simply drag-n-drop the image into your post. What score do you get in CB R20 for all cores? I think a memory voltage that is 100 mV high is a clear indication of MB quality, but I seriously doubt that it will make the memory less stable. In fact, maybe more stable. I have the same memory in all my systems - G.SKILL Flare X F4-3200C14Q-32GFX. These are Samsung B-dies and all run on 1.35 volts.
My RM:
Have either of you looked into Event Viewer for errors? If not, please do. If you find any please expand Details in the lower panel and paste the data here. Thanks and enjoy, John.
Just to clarify regarding excess voltage. In my case the ASRock was at 1.376 V, which is about 1.9% (26 mV) over the configured 1.35 V. I believe that is too much for an acceptable error margin. The memory was actually getting very hot, so I believe it is an important defect for that motherboard, as it will compromise memory life in the medium or long term.