PC Processors

reeflex
Adept I

Trying to run 8x 128gb with Threadripper PRO 3995WX

Hi there,

 

Are there any adjustments needed to run 1TB of RDIMMS with AMD Ryzen Threadripper PRO 3995WX? Any adjustment to VSOC values, etc.?

 

Motherboard: Supermicro M12SWA-TF

Chassis: 747BTS-R2K20BP-OTO-11

RAM: Micron MTA144ASQ16G72PSZ-3S2E3

Front fans: 3x FAN-0138L4 (7500 RPM); 1x FAN-0114L4 (5000 RPM)

Rear fans: 2x FAN-0082L4 Rev.B; Ordered 1x MCP-320-00046-0N-KIT so there will be 3 rear exhaust fans

GPU: Zotac RTX 3070

 

I just installed 8x 128GB RAM, and my system restarts after a couple of minutes of running TestMem5 with the PCBDestroyer config.

 

Not sure what else might be happening, but VRMABCD and VRMEFGH are overheating:

unnamed.png

 

Thanks for any help

0 Likes
52 Replies
misterj
Big Boss

reeflex, I do not think so, especially to SOC. Micron does not seem to know about mta144asq16g72psz-3s2e3. Who did you buy the sticks from, and what is the make and model? I would like to see a screenshot of Ryzen Master (RM) running the workload that led to the restart. Also look in the Event Viewer (Windows?) and post the Details tab of a few Critical errors. Post all your components.

I did some searching on the Internet and am curious how many places you posted this problem and whether you have any useful answers.

John.

Hey John,

 

Thanks for the reply. Sorry for delay. I thought I'd get an email notification if anybody replied here.

 

I got it from eBay https://www.ebay.com/itm/387071754998 from China... I didn't realize how little information there was on it until it arrived. It's not on the QVL list, which I realize isn't good. Here is the CPU-Z RAM info:

 

reeflex_1-1725060523036.png

 

I tried to return it, but they aren't giving me a return address in English (ghosting me), so I'm hoping I can make it work. I'm going to try these heatsinks https://www.amazon.com/gp/product/B0B7F3HZ1M and thermal paste https://www.amazon.com/dp/B0087X728K .

 

I just tried installing Ryzen Master and it doesn't seem like I can:

reeflex_2-1725060986460.png

 

Here is ZenTimings if that helps:

reeflex_3-1725061093062.png

 

It's been a few days since I did the memtest, but I think these are the errors; it is Windows:

EventID 46 (system):

reeflex_4-1725061520860.png 

 

(eventdata):

reeflex_5-1725061544070.png

reeflex_6-1725061634129.png

 

reeflex_7-1725061650642.png

 

__________________________________________________

Event 1796 (system):

reeflex_8-1725061691214.png

reeflex_9-1725061807480.png

 

 

I have these components:

Motherboard: Supermicro M12SWA-TF

Chassis: 747BTS-R2K20BP-OTO-11

RAM: Micron MTA144ASQ16G72PSZ-3S2E3

Front fans: 3x FAN-0138L4 (7500 RPM); 1x FAN-0114L4 (5000 RPM)

Rear fans: 2x FAN-0082L4 Rev.B; Ordered 1x MCP-320-00046-0N-KIT so there will be 3 rear exhaust fans

GPU: Zotac RTX 3070

PSU: 2x PWS-2K20A-1R Rev 2.4

reeflex_10-1725061943558.png

 

Yea I did post to overclock.net and servethehome to try to get some help as well. It's been helpful with the conclusion leaning towards heatsinks on the VRM chips

 

A guy on this thread seemed to have a similar vrm overheating issue seemingly solved by heatsink and 40mm fan so I'm hopeful this can work:

https://forums.servethehome.com/index.php?threads/overheating-problems-with-2x-epyc-7742-in-define-7... 

 

Please let me know if I missed anything that can help here. I'm learning as I go.

I appreciate the help.

 

0 Likes

Okay you seemed to be in very good hands with MisterJ.

 

He will solve your problem easily.

 

Take care.

 

(- :

0 Likes

Awesome to hear, thank you!

0 Likes

Thanks, reeflex. Ryzen Master (RM) should definitely support your processor. Where did you get RM? DL it here. Why do you think your VRM is overheating? I do not recognize the utility you are using and therefore do not trust it. I would suggest using AIDA64 Extreme. It is a paid utility but has a trial period. Your utility says your memory is too hot, not the VRM. Please post a clear picture of your MB, especially the VRM area. I will look at the other stuff later. John.

Hey John,

 

Much appreciate the help.

 

I got RM from the same source you linked to. I get this error when trying to install the 3000 series and newer version. 

reeflex_0-1725144387768.png

 

I got the other error when trying to run the 2000 series and older. It was more so a test since I couldn't even install the 3000 series and newer version, but I could at least install the 2000 series and older version.

 

I thought the VRMs were overheating because the VRMEFGH reached 102C here and the VRMABCD was at 86C. This would be the RAM getting too hot though?

reeflex_1-1725144561567.png

 

This screenshot is from HWiNFO, but I'm down to try any other tools that can help here. I just installed AIDA64 Extreme. Is there anything in particular you want me to screenshot from there?

 

Here are more MB photos. Is the VRM area around the CPU? Please let me know if you need me to take more photos to zoom in on anything.

 

IMG_9897.jpg

IMG_9898.jpg

IMG_9899.jpg

IMG_9903.jpg

 

Thanks

0 Likes

Thanks, reeflex. I was looking at CPU_VRM at 46C and thought that VRMABCD & VRMEFGH were memory sockets. The MB manual does not explain. Please look in the HWiNFO manual for the definition of these. Looks like your memory does not have heat sinks. Is the little (40 mm?) fan I see over the VRM, the chipset, or what? Where do you plan on placing the VRM heat sinks? Reading about the BIOS gave me déjà vu. Have not seen the old BIOS in years. Be sure to ask your builder about blocking RM and if you can remove their software. I bought a laptop retail a few years ago and the first thing I did was format the C: disk and install a purchased copy of W10. I have a 3970X and am running W11. You should be able to also with a TPM installed. At least for now, remove the Micron memory and use some other DDR4. Good photos. Thanks, John.

0 Likes

Hey John,

Thanks for the reply. supermicro support pointed out these are the VRM locations:

VRM locations.png

 

So I'm planning to put heatsinks here to start with:

VRMEFGH:

VRMEFGH_heatsink.jpg

 

and VRMABCD:

VRMABCD.jpg

 

The 40mm fan hasn't arrived yet, nor have the heatsinks. But I was planning on putting the fan somewhere over the VRMEFGH heatsinks if it's needed; could be a tough install.

 

I did send Supermicro support a message asking about installing RM, but they might not respond until Tuesday because of Labor Day.

 


I have a 3970X and am running W11. You should be able to also with a TPM installed. 


What is TPM here?

 

You think there is a chance these VRMs can be cooled down enough? Or should I keep after trying to return these? I'm concerned that since I asked for a return address in english and got no response, that if I am able to return it maybe they would say they never got the return and I wouldn't get a refund.

 

 

0 Likes

Thanks, reeflex. VRM is Voltage Regulator Module, which creates the core voltage and other lower voltages from 12 Volts. The temperature looks OK unless I am not reading HWiNFO correctly. I suggest you buy a couple of sticks of memory from the MB tested list and remove the 128GBs. TPM is Trusted Platform Module; your MB has a header for one (see manual). It is required to install W11. There is a fan in several of your pictures that looks like a 40mm. I think VRMABCD and VRMEFGH are the memory sockets for the 8 (ABCD, EFGH) memory sticks. This is why I am asking, "How are the core voltage and memory voltages created?" You will need to ask the builder. Please post the link to the advice about cooling the VRM. John.

0 Likes

Thanks John. 

 

Oh wow I'm supposed to have a TPM to run windows 11? I'm currently running windows 11 without one, but I need to buy a TPM chip and put it into the TPM Header? 

Ah, yea there is a little fan. I'm not sure what it is sitting on top of. 

That's interesting I'll have to check with support 'How is the core voltage and memory voltages created?' and let you know what they say.

 

I've been discussing trying to cool down the VRM here:

https://www.overclock.net/threads/official-amd-ryzen-ddr4-24-7-memory-stability-thread.1628751/page-...

 

I came across this thread and advice that seemed relatable. This system was overheating with large sticks (granted it had two cpus):

https://forums.servethehome.com/index.php?threads/overheating-problems-with-2x-epyc-7742-in-define-7... 

Where this solution seemed to help:

reeflex_0-1725161699935.png

 

0 Likes

reeflex, we are getting away from the original problem, and some of these questions require advice from your builder and do not really belong here. If you are running W11 without a TPM, then the builder has set up the W11 installation on your system to ignore the lack of a TPM. The security provided by the TPM is thus not available. I do not recommend it, but that is just my opinion. A TPM costs less than $20 for my board (Gigabyte) at Amazon.

Did your system come with memory? What make and model, and how much? I suggest you remove the 128GB sticks and insert the original sticks. Post the restrictions you have on altering your system to avoid violating your warranty. Please post an image of your board outlining the various VRMs. I am beginning to understand that there are three VRMs on your board: CPU_VRM, VRMABCD and VRMEFGH. Did you really spend 112 Euros for the adhesive to attach your VRM heat sinks?

John.

0 Likes

reeflex, open a Run dialog and type in tpm.msc to see the status of the TPM. John.

0 Likes

Hey John,

 

Thanks for sticking with me.

 

You're right. Here is the TPM status, looks like I have one.

reeflex_0-1725223039858.png

 

 

It came with 4x 32gb SK hynix sticks (HMAA4GR7AJR8N-XN):


Didn't have an issue with the 32gb sticks.
 

I'll get the restrictions from Supermicro and share those. 

 

Supermicro support pointed out these VRM locations:

reeflex_1-1725223166249.png

 

So I'm planning to put heatsinks here to start with:

VRMEFGH:

reeflex_2-1725223193750.jpeg

 

 

and VRMABCD:

reeflex_3-1725223193612.jpeg

 

I'm not too sure about where CPU_VRM is, but it might be one of these with heatsinks?

 

IMG_9911.jpg

IMG_9912.jpg

 

Haha no, I ordered this for $8:

https://www.amazon.com/gp/product/B072MSXHJD 

 

Unfortunately though, the heatsinks don't arrive until Sep 7, and I have until Sep 5th to send off the RAM. So a bit of a pickle. 

  

0 Likes

reeflex, no, I'm still here. Short reply, dinner time here. I have been looking at the details of your MB and other recent MBs. They all look similar. More later... John.

Welp, the heatsinks don't arrive until later this week. But I did rig a little 40mm fan over the VRMEFGH and the system still rebooted within a minute or so of running testmem5. I think VRMEFGH was around 85C when it rebooted so maybe the heat of these isn't the only issue. Maybe it's more of a power issue.

Unfortunately, it's looking best to just escalate the return of these sticks with eBay and go from there.

0 Likes

reeflex, with two monster PSs, I do not think it is a power problem. I would be much more suspicious of the 128 GB sticks. I need to see some Event Viewer screenshots of the errors and will work on a procedure for you to follow. In the meantime, you can test the memory sticks. Remove all but one and run a memory test. Test each stick and remember the good and the bad. Here is a test you can use. Ask if you have questions. John.
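For keeping track of the one-stick-at-a-time testing, something like the sketch below might help. It is only a bookkeeping aid: the stick labels and the pass/fail entries are made up for illustration and are not results from this system.

```python
from dataclasses import dataclass, field


@dataclass
class StickLog:
    """Remember which sticks passed or failed a single-stick memory test."""
    results: dict = field(default_factory=dict)

    def record(self, stick: str, passed: bool) -> None:
        # Store the outcome of one test run for one stick.
        self.results[stick] = passed

    def bad(self) -> list:
        # Sticks that failed their test.
        return [s for s, ok in self.results.items() if not ok]

    def untested(self, all_sticks: list) -> list:
        # Sticks that have no recorded result yet.
        return [s for s in all_sticks if s not in self.results]


# Hypothetical labels for the eight 128GB sticks.
sticks = [f"stick-{i}" for i in range(1, 9)]

log = StickLog()
log.record("stick-1", True)   # illustrative entry, not a real result
log.record("stick-2", False)  # illustrative entry, not a real result

print("bad:", log.bad())
print("still to test:", log.untested(sticks))
```

Labeling each stick physically (a bit of tape with the same name) keeps the log and the hardware in sync.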

0 Likes

Hey John,

 

Thanks for the reply. It makes sense that it likely isn't power but an issue with the memory. This seller isn't giving me their return address in English, so I don't see this being a friendly RMA process even if it were a bad stick or two. But I'm down to test each stick just for the heck of it.

 

Here are the error events I believe. One of which is a 'Memory' component error (event 46) which occurred first. If I blurred out something you need, let me know. Wasn't sure what might be sensitive or not.

 

event 46 WHEA-Logger details.png

event 46 WHEA-Logger.png

event 1796 TPM-WMI details.png

event 1796 TPM-WMI.png

 

Please let me know what you think and I'll work on testing the sticks.

    

0 Likes

reeflex, open the Event Viewer, expand Windows Logs, and select System. See below: click Filter Current Log, check Critical, and click OK. Select an error (maybe use the time), and in the new window click Details. Inside this window, right-click, then click Select. After all are selected, right-click and select Copy. Paste this into your reply. Let me know if you have questions. John.

Elog2.png

Elog3.png
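The same filter can also be run from a script instead of clicking through Event Viewer. This is a rough sketch, assuming the built-in Windows `wevtutil` tool is on the PATH; the function name `critical_events_cmd` is made up here. `Level=1` is the Critical level, `/rd:true` returns newest events first, and `/c` limits the count.

```python
import platform
import subprocess


def critical_events_cmd(count: int = 5) -> list:
    """Build a wevtutil command for the newest Critical events in the System log."""
    return [
        "wevtutil", "qe", "System",
        "/q:*[System[(Level=1)]]",  # XPath filter: Level 1 = Critical
        f"/c:{count}",              # maximum number of events to return
        "/rd:true",                 # reverse direction: newest first
        "/f:text",                  # plain-text output, easy to paste into a reply
    ]


if __name__ == "__main__":
    cmd = critical_events_cmd()
    if platform.system() == "Windows":
        # Only attempt the query on Windows, where wevtutil exists.
        done = subprocess.run(cmd, capture_output=True, text=True)
        print(done.stdout)
    else:
        print("Windows only; would run:", " ".join(cmd))
```

The text output can be pasted straight into a post, which avoids blurry screenshots.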

0 Likes

Thanks for the instructions John. I've done what you said and provided the details on this error from today. Please let me know if you see anything.

reeflex_0-1725322487627.png

 

 

+ System
  - Provider
     [ Name] Microsoft-Windows-Kernel-Power
     [ Guid]
    EventID 41
    Version 9
    Level 1
    Task 63
    Opcode 0
    Keywords
  - TimeCreated
     [ SystemTime] 2024-09-02T18:45:33.9362335Z
    EventRecordID 19567
    Correlation
  - Execution
     [ ProcessID] 4
     [ ThreadID] 8
    Channel System
    Computer
  - Security
     [ UserID]
- EventData
    BugcheckCode 0
    BugcheckParameter1 0x0
    BugcheckParameter2 0x0
    BugcheckParameter3 0x0
    BugcheckParameter4 0x0
    SleepInProgress 0
    PowerButtonTimestamp 0
    BootAppStatus 0
    Checkpoint 16
    ConnectedStandbyInProgress false
    SystemSleepTransitionsToOn 8
    CsEntryScenarioInstanceId 48
    BugcheckInfoFromEFI false
    CheckpointStatus 0
    CsEntryScenarioInstanceIdV2 48
    LongPowerButtonPressDetected false
    LidReliability false
    InputSuppressionState 0
    PowerButtonSuppressionState 0
    LidState 3
0 Likes

reeflex, I need to see several with a non zero Bug Check code and also check the Friendly View. What were you doing when this crash occurred? Any luck on the memory testing? John.
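If there are a lot of Kernel-Power events to go through, the bug check code can be pulled out of the event XML with a few lines of script. This is only a sketch: the XML below is a trimmed, hand-written sample in the Event Viewer XML layout using the values posted above, and `bugcheck_code` is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Namespace used by Windows event XML.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Trimmed, hand-written sample; a real event has many more fields.
SAMPLE = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-Kernel-Power" />
    <EventID>41</EventID>
    <Level>1</Level>
  </System>
  <EventData>
    <Data Name="BugcheckCode">0</Data>
    <Data Name="BugcheckParameter1">0x0</Data>
  </EventData>
</Event>"""


def bugcheck_code(event_xml: str) -> int:
    """Return the BugcheckCode of one event, or 0 if the field is missing."""
    root = ET.fromstring(event_xml)
    node = root.find("e:EventData/e:Data[@Name='BugcheckCode']", NS)
    if node is None or not node.text:
        return 0
    return int(node.text)


code = bugcheck_code(SAMPLE)
if code == 0:
    print("BugcheckCode 0: no stop error was recorded before the restart")
else:
    print(f"BugcheckCode {code} (0x{code:X}): a stop error was recorded")
```

A code of zero generally means Windows did not get a chance to record a stop error, which is why those events give so little to go on.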

0 Likes

The test is 35% complete on the first stick; no error yet. Didn't realize how long it would take to run the Windows Memory Diagnostic haha. I'll look for non-zero bug check codes when it boots back up. 

 

I was stress testing the memory by running TestMem5 when it crashed.

0 Likes

reeflex, sorry about the time. I have run mdsched.exe on my 4x8GB and will do it again and see if it can be run faster considering how much you need to test. John.

0 Likes

no problem, thank you. 

 

there was no issue with the first stick according to WMD. It probably took a few hours to run that test.

 

All the critical errors I see in event viewer have bugcheckcode of 0 so that doesn't seem to help. 

 

I'm calling ebay in a couple hours to get help with the return of this memory so probably not worth it to continue testing (unless I somehow get stuck with this memory) at this point.

 

If I can get these returned and still feel like running 8x 128GB sticks, I'll stick with the QVL and give these a try: https://www.ebay.com/itm/196148403257

 

Thanks for all the help John.

0 Likes

reeflex, I ran mdsched.exe this morning and it took one hour for 32 GB, so it should take about four hours for each of your sticks. mdsched.exe has Options, but I could not access them because I have not installed boot drivers for my Bluetooth keyboard. If you do not have a Bluetooth KB, you should be able to access the Options and Esc buttons. Please give me a link to your QVL list so I can take a look. I am afraid it will be expensive for a full TB of memory. Please don't buy any memory till I take a look. Finding only Event Log entries with a Bug Check code of zero may mean it is a power problem. This would most likely be a VRM problem. When your VRM coolers arrive, you should test with your current memory. Thanks, John.
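The timing estimate can be worked out as below. It assumes the test time scales linearly with capacity from the one measured data point (one hour for 32 GB), which is a guess rather than something that was measured on 128 GB sticks.

```python
def estimated_hours(stick_gb: float, sticks: int, hours_per_32gb: float = 1.0):
    """Estimate mdsched.exe run time, assuming time scales linearly with capacity."""
    per_stick = stick_gb / 32 * hours_per_32gb
    return per_stick, per_stick * sticks


per_stick, total = estimated_hours(stick_gb=128, sticks=8)
print(f"about {per_stick:.0f} hours per stick, {total:.0f} hours for all eight")
```

Testing the sticks one at a time is therefore more than a full day of run time, not counting the time spent swapping sticks.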

0 Likes

Thanks John. I'll try testing another stick later if there is time. I have a non-Bluetooth keyboard I can use, so I will check the options and see what is there.

 

The QVL list is here:

https://www.supermicro.com/en/products/motherboard/M12SWA-TF 

reeflex_0-1725380192514.png

I think this option is the only one with 128gb as an option:

reeflex_1-1725380612711.png

 

 

Thanks for checking and looking out for me as I don't want to repeat this mistake. Also, depending how expensive the RAM is I might not bother upgrading.

 

Rambling questions:

Would it be a power problem with the motherboard, like it can't provide the needed power? Or the outlet can't provide sufficient power to the system? Or the VRMs are indeed overheating?

0 Likes

reeflex, it now seems to be a power problem with the MB VRM. When the VRM cannot deliver the current being demanded, the voltage drops until the processor cannot run properly. I am still surprised you did not see some errors other than just Kernel Power. Do you have any problems with the lights flickering? Please post a picture of one of your memory sticks so I can see the make and model. Thanks, John.

0 Likes

Thanks John.

 

I have until tomorrow to get this memory shipped off otherwise I will have to try to make it work or be stuck reselling them.

 

I finally got a response from SuperMicro after the holiday:

"After checking, AMD Ryzen Master is the utility for overclocking control. This platform doesn’t support CPU and memory overclocking. As a result, this utility may not compatible to this platform. Normally, no need to change any setting inside the BIOS for 8 x 128Gb memory. "

Should I share with them the issue I'm having and the memory I'm using?

 

Makes sense it could be a power problem.

 

Yea all critical errors are kernel power:

reeflex_0-1725386171255.png

 

 

No problem with lights flickering. My monitors don't turn off, nothing else shows signs of power issues.

 

Here is the memory:

128gb stick (2).jpg

 

 

So this likely wouldn't be solved by putting heatsinks on the VRM? The heatsinks just arrived, but I don't want to put them on if it won't help. It would take up to 24 hours for the adhesive to settle and if there is a motherboard issue, I wouldn't want to tamper with it and ruin the chance of a RMA.

0 Likes

reeflex, thanks for the picture. One of the things I do not like about RM is AMD's saying it is an overclock utility. You can tell them you only want to use it to monitor the system. You could try AIDA64; it has a free trial. The crash could be solved by heat sinks on the VRM. Did the sinks come with double-stick thermal tape? If so, use it so the sinks can be removed. Thermally it is not as good as the adhesive would be, but it is close, and we can see how much it lowers the temperature while the sinks can still be removed. Please give it a try with the double stick. John.

0 Likes

I'll check with SM support about using RM to monitor the system. I can also give AIDA64 a try; is there anything in particular you want me to screenshot from this program?

 

The heatsinks did come with tape: https://www.amazon.com/dp/B0B7F3HZ1M?th=1

I assume it is double sided thermal tape? But it isn't clear on what exactly the tape is. 

IMG_9921.jpg

IMG_9920.jpg

  

0 Likes

reeflex, yea, use the sticky tape that came with the sinks; then you can remove them if needed. I have used a very similar set on my MB in the past. The M.2 sockets on my current board came with tape attached to the sinks for SSD sticks.

I really cannot think of specific values I would like to see from AIDA64. I really want to see RM while Cinebench is running. Let's put off the AIDA64 idea. Thanks, John.

0 Likes

Ok cool, I'll give the tape and heatsink a try very soon.

 

This is the latest reply from Supermicro, after I asked about using RM just for monitoring and shared the RAM I'm using and the issue that's going on:

 

'Unfortunately, we don’t have any information regarding to the Ryzen Master utility.  You may need to check with AMD directly.  For the reboot issue, the root cause may relate to the temperature of the memory VRM or one of the memories caused this issue.  You can change the fan speed to full speed and test the memories by pair to verify this issue.'

0 Likes

reeflex, what fan are they talking about? If you have control, set it to run full speed.  John.

0 Likes

What wall voltage does your system use, 120 Volts or 240 Volts? Thanks, John.

0 Likes

Hey John,

 

SM support said this about the fans

"You can change the fan mode inside the IPMI GUI. Please refer to the BMC/IPMI user guide as a reference. https://www.supermicro.com/about/policies/disclaimer_manuals.cfm?url=/manuals/other/BMC_Users_Guide_...

 

I think I can access 'BMC' through the BIOS in order to change the speed to Full Speed. I'm pretty sure the fan speeds increase to full speed automatically, though, based on the workload, so this might not be the fix; but I'll give it a go anyway.

 

I'm using 120V outlet here.

0 Likes

Accessing/setting up the BMC seems like a bit of a pain, so I haven't done that yet. 

 

I did put the heatsinks on via the sticky tape. The temperature is doing much better with the heatsinks and 40mm fan blowing on them. 

 

But, it still crashed just over 4 minutes into the TestMem5 test with 4x 128gb sticks installed. These were the stats right before crashing. VRMEFGH is 77C here. 

reeflex_0-1725406610979.png

 

This temperature should be fine, so maybe it is a RAM issue or I can't provide enough power to the system? I just wouldn't expect 1TB of RAM to have a power issue but not 256GB.

0 Likes

reeflex, SM support seems to have a hangup about security, including asking me every time I try to access it. Here is an article I found about BMC and IPMI. I still have not found a usable user manual. Your system User Manual has lots about BMC. I think it is accessed via the Internet. There are also videos on the Internet. Usually the fan control provides a profile defined by the user. If you do not mind the noise, set the speed to max always. See if you can find the Event Log of today's crash. John.

0 Likes

Oh sorry, this is the guide that SM support linked to: https://www.supermicro.com/manuals/other/BMC_Users_Guide_X12_H12.pdf 

 

Thanks for finding info on setting up the BMC/IPMI. I don't think I'll be able to set it up before I have to send this memory back tomorrow, though. It seems a bit difficult to set up unless I'm missing something. But it does look like something I should try to set up before trying a different batch of 8x 128GB sticks.

 

I'll check the event log of today's crash soon and post it here.

 

Thanks John.

0 Likes

It is another kernel power error with BugcheckCode 0 

reeflex_0-1725418656398.png

 

 

0 Likes

I'm packing up the memory John, thanks for all the help. 

 

This memory looks to be the one on the QVL: https://www.ebay.com/itm/196148403257 

reeflex_0-1725422694629.png

 

 

I'll likely give these a try next month if it looks good to you. What do you think?

0 Likes