Hi everyone,
I am a software engineer and researcher working on Linux. I am running my own compute-intensive, heavily threaded C++ code, which makes use of AVX2 instructions.
I've recently noticed some weird behavior when running my software, namely unexplained frequency drops under load. The problem seems to manifest with hyperthreading (empirically, I would say when the number of threads exceeds 24). Here is a graph showing the problem in the top chart, with the program running on 32 threads. The red line shows temperature, so we can see that the frequency goes down even though the temperature stays under 60C. The other two charts show other systems where everything is fine. For these reasons I suspect that I have a hardware problem and not a software one.
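For anyone who wants to log the same kind of frequency trace on Linux, here is a minimal sketch. It assumes the kernel exposes the standard cpufreq sysfs files (`scaling_cur_freq`, reported in kHz). The function names and the `sysfs_root` parameter are made up for illustration, and this is not the tool that produced the graph above.

```python
import glob
import os


def read_core_frequencies_mhz(sysfs_root="/sys/devices/system/cpu"):
    """Return {logical_cpu_index: frequency_in_MHz} from cpufreq sysfs files."""
    freqs = {}
    # cpu[0-9]* matches cpu0, cpu17, ... but not sibling dirs like cpufreq/cpuidle
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cpufreq", "scaling_cur_freq")
    for path in glob.glob(pattern):
        # .../cpuN/cpufreq/scaling_cur_freq -> "cpuN"
        cpu_dir = os.path.basename(os.path.dirname(os.path.dirname(path)))
        with open(path) as handle:
            khz = int(handle.read().strip())  # the kernel reports kHz
        freqs[int(cpu_dir[3:])] = khz / 1000.0
    return freqs


def summarize(freqs):
    """Return (min, mean, max) in MHz, or None if nothing could be read."""
    if not freqs:
        return None
    values = list(freqs.values())
    return min(values), sum(values) / len(values), max(values)
```

Calling `summarize(read_core_frequencies_mhz())` once per second while the workload runs gives a rough min/mean/max trace that can be lined up against temperature readings.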
I should mention that other software runs fine:
System specs:
What I have tried, without any success:
I am excluding a software implementation problem since I've tested the same software (built using the same compiler, same flags) on other systems where this issue doesn't occur.
I would really appreciate any advice and, in case there is a hardware problem, is it the CPU or the motherboard?
Thanks!
foolnotion, I might have a comment if I could see Ryzen Master (RM) during the speed dips, but I am fairly sure it does not run under Linux. Can you at least run RM and Cinebench under W11 and post a screenshot of RM? Is there any way you can borrow another 5950X or MB? The graphs alone would lead me to believe it was a temperature problem. I do not have a clue how Linux handles BIOS parameters like Max Operating Temperature (Tjmax) but if it is not set correctly in the BIOS this is the symptom I would expect to observe. It is specified as 90C. Thanks and enjoy, John.
I was able to reproduce the behavior on a similar system. My best guess so far is that the observed behavior is a consequence of contention due to hyper-threading and memory/cache access patterns under heavily vectorized/SIMD workloads. Here are the two systems:
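One way to test the hyper-threading guess is to run the workload on one logical CPU per physical core and compare the frequency trace with the full 32-thread run. Below is a minimal sketch, assuming a Linux kernel that exposes `topology/thread_siblings_list` in sysfs. The helper names are hypothetical, and this is not something that was actually run in this thread.

```python
import glob
import os


def parse_cpu_list(text):
    """Parse a kernel CPU list such as '0,16' or '0-1' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def one_cpu_per_core(sibling_lists):
    """Keep only the lowest-numbered logical CPU of every SMT sibling group."""
    return sorted({parse_cpu_list(text)[0] for text in sibling_lists if text.strip()})


def read_sibling_lists(sysfs_root="/sys/devices/system/cpu"):
    """Read the thread_siblings_list file of every logical CPU (assumed layout)."""
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "topology", "thread_siblings_list")
    lists = []
    for path in glob.glob(pattern):
        with open(path) as handle:
            lists.append(handle.read())
    return lists
```

On Linux, `os.sched_setaffinity(0, one_cpu_per_core(read_sibling_lists()))` called before launching the worker threads would restrict the process to one hardware thread per core. If the frequency drop disappears in that configuration, that would point towards SMT contention rather than a power or thermal limit.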
I still find it extremely weird that the frequency goes down so far, but at least I know that it's not a hardware fault. It has to be a combination of the specific workload and architecture-specific Ryzen 5950X limitations, since the 3950X CPU I tested is not affected.
Will keep posting updates if I discover anything new.
foolnotion, I run my 3970X with NUMA mode enabled with a minor increase in performance. You might try this. Enjoy, John.
Thanks for the advice, but I tried every possible NUMA mode in the BIOS and the problem reoccurs every time.
foolnotion, if it is possible to reproduce in Windows, then please do and post a screenshot of Ryzen Master (RM). Thanks, John.
Hi,
I was able to run in Windows/WSL and take screenshots of Ryzen Master. The frequency still goes down (my app running in the background on 32 threads): https://imgur.com/a/pVNJWg8
Thanks,
Bogdan
Thanks, Bogdan. I will spend some time understanding the images. At first glance your Ryzen is throttling due to several limits imposed by the BIOS (red and yellow meters). I will return shortly. Thanks, John.
This is a friend's computer, but the original issue occurs on mine, where the limits are much higher (right now I use PPT/EDC/TDC 250/140/170). I also tried the motherboard limits; that didn't change anything except that the motherboard limits cause more heat.
Bogdan, please explain the multiple RM images. Thanks, John.
Sorry, I should have explained. The images are taken in sequence, with the program running in the background (first image at the top, last image at the bottom). We can see that we start at 3.7 GHz and then downclock to ~2.5 GHz. I took multiple screenshots a few seconds apart (see the clock in the bottom right for a timestamp) in order to capture whatever pattern occurs in there (if there is any discernible pattern at all).
Thanks, Bogdan. I have not been able to explain what is happening. Please open a support request with AMD here. Here is the site to request to RMA your processor. I did notice you are not using a Profile on your memory. I also use G.Skill memory and use an XMP profile to run it faster. Thanks and enjoy, John.
EDIT: Since you are running on Windows, please try increasing the PPT/EDC/TDC limits using RM. Maybe it has a secret we do not know.
@misterj wrote: Thanks, Bogdan. I have not been able to explain what is happening. Please open a support request with AMD here. Here is the site to request to RMA your processor. I did notice you are not using a Profile on your memory. I also use G.Skill memory and use an XMP profile to run it faster. Thanks and enjoy, John.
EDIT: Since you are running on Windows, please try increasing the PPT/EDC/TDC limits using RM. Maybe it has a secret we do not know.
Why is @misterj allowed, knowing in advance that he cannot help the OP, to request an OS change because of an absurd obsession with Ryzen Master screenshots? IMO there should be limits: newcomers' lack of experience in support forums should not be taken advantage of for personal purposes or simple entertainment.
@foolnotion and future Linux users looking for help here:
Installing Windows to compare how your hardware reacts versus Linux is always interesting, but real help will NEVER ask you to abandon your host OS. If this happens, it is most likely that the other user will not be able to offer you a solution, so it is your decision whether to waste your time or not. Good luck!
Rest assured I have not abandoned my OS. I have not installed Windows on my machine; instead I asked a friend to help me test the issue on his machine, which has a similar configuration and the same CPU. It was not so unreasonable to verify the output of Ryzen Master (since in any case it does offer more info than the Linux sensors output).
Based on this test, I can conclude that the problem occurs on both machines: on mine (Linux) and on my friend's, both under Linux booted from a live ISO and under Windows (running the software in WSL).
Given that I can reproduce this on another machine, I don't think it's a hardware problem with my CPU (and even if it was, I bought my 5950X CPU at the end of 2020, so it's out of warranty and there is no point even trying an RMA).
I guess the next step is to make an official request for technical assistance to AMD. Thanks everyone for your help.
@Volanath this is not the right place for your question, please make another top-level forum post with your issue.
Okay, I understand.
But I don't know how to use a label when I post something.
Please tell me, please help me.
Volanath, LABEL?? Please open a new issue by clicking "Start a new Discussion". Post a screenshot of Ryzen Master (RM). Thanks, John.