AMD’s 32-core 2990WX Threadripper CPU has always been a bit of an uncertain proposition. While undeniably fast in certain scenarios, the chip has marked performance regressions in other tests, and doesn’t always outperform the 16-core Threadripper 2950X. Now, there’s a utility, CorePrio, that can be used to restore much of the 2990WX’s missing performance under Windows 10.
When the 2990WX shipped, the explanation for its occasional performance drops focused on its memory access system and controller configuration. The thinking was that having 32 CPU cores connected to memory across just four memory channels caused intrinsic bandwidth congestion, starving some cores for memory access. But there have been signs of scheduler problems as well — it’s been known for some months that the 2990WX performs better under Linux than when running Windows, and that’s a definite sign of an underlying OS issue as opposed to a hardware problem.
Level1Techs has published an extensive report into their investigation of performance on the 2990WX. The initial assumption that memory bandwidth congestion is responsible for lower overall performance, while not wrong in all cases, has been proven incomplete. Level1 found that the same performance regressions were present in an Epyc 7551 they tested, which had eight memory channels instead of Threadripper’s four. Again, performance under Linux was fine, but performance in Windows was impacted. But Level1 also found strange behavior associated with changing Windows CPU affinities, and how this impacted overall performance testing.
Data and chart by Level1.
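For readers unfamiliar with the CPU affinities mentioned above: an affinity setting is commonly expressed as a bitmask in which each set bit marks one logical processor a process or thread is allowed to run on. The following is a minimal sketch of that encoding only. The helper names are hypothetical, and it does not call any Windows API or reproduce Level1's test setup.

```python
def cpus_to_mask(cpus):
    """Pack an iterable of logical CPU indices into an affinity bitmask."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu  # set the bit that corresponds to this logical CPU
    return mask


def mask_to_cpus(mask):
    """Unpack an affinity bitmask into a sorted list of logical CPU indices."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


# Hypothetical example: allow only logical CPUs 0 through 15.
first_sixteen = cpus_to_mask(range(16))
print(hex(first_sixteen))    # 0xffff
print(mask_to_cpus(0b1010))  # [1, 3]
```

Restricting a workload to one group of cores, as in the hypothetical mask above, is the kind of change that altered the benchmark results Level1 describes.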
What their investigation ultimately revealed is a problem with how Windows moves the threads of certain applications between cores on CPUs with more than one NUMA node. Level1 writes: “When only one NUMA node is recommended via the ‘ideal CPU’ the windows kernel seems to spend half the available CPU time just shuffling threads between cores.”
They continue:
Here’s an interesting twist: If you only have one OTHER NUMA node – windows seems to fall back to allowing the threads to establish themselves on the second NUMA node… This is most likely related to a bugfix from Microsoft for 1 or 2 socket Extreme Core Count (XCC) Xeons wherein a physical Xeon CPU has two numa nodes. In the past (with Xeon V4 and maybe V3), one of these NUMA nodes has no access to I/O devices (but does have access to memory through the ring bus).
If that’s true, then that work-around to make sure this type of process stays on the “ideal CPU” in the same socket has no idea what to do when there is more than one other NUMA node in the same package to “fail over” to.
The solution to this is a utility named CorePrio:
CorePrio solves this problem and allows threads to be scheduled evenly across the CPUs rather than Windows spending all of its time trying to shuffle them across the die. It looks as though the sharp performance regressions with the 2990WX were caused at least in part by Windows spending far more time moving workloads from CPU to CPU than it spent actually executing work. Obviously, this won’t boost Threadripper’s performance in applications where it already scaled well, but it should fix the performance regressions in multiple applications.
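To make the idea of scheduling threads "evenly across the CPUs" concrete, here is a small illustrative sketch of a round-robin placement that rotates across NUMA nodes before reusing a node. It is not CorePrio's actual algorithm, which the report does not detail. The topology constants (four nodes of 16 logical CPUs each, numbered contiguously) and the function names are assumptions chosen to resemble a 2990WX.

```python
from collections import Counter

NODES = 4           # assumption: four NUMA nodes, one per die
CPUS_PER_NODE = 16  # assumption: 8 cores x 2 SMT threads, numbered contiguously


def node_cpus(node):
    """Return the logical CPU indices assumed to belong to a NUMA node."""
    start = node * CPUS_PER_NODE
    return list(range(start, start + CPUS_PER_NODE))


def spread_evenly(thread_count):
    """Assign thread i to a logical CPU, rotating across nodes first."""
    placement = []
    for i in range(thread_count):
        node = i % NODES                        # next node in the rotation
        slot = (i // NODES) % CPUS_PER_NODE     # next free slot within that node
        placement.append(node_cpus(node)[slot])
    return placement


placement = spread_evenly(8)
print(placement)  # [0, 16, 32, 48, 1, 17, 33, 49]

# Count how many threads landed on each node.
print(Counter(cpu // CPUS_PER_NODE for cpu in placement))
```

With eight threads, each of the four hypothetical nodes receives two, and no thread needs to be migrated afterward to balance the load.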
It’s not yet clear whether the memory subsystem is still implicated. If threads are being allocated to the wrong NUMA node, it’s possible that memory accesses are being run mostly or entirely through a single memory controller. This would explain why an eight-channel Epyc in NUMA mode gives the same performance (with allowance for clock speed) as a four-channel Threadripper. And there may well be applications that don’t scale well in the 2990WX’s NUMA configuration for reasons unrelated to any shortcomings in the Windows 10 scheduler.
The full scope of the bug and its potential fixes haven’t been fully fleshed out yet, if the “fixes unknown Windows perf issue” wasn’t a clue above. Microsoft and AMD have not yet issued formal responses, and it’s not clear what the timeline is for fixing this problem via an OS update. But if you’re a 2990WX owner or were interested in becoming one, this could change the calculus on whether the chip is worth investing in, provided you’re a very particular kind of customer in the first place. Average and even not-so-average gamers need not apply, as chips like the 2990WX play in a very rarefied space to start with.
New Utility Can Double AMD Threadripper 2990WX Performance - ExtremeTech