The article is quite in-depth, but I wonder if Microsoft will include this kind of fix in a future update. The problem never affected Linux, which is what allowed the authors to diagnose it.
https://level1techs.com/article/unlocking-2990wx-less-numa-aware-apps
The CPU_MASK setting follows whatever affinity you specify with the CLI utility “start” (or whatever you change via Task Manager). The ideal_cpu setting, however, will only recommend CPUs from one NUMA node when using the “start” CLI. When setting the affinity via Task Manager, the ideal_cpu is chosen from any NUMA node, not just one.
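For anyone wanting to try this, here is a minimal sketch of how the hexadecimal mask passed to `start /affinity` can be built. It assumes the 2990WX exposes 4 NUMA nodes of 16 logical CPUs each, numbered contiguously (node 0 = CPUs 0-15, and so on); that layout is an assumption, so check your own topology first. The helper names are made up for illustration.

```python
# Assumption: 4 NUMA nodes x 16 logical CPUs, numbered contiguously per node.
CPUS_PER_NODE = 16


def affinity_mask(cpus):
    """Return an integer with one bit set for each logical CPU index."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU index must be non-negative")
        mask |= 1 << cpu
    return mask


def node_cpus(node, cpus_per_node=CPUS_PER_NODE):
    """Logical CPU indices of one NUMA node under the assumed layout."""
    first = node * cpus_per_node
    return range(first, first + cpus_per_node)


def start_affinity_arg(cpus):
    """Format the mask as the hex string that `start /affinity` takes."""
    return format(affinity_mask(cpus), "X")


if __name__ == "__main__":
    for node in range(4):
        print(f"node {node}: start /affinity {start_affinity_arg(node_cpus(node))} app.exe")
```

Under that assumed layout, pinning to node 1 would look like `start /affinity FFFF0000 app.exe` (with `app.exe` standing in for whatever you are launching). Note that this only sets the affinity mask; per the post above, the ideal_cpu hint is a separate thing that the kernel picks on its own.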
When only one NUMA node is recommended via the “ideal CPU”, the Windows kernel seems to spend half the available CPU time just shuffling threads between cores. That explains the high-CPU-utilization-but-nothing-gets-done aspect of the low performance. It also means it’s a bit tricky to spot apps/threads that are flailing about this way.
Here’s an interesting twist: if you only have one OTHER NUMA node, Windows seems to fall back to allowing the threads to establish themselves on the second NUMA node (the ideal CPU tag is ignored, basically).
This is most likely related to a bugfix from Microsoft for 1- or 2-socket Extreme Core Count (XCC) Xeons, wherein a physical Xeon CPU has two NUMA nodes. In the past (with Xeon V4 and maybe V3), one of these NUMA nodes had no access to I/O devices (but did have access to memory through the ring bus).
If that’s true, then the workaround that keeps this type of process on the “ideal CPU” in the same socket has no idea what to do when there is more than one other NUMA node in the same package to “fail over” to.
In the case of the Threadripper 2990WX, there are three other NUMA nodes in the socket.
As such, that algorithm seems to just aimlessly shuffle threads, and that is one plausible explanation for why the Indigo performance is so much worse on the 2990WX than on the 16-core 2950X.
I'd contest that Microsoft codes for Xeon: Here is what can go wrong if you try to compile Chromium on a 24-core Xeon Workstation
On the topic of poor 2990WX performance, I think it is also AMD Marketing being, well, AMD Marketing, which has kept reviewers and readers in the dark about the causes of the performance scaling anomalies for so long.
For one, AMD seems to have no interest in the mainstream press running Linux benchmarks on the 2990WX. The Reviewer's Guide does not even mention Linux (this fact was originally disclosed by German computer news magazine Heise).
When Hardware Unboxed started to publish 2990WX Linux results, it was only after being urged to by their readers and Patreon supporters. Nobody from AMD even suggested this.
Finally, Kyle Bennett from HardOCP reported on his involvement in investigating various 2990WX performance problems (Adobe Premiere Pro, Handbrake), and his AMD contacts never indicated that they were aware of the possibility of an issue like the one Le....
I would blame AMD more than MS in this case. AMD should have done some rigorous testing and worked with MS on the fix before the release of the CPU.
Likely they did, and came to the same conclusion quite a lot of smart people did: that it was a failing of the architecture in Windows caused by coding for Intel XCC Xeons. It wasn't until the Linux for Windows feature that these really smart people were able to home in on the cause. Microsoft also has a vested interest in ensuring these chips run as well on Windows as they do on Linux, since they want these people to be using Windows 10.
Wondering if anyone is interested in pursuing some sort of class action suit against AMD for this defective design. It seems obvious the processor is not fit for use with Windows, given this performance scaling issue. It doesn't seem they are pushing Microsoft to fix the problem. Therefore, we've paid almost $2,000 for a crippled processor. Don't get me started on why this processor will only run with memory dumbed down to 2133 MHz... Another defect.