The article is quite in-depth, but I wonder if Microsoft will include this type of fix in a future update. The problem never affected Linux, which made it possible to diagnose by comparison.
The CPU_MASK setting follows whatever affinity you specify with the CLI utility “start” (or whatever you change via Task Manager). The ideal_cpu setting, however, only recommends CPUs from one NUMA node when you launch via the “start” CLI. When you set the affinity via Task Manager instead, the ideal_cpu is chosen from any NUMA node, not just one.
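To make the affinity-mask side of this concrete, here is a small illustrative sketch (my own, not anything from the Windows scheduler) of how the hex mask that `start /affinity <mask>` expects maps onto NUMA nodes. It assumes a 2990WX-like layout of 4 NUMA nodes with 16 logical CPUs each, numbered contiguously per node; real CPU enumeration can differ.

```python
# Assumption: 4 NUMA nodes x 16 logical CPUs, numbered contiguously per node.
LOGICAL_CPUS_PER_NODE = 16

def node_affinity_mask(node: int) -> int:
    """Bitmask covering every logical CPU in one NUMA node."""
    base = node * LOGICAL_CPUS_PER_NODE
    mask = 0
    for cpu in range(base, base + LOGICAL_CPUS_PER_NODE):
        mask |= 1 << cpu  # set the bit for this logical CPU
    return mask

# Pinning a process to node 0 would be:  start /affinity FFFF app.exe
print(f"{node_affinity_mask(0):X}")  # FFFF
# Node 1 would be:                       start /affinity FFFF0000 app.exe
print(f"{node_affinity_mask(1):X}")  # FFFF0000
```

With a mask like this, CPU_MASK is pinned to one node, but (per the behavior described above) the ideal_cpu recommendation is a separate setting that the kernel chooses on its own.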
When only one NUMA node is recommended via the “ideal CPU”, the Windows kernel seems to spend half the available CPU time just shuffling threads between cores. That explains the high-CPU-utilization-but-nothing-gets-done aspect of the low performance. It also means it’s a bit tricky to spot apps/threads that are flailing about this way.
Here’s an interesting twist: if there is only one OTHER NUMA node, Windows seems to fall back to letting the threads establish themselves on the second NUMA node (the ideal-CPU tag is basically ignored).
This is most likely related to a bugfix from Microsoft for one- or two-socket Extreme Core Count (XCC) Xeons, wherein a physical Xeon CPU has two NUMA nodes. In the past (with Xeon V4 and maybe V3), one of those NUMA nodes had no access to I/O devices (but did have access to memory through the ring bus).
If that’s true, then the workaround that keeps this type of process on its “ideal CPU” in the same socket has no idea what to do when there is more than one other NUMA node in the same package to “fail over” to.
In the case of the Threadripper 2990WX, there are three other NUMA nodes in the socket.
As such, the algorithm seems to shuffle threads aimlessly, and that is one plausible explanation for why Indigo performance is so much worse on the 32-core 2990WX than on the 16-core 2950X.