I have an engineering team using a Dell server running Windows for Thermal Desktop analysis. It is a dual-CPU system with 128 cores. When they run any of their tests, the system only uses one socket and 64 cores. The Thermal Desktop team recommended asking Dell whether there is a BIOS setting that could change this, or an environment variable that prevents it from working. The Windows environment variable NUMBER_OF_PROCESSORS reports a value of 64, but I am not sure whether this is really an issue. I can run prime95, for example, and it will create all 128 threads and use all CPUs and cores.
If anyone has an idea of what might be preventing it from working, I would appreciate the input. Thank you very much.
Have you or your engineering team looked into whether the software package is capable of going beyond 64 threads? I imagine it can span multiple NUMA nodes and multiple processors, but perhaps the software vendor has a cap on the number of CPU cores or threads it can handle. It sounds like you've proven with prime95 that the platform in general can run applications on all cores.
When the thermal application is running, have you monitored it with Task Manager? You can have Task Manager display CPU threads by NUMA node, and I wonder whether the processes are being assigned to specific NUMA nodes/cores and staying only on the first socket, or whether they are spread evenly across both sockets but still using only half the threads (so no SMT threads).
Thank you for the reply. We have reached out to the developer of Thermal Desktop, and they informed us that it should not have a limitation on thread/core usage. I just tested it with my engineer, and even if he runs three instances of the software, all of them go to the one socket/NUMA node. He did open a second program that was not Thermal Desktop, and it was able to use the cores on the second socket, so all 128 got used by the two programs together. I am pretty convinced that it is a software limitation, but I am not familiar enough with exactly how the software's processes are managed by Windows. The vendor mentioned adding a system variable called OMP_NUM_THREADS and setting it equal to 128. It did not change the outcome. Just for testing, I set its value to 32, and it did limit the software to only 32 cores, but it won't go past 64.
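For anyone wanting to repeat the check, here is a minimal sketch of how the two variables could be compared from inside a process. It assumes Python is available on the server; the function name thread_env_report and the "request_exceeds_visible" key are made up for illustration and are not part of Thermal Desktop or OpenMP. It only reads the environment, so it shows what a newly started process would see, not what the application does internally.

```python
import os


def thread_env_report(env=None):
    """Compare the OpenMP thread request with the processor count Windows advertises.

    `env` defaults to the real environment; a plain dict can be passed for testing.
    """
    env = os.environ if env is None else env

    def as_int(name):
        # Return the variable as an int, or None if it is unset or not a number.
        value = env.get(name)
        try:
            return int(value) if value is not None else None
        except ValueError:
            return None

    omp = as_int("OMP_NUM_THREADS")
    nproc = as_int("NUMBER_OF_PROCESSORS")
    return {
        "OMP_NUM_THREADS": omp,
        "NUMBER_OF_PROCESSORS": nproc,
        # True when more threads are requested than the process is told exist.
        "request_exceeds_visible": omp is not None and nproc is not None and omp > nproc,
    }


if __name__ == "__main__":
    print(thread_env_report())
```

With OMP_NUM_THREADS=128 and NUMBER_OF_PROCESSORS=64, the last field comes back True, which matches the behaviour described above: a request of 32 is honoured, while a request of 128 stops at 64.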
Thank you for your help, but I do not think there is a fix currently. It must be how their software is written and interacts with the OS and CPU controller. I just thought I would ask in case there was someone out there more knowledgeable than myself that might know a fix.
Hello, so my $0.02: Starting with the obvious, please ensure you are using all the latest software updates
(including your application, Windows Updates, and the latest BIOS from Dell).
Next, are you using Windows Server 2016 or Windows Server 2019 (and do you have SMT enabled in BIOS)?
Windows Server 2016 has a limit on the number of logical processors it will support.
There are also a variety of other BIOS settings e.g. NUMA Nodes per Socket (NPS), etc.
that may affect how topology is abstracted by the Windows operating system.
The Windows Server Tuning Guide for AMD EPYC 7002 provides a good overview:
Meanwhile, you should be able to go into Task Manager, select the Performance tab, and check the OS' view of processors.
Given that you feel prime95 seems to function correctly, your Thermal Desktop Analysis Application may need to be further examined.
Windows has a notion of Processor Groups, each of which can contain a maximum of 64 logical processors, so this may require special attention depending on how you set NPS.
Please see: https://docs.microsoft.com/en-us/windows/win32/procthread/numa-support
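To see how Windows has actually split the logical processors into groups, something like the following could be run on the server. This is only a sketch: it assumes Python is installed on the machine, and it calls the kernel32 functions GetActiveProcessorGroupCount and GetActiveProcessorCount through ctypes. The helper names min_groups_needed and windows_group_layout are hypothetical and chosen just for this example.

```python
import ctypes
import sys

MAX_GROUP_SIZE = 64  # a processor group holds at most 64 logical processors


def min_groups_needed(logical_processors, group_size=MAX_GROUP_SIZE):
    """Smallest number of groups that can hold this many logical processors."""
    return -(-logical_processors // group_size)  # ceiling division


def windows_group_layout():
    """Return the active logical processor count of each group, or None off Windows."""
    if sys.platform != "win32":
        return None
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    # WORD GetActiveProcessorGroupCount(void)
    kernel32.GetActiveProcessorGroupCount.argtypes = []
    kernel32.GetActiveProcessorGroupCount.restype = ctypes.c_ushort
    # DWORD GetActiveProcessorCount(WORD GroupNumber)
    kernel32.GetActiveProcessorCount.argtypes = [ctypes.c_ushort]
    kernel32.GetActiveProcessorCount.restype = ctypes.c_uint32
    group_count = kernel32.GetActiveProcessorGroupCount()
    return [kernel32.GetActiveProcessorCount(g) for g in range(group_count)]


if __name__ == "__main__":
    layout = windows_group_layout()
    if layout is None:
        print("Not running on Windows; nothing to query.")
    else:
        for group, count in enumerate(layout):
            print(f"group {group}: {count} logical processors")
        print(f"total: {sum(layout)}")
```

With 128 logical processors, min_groups_needed(128) gives 2, so the system cannot fit in a single group. As far as I know, on Windows versions before Windows 11 and Windows Server 2022 a process starts in one group and its threads stay there unless the application explicitly assigns them to other groups, which would be consistent with the 64-core ceiling reported here.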
Meanwhile, hope this helps.