We have actually removed the installer from ACML 6 to simplify the install process. The installer was a heavy process and seemed a bit much for installing a few libraries and .html documentation on a user's machine. ACML is not really an application and does not need to be registered in the computer's Add/Remove Programs uninstaller; you should be able to uninstall it simply by deleting the directory. Just unzip the ACML archive wherever you want on your machine (your home directory, /usr/local, or /opt, for example) and link your application with it in the usual way.
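For anyone wondering what "the usual way" looks like, here is a minimal sketch. The unpack location (/opt/acml6), the gfortran64_mp subdirectory, and myapp.c are examples only; adjust to where you actually unzipped the archive and which compiler variant you downloaded:

```shell
# Point the runtime linker at the unpacked ACML libraries (example path).
export LD_LIBRARY_PATH=/opt/acml6/gfortran64_mp/lib:$LD_LIBRARY_PATH

# Link against the multithreaded library; -fopenmp is needed for acml_mp.
gcc -fopenmp myapp.c -L/opt/acml6/gfortran64_mp/lib -lacml_mp -lgfortran -lm -o myapp
```

For the single-threaded library, link -lacml instead of -lacml_mp and drop -fopenmp.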
The _fma4 directories were deprecated and removed in ACML 6. ACML now does a proper CPUID feature check and enables FMA4 implementations at runtime. The documentation is stale; I'll make a note to fix that for the next patch release.
I've heard about this CPU-binding problem from a few sources. What can you tell me about the machine you are running on? Is it AMD hardware or Intel? Which Linux distro?
Thanks for the response.
Regarding the FMA4 runtime detection: is it still advisable to use the "-mfma4" compile flag when building for systems that support FMA4?
The machines I'm testing the threading on are of three types. AMD Opteron 2380, 6212, and 6320. Tests that ran on more than one node (MPI + Threads) were run across nodes of the same CPU. The OS is CentOS 6.5 x86_64.
The AMD systems all typically showed 100% CPU load on a single core, with the rest idle. The per-thread CPU utilization was such that all the threads together added up to 100%.
I just ran a test on Intel systems (Xeon E5420, 8 cores per system) and observed the same behavior.
So far, watching the processes and threads in 'top', the behavior has been that once all the threads are created, one core is always at 100% CPU and other cores periodically jump above 0%. On the 8-core AMD and Intel CPUs, the CPU load of all threads always adds up to 100%, alternating between 4 and 8 threads being the most active: when 4 of the 8 cores showed load above 0%, all 8 threads were at ~20-35% CPU, and when only one core was above 0%, only 4 threads showed CPU load. On the 32-core AMD systems the behavior differs slightly: one core stays at 100%, the other 31 cores sit at 0%, and all 32 threads stay around 3-4% CPU load.
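One quick way to tell whether this is an affinity problem: on Linux, each thread's allowed-CPU set is visible under /proc. A sketch, using this shell's own PID ($$) as a stand-in for the PID of the xhpl process you want to inspect:

```shell
# Print the allowed-CPU list for every thread of a process (Linux).
# $$ (this shell) stands in for the PID of your xhpl process.
pid=$$
for tid in /proc/"$pid"/task/*; do
    echo "thread ${tid##*/}: $(grep Cpus_allowed_list "$tid"/status)"
done
```

If every thread reports the same single core (e.g. "0"), the threads are all pinned to one core, and the one-core-at-100% symptom follows directly.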
If I run the exact same test, only linking against "openblas" instead of "acml_mp", I see the expected behavior: all threads at 100% CPU and all cores at 100% load. HPL using ACML and threads on a single 8-core system had a runtime of 255 seconds and only 11.5 Gflops; OpenBLAS in the same test had a 47-second runtime and 62.5 Gflops. When we tested our systems in the past (CentOS 5.7) using ACML 5.3.0, we always saw much better performance with ACML.
I'm also seeing odd behavior when trying to run HPL with multiple processes per node and only 1 thread. I set OMP_NUM_THREADS=1, and even tried linking HPL against "acml" (not "acml_mp"); when I run my command I see 8 processes running at 100% on all 8 cores, but I also observe that each process spawns 3 additional threads. They don't induce enough load to be a concern, but I want to understand the behavior before I document for our users how to use ACML in their applications.
I'm trying to reproduce this in my lab now.
I forgot to answer your question about FMA4. ACML 6 is built with feature-flag detection, so we can detect which optimized code path to take at runtime. The -mfma4 compile flag you mention affects *your* code, and generates FMA4 instructions in your binary. If you have no need for your code to be portable to non-AMD platforms, this is fine and should be a performance boost for you.
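As a quick sanity check before reaching for -mfma4, on Linux you can see whether the kernel reports FMA4 in the CPU feature flags (this only tells you what the hardware advertises, not which code path ACML selected):

```shell
# Look for the fma4 flag in the CPU feature list (Linux).
if grep -qw fma4 /proc/cpuinfo; then
    echo "FMA4 supported"
else
    echo "FMA4 not supported"
fi
```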
I have a machine set up to attempt to reproduce this binding problem, it's a 48-core AMD machine with 6175 SE processors.
Can you tell me exactly how you are running HPL? I want to make sure that I am reproducing what you describe. Here is what I see:
mpirun -np 4 ./xhpl
which appears to load about 4 cores with computation.
I also tried
mpirun -np 48 ./xhpl
and this appears to load 1 core with computation.
But, if I try
mpirun -np 12 ./xhpl
I still appear to be running on only 4 cores.
If I may ask: do you have a handy way to see which threads are bound to which core? I am using 'htop' with the 't' and 'H' options, but I don't see all the threads that I expect; it may not refresh fast enough.
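Not sure about htop's refresh rate, but on Linux 'ps' can take a one-shot snapshot of every thread together with the core it last ran on (the psr column), which sidesteps the refresh problem entirely:

```shell
# One-shot listing of all threads (-L) with the last CPU each ran on (psr).
ps -eLo pid,tid,psr,pcpu,comm | head -n 20

# Or restrict the listing to the xhpl processes only:
ps -C xhpl -Lo pid,tid,psr,pcpu,comm
```

If the psr column shows the same core number for every xhpl thread, they are competing for one core.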
Do you have any idea what spawns those "ghost" threads, or how?
I'm having the same problem, not only with HPL but with all applications: 1 process creates 2 or 3 extra threads even though I've set it not to. So I can't tell whether they're hurting the program's performance either.
I have not looked into the ghost threads and have no idea what causes them. When I observed them, they were producing no load, as seen in 'top' and in the output of 'ps'.
Try using the new release of ACML 6.0.6; I've removed all the code that bound threads to cores, so I'm curious if you find the performance profile changing on your systems.
Thank you! I just did a small-scale HPL run using 6.0.6 vs. OpenBLAS; with a tiny problem size the ACML-linked code runs faster than OpenBLAS and loads all cores. With a larger problem size all cores reach 100% load as expected. It appears your fix in 6.0.6 worked.
In reference to Crash's response, here are some observations:
With HPL I notice that if I run with 1 process per node and set OMP_NUM_THREADS=32 on a 32-core system, the process starts with 1 main thread and 3 additional threads. Those threads never show any CPU consumption. Once HPL has filled the memory it needs, all the worker threads spawn, and those 3 extra threads are still around but still appear to do nothing.
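To confirm how many threads a process actually has (including the idle "ghost" ones), counting the entries under /proc/&lt;pid&gt;/task is more reliable than watching top. A sketch, again using this shell's own PID as a placeholder for the xhpl PID:

```shell
# Count the threads of a process via /proc (Linux).
pid=$$   # substitute the PID of your xhpl process here
nthreads=$(ls /proc/"$pid"/task | wc -l)
echo "process $pid has $nthreads thread(s)"
```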