8 Replies Latest reply on Sep 26, 2014 6:31 PM by treydock

    Installing ACML6 and using threads with HPL


      I recently upgraded to ACML 6 (I was previously using 5.3.0) and have run into a few problems.  First, I found no documentation on how to properly install ACML6, so I assumed the contents of the downloaded tar file simply go into what would be the install prefix.  I checked the documentation under the file acml- and found that it mentions install paths that do not appear in the tar file, in particular the "fma4"-suffixed items.  Is the documentation saying those directories are the ideal location for the files, or is this a remnant of version 5, which supplied the "_fma4" directories?


      The issues I've had testing ACML6 performance have been with HPL 2.1 compiled using gcc-4.8.2 and either OpenMPI-1.8.2 or MVAPICH2-2.0.  What I've found is that the desired number of threads is spawned, but only one core is being used.  I'm curious what can be done to debug this, or what information I can provide to help find the cause.  Compiling HPL against something like OpenBLAS in the same way, modifying only the LAinc and LAdir options, does not show the same binding to a single core.



      - Trey


      HPL Makefile for ACML:


      MPdir        = $(MPIHOME)

      MPinc        = -I$(MPdir)/include

      MPlib        = -L$(MPdir)/lib64 -lmpi

      LAdir        = $(ACML_MP_ROOT)

      LAinc        = -I$(LAdir)/include

      LAlib        = -L$(LAdir)/lib -lacml_mp

      CC           = mpicc

      CCNOOPT      = $(HPL_DEFS)

      CCFLAGS      = $(HPL_DEFS) -fomit-frame-pointer -funroll-loops -O3 -mfma4 -W -Wall -lpthread

      LINKER       = $(CC)




      ACML_MP_ROOT = /apps/gcc-4.8.2/acml-gfortran64/

        • Re: Installing ACML6 and using threads with HPL

          Hi treydock,


          We have actually removed the installer from ACML 6 to simplify the install process.  The installer was a heavy process and seemed a bit much for installing a few libraries and .html documentation on a user's machine.  ACML is not really an application and does not need to be registered in the computer's Add/Remove Programs uninstaller; you should be able to remove the install just by deleting the directory.  Just unpack the ACML archive wherever you want on your machine (your home directory, /usr/local, or /opt, for example) and link your application with it in the usual way.
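As a rough sketch of the "unpack and link" approach described above, a small shell helper can point a build at the unpacked directory. The function name `use_acml` is hypothetical, and the layout it checks (`lib/libacml_mp.*` under the prefix) is an assumption taken from the `-L$(LAdir)/lib -lacml_mp` line in the Makefile posted earlier; adjust it to whatever your archive actually contains.

```shell
#!/usr/bin/env bash
# Hypothetical helper: point a build at an unpacked ACML tree.
# Assumes the multithreaded library sits at <prefix>/lib/libacml_mp.so or .a
use_acml() {
    local prefix="${1%/}"                       # drop a trailing slash, if any
    if ! ls "$prefix"/lib/libacml_mp.* >/dev/null 2>&1; then
        echo "no libacml_mp found under $prefix/lib" >&2
        return 1
    fi
    export ACML_MP_ROOT="$prefix"
    # Prepend so the runtime loader finds this copy first.
    export LD_LIBRARY_PATH="$prefix/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}
```

For example, `use_acml /apps/gcc-4.8.2/acml-gfortran64/` before running `make` would set `ACML_MP_ROOT` for the HPL Makefile above. Uninstalling is then just a matter of deleting the directory.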


          The _fma4 directories were deprecated and removed from ACML 6.  ACML now does a proper CPUID feature check and enables fma4 implementations at runtime.  The documentation is stale; I'll make a note to change that for the next patch release.


          I've heard about this problem with the CPU binding from a few sources.  What can you tell me about the machine you are running on?  Is it AMD hardware or Intel?  Which Linux distro?

            • Re: Installing ACML6 and using threads with HPL

              Thanks for the response.


              Regarding the FMA4 runtime, is it still advisable to use the "-mfma4" compile flag when building for systems that support fma4?


              The machines I'm testing the threading on are of three types: AMD Opteron 2380, 6212, and 6320.  Tests that ran on more than one node (MPI + threads) were run across nodes with the same CPU.  The OS is CentOS 6.5 x86_64.


              The AMD systems all typically showed the CPU load at 100% on a single core, with the rest idle.  The CPU utilization of each thread would be such that all the threads added up to 100%.


              I just ran a test on Intel systems (Xeon E5420, 8 cores per system) and observed the same behavior.


              So far, the behavior observed watching the processes and threads in 'top' has been that once all the threads are created, one core is always at 100% CPU and other cores periodically jump above 0% CPU.  On 8-core AMD and Intel CPUs, the CPU load of all threads always adds up to 100% and alternates between 4 and 8 threads being the most active.  When 4 of the 8 cores showed load above 0%, all 8 threads were at ~20-35% CPU.  When only one core was above 0%, only 4 threads were showing CPU load.  On the 32-core AMD systems the behavior differs slightly.  One core stays at 100%, the other 31 cores sit at 0%, and all 32 threads stay around 3-4% CPU load.


              If I run the exact same test, only linking against "openblas" instead of "acml_mp", I see the expected behavior of all threads at 100% CPU and all cores at 100% load.  HPL using ACML and threads on a single 8-core system had a runtime of 255 seconds and reached only 11.5 Gflops.  The same test using OpenBLAS had a 47-second runtime and reached 62.5 Gflops.  When we tested our systems in the past (CentOS 5.7) using ACML 5.3.0, we always saw much better performance with ACML.
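To put the two results side by side, the ratios can be computed from the figures quoted above. This is only arithmetic on the reported numbers; the helper name `ratio` is made up for illustration.

```shell
#!/usr/bin/env bash
# Divide two numbers and print the result with two decimals.
ratio() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'; }

ratio 62.5 11.5   # Gflops, OpenBLAS over ACML
ratio 255 47      # runtime in seconds, ACML over OpenBLAS
```

Both come out at about 5.4, so the Gflops and runtime figures agree with each other: the ACML 6 build was running roughly 5.4 times slower than the OpenBLAS build on that node.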


              Also, I'm seeing odd behavior when trying to run HPL with multiple processes per node and only 1 thread.  I set OMP_NUM_THREADS=1, and even tried linking HPL against "acml" (not acml_mp), then ran my command and saw that 8 processes run at 100% on all 8 cores, but I also observed that each process spawns 3 additional threads.  They don't induce enough load to be a concern, but I want to be sure I understand the behavior before I document for our users how to use ACML with their applications.



              - Trey

                • Re: Installing ACML6 and using threads with HPL

                  I'm trying to reproduce this in my lab now.

                  • Re: Installing ACML6 and using threads with HPL

                    Hi treydock~


                    I forgot to answer your question about fma4.  ACML 6 is built with feature flag detection, so we can detect which optimized codepath to take at runtime.  The fma4 compile flag you mention above affects 'your' code, and generates fma4 instructions in your binary.  If you have no need for your code to be portable across other non-AMD platforms, this is OK and should be a performance boost for you.
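Since `-mfma4` puts FMA4 instructions into your own binary, it can be worth checking that every node the binary will run on actually advertises the feature; a binary built with it will typically die with an illegal-instruction error on a CPU that lacks FMA4. A minimal sketch for Linux, assuming the usual `flags` line in `/proc/cpuinfo` (the function name `cpu_has_flag` is hypothetical):

```shell
#!/usr/bin/env bash
# Report whether a CPU feature flag is advertised in a cpuinfo-style file.
# Usage: cpu_has_flag FLAG [FILE]   (FILE defaults to /proc/cpuinfo)
cpu_has_flag() {
    local flag="$1" file="${2:-/proc/cpuinfo}"
    # Look only at the first "flags" line and match the flag as a whole word.
    grep -m1 '^flags' "$file" 2>/dev/null | grep -qw -- "$flag"
}

if cpu_has_flag fma4; then
    echo "fma4 advertised: -mfma4 binaries should run on this node"
else
    echo "fma4 not advertised: avoid -mfma4 for code that must run here"
fi
```

Running this on one node of each CPU type in a mixed cluster shows whether a single `-mfma4` build is safe to share between them. It says nothing about ACML itself, which per the reply above picks its code path at runtime.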


                    I have a machine set up to attempt to reproduce this binding problem.  It's a 48-core AMD machine with 6175 SE processors.


                    Can you tell me exactly how you are running HPL?  I want to make sure that I am reproducing what you describe.  Here is what I see:


                    export OMP_NUM_THREADS=12

                    mpirun -np 4 ./xhpl

                    which appears to load about 4 cores with computation.


                    I also tried

                    export OMP_NUM_THREADS=1

                    mpirun -np 48 ./xhpl

                    and this appears to load 1 core with computation.


                    But, if I try

                    export OMP_NUM_THREADS=4

                    mpirun -np 12 ./xhpl

                    I still appear to be running on only 4 cores.
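All three layouts above keep ranks times threads equal to 48, so each should be able to load every core on that machine; seeing 4 cores or 1 core busy is the symptom. A small sketch that derives the rank count from the thread count (the helper `ranks_for` is hypothetical, and the loop only prints the launch lines rather than running them):

```shell
#!/usr/bin/env bash
# Number of MPI ranks needed so that ranks * threads fills the node exactly.
# Fails if the core count is not a multiple of the thread count.
ranks_for() {
    local cores="$1" threads="$2"
    if [ "$threads" -le 0 ] || [ $((cores % threads)) -ne 0 ]; then
        echo "cores ($cores) not divisible by threads ($threads)" >&2
        return 1
    fi
    echo $((cores / threads))
}

# Print (not run) the launch lines for the three layouts tried on 48 cores.
for t in 12 1 4; do
    echo "OMP_NUM_THREADS=$t mpirun -np $(ranks_for 48 "$t") ./xhpl"
done
```

Keeping the arithmetic in one place makes it harder to oversubscribe or undersubscribe a node by accident when comparing layouts.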


                    If I may ask, do you have a handy way to see which threads are binding to which core?  I am using 'htop' with the 't' and 'H' options, but I don't see all the threads that I expect.  It may not refresh fast enough.
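One way to answer the "which thread is on which core" question without relying on a screen refresh is to read the per-thread files under `/proc` directly. The sketch below is Linux-only and assumes the documented layout of `/proc/<pid>/task/<tid>/stat` (field 39 is the core the thread last ran on) and the `Cpus_allowed_list` line in `status`; the function name `thread_cores` is made up for illustration.

```shell
#!/usr/bin/env bash
# List every thread of a process with the core it last ran on and the
# set of cores it is allowed to run on (Linux /proc only).
# Usage: thread_cores PID
thread_cores() {
    local pid="$1" task tid stat rest allowed
    local -a f
    printf '%-8s %-5s %s\n' TID CORE ALLOWED
    for task in /proc/"$pid"/task/*; do
        [ -r "$task/stat" ] || continue            # thread may have exited
        tid=${task##*/}
        stat=$(cat "$task/stat" 2>/dev/null) || continue
        rest=${stat##*') '}                        # drop "pid (comm) "
        read -r -a f <<< "$rest"
        # stat field 39 (processor) is index 36 once fields 1-2 are dropped
        allowed=$(awk '/^Cpus_allowed_list:/ {print $2}' "$task/status" 2>/dev/null)
        printf '%-8s %-5s %s\n' "$tid" "${f[36]}" "$allowed"
    done
}

thread_cores "$$"   # demo on the current shell; pass an xhpl PID instead
```

If every thread of an `xhpl` process shows the same single core in the ALLOWED column, the threads are pinned there and cannot spread out, whatever the thread count. Where procps is installed, `ps -L -o tid,psr,pcpu,comm -p <pid>` gives a similar per-thread view.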

                    • Re: Installing ACML6 and using threads with HPL

                      Hello treydock.


                      Do you have any idea what spawns those "ghost" threads, or how?

                      I'm having the same problem, not only with HPL but with every application: each process creates 2 or 3 threads even though I've set it not to.  So far I can't tell whether they're hurting the program's performance either.


                      Any idea?



                      • Re: Installing ACML6 and using threads with HPL

                        Try using the new release of ACML 6.0.6; I've removed all the code that bound threads to cores, so I'm curious if you find the performance profile changing on your systems.



                          • Re: Installing ACML6 and using threads with HPL



                            Thank you!  I just did a small-scale HPL run using 6.0.6 vs OpenBLAS, and with a tiny problem size in HPL the ACML-compiled code runs faster than OpenBLAS and loads all cores.  With a larger problem size in HPL, all cores reach 100% load as expected.  It appears your fix for 6.0.6 worked.


                            In reference to Crash's response here are some observations:


                            With HPL I notice that if I run with 1 process per node and set OMP_NUM_THREADS=32 on a 32-core system, the run starts as a single process with 3 additional threads.  Those threads never show any CPU consumption.  Once HPL has filled the memory it needs, all the compute threads spawn; the 3 extra threads are still around, but still appear to do nothing.
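To watch the thread count change over a run like the one described above, the kernel's own count can be sampled from `/proc/<pid>/status`. This is a Linux-only sketch with hypothetical helper names (`thread_count`, `watch_threads`); it shows when threads appear but does not identify which library created them.

```shell
#!/usr/bin/env bash
# Print the kernel's thread count for a process (Linux /proc only).
thread_count() {
    awk '/^Threads:/ {print $2}' "/proc/$1/status"
}

# Sample the count once a second to see when threads appear.
# Usage: watch_threads PID [SAMPLES]   (SAMPLES defaults to 5)
watch_threads() {
    local pid="$1" n="${2:-5}" i
    for ((i = 0; i < n; i++)); do
        printf '%s threads=%s\n' "$(date +%T)" "$(thread_count "$pid")"
        sleep 1
    done
}
```

Started right after launch, something like `watch_threads <xhpl pid> 60` should show the small initial count first and then the jump once the compute threads are created.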


                            - Trey