
    ACML 5.1.0 and transparent hugepages

    mattiass

Hi!

In short: on RHEL 6.2 (CentOS) with transparent hugepages enabled, I get bad performance from (e.g.) dgemm or zgemm called with certain matrix sizes when using ACML 5.1.0.

Q: Is this a known issue, and if so, are there any general recommendations, and will it be addressed in the upcoming release of ACML?

Longer description and background: I have a system consisting of dual AMD Opteron 6220 processors running CentOS 6.2, where I have noticed unexpectedly bad performance for a specific quantum chemistry code, a commonly used plane-wave DFT code that relies to a large extent on BLAS routines. It mainly boiled down to most of the ACML 5.1.0 routines performing substantially slower than e.g. their ACML 4.4.0 counterparts (the application became some 20-40% slower when linked with 5.1.0 instead of 4.4.0). E.g. the zgemm behavior seemed to contribute significantly to the loss of application performance.

Many matrix sizes seemed to affect zgemm (and also dgemm) particularly badly. One such size is:

    transa=C transb=N m=6 n=6 k=356 lda=356 ldb=356 ldc=40

For example, repeatedly calling dgemm with the above input sizes, running a single thread bound to a single NUMA node, rendered the following call graph (using perf record -g ; perf report -n -g) [trimmed]:

    # Overhead  Samples    Command    Shared Object
        79.53%      909    dgemm-510  [kernel.kallsyms]  [k] clear_page_c
                    |
                    --- clear_page_c
                       |
                       |--99.56%-- do_huge_pmd_anonymous_page
                       |           [...]
                       |          |
                       |          |--51.49%-- dmmkernbd_
                       |          |
                       |           --48.51%-- dmmavxblkb_
                        --0.44%-- [...]
         2.80%       32    dgemm-510  [kernel.kallsyms]  [k] clear_huge_page
                    |
                    --- clear_huge_page
                        do_huge_pmd_anonymous_page
    [...]

Whereas if I disable transparent hugepages on the machine, calling the routine takes something like 1/5th of the time, and the excessive time spent clearing pages in the kernel disappears (compare the "Samples" column):

    # Overhead  Samples    Command    Shared Object
         5.54%        7    dgemm-510  libacml.so         [.] dmmavxblkta_
         4.79%        7    dgemm-510  [kernel.kallsyms]  [k] unmap_vmas
         4.00%        5    dgemm-510  [kernel.kallsyms]  [k] io_serial_in
         3.98%        5    dgemm-510  [kernel.kallsyms]  [k] page_fault
         3.92%        5    dgemm-510  [kernel.kallsyms]  [k] clear_page_c
         3.19%        4    dgemm-510  [kernel.kallsyms]  [k] __inc_zone_state
    [...]

This rather extreme difference between the behavior with and without transparent hugepages is for a specific problem size, but the problem really affects the overall performance of the application. The problem seems strictly related to using ACML version 5.1.0 (I have tried other compilers: ifort, gfortran, pgf90; I have tried linking dynamically and statically; I have tried fiddling with the ACML_FAST_MALLOC variable; ...).
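For reference, a stripped-down reproducer along these lines could look roughly like the sketch below. This is only a sketch, not the exact code I ran: it assumes the standard Fortran BLAS calling convention (dgemm_) as exported by libacml.so, and the file name, compile line and iteration count are illustrative.

    /* dgemm-510.c -- repeatedly call dgemm with the problematic size.
       Sketch only; link against ACML, e.g.
           gcc dgemm-510.c -o dgemm-510 -L<acml>/lib -lacml -lrt
       (paths and flags are illustrative, not the exact ones I used). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Fortran BLAS interface for dgemm, as exported by libacml (or any BLAS). */
    extern void dgemm_(const char *transa, const char *transb,
                       const int *m, const int *n, const int *k,
                       const double *alpha, const double *a, const int *lda,
                       const double *b, const int *ldb,
                       const double *beta, double *c, const int *ldc);

    int main(void)
    {
        /* The problematic size reported above. */
        const int m = 6, n = 6, k = 356, lda = 356, ldb = 356, ldc = 40;
        const double alpha = 1.0, beta = 0.0;
        const int iters = 100000;
        double *a, *b, *c, secs;
        struct timespec t0, t1;
        int i;

        a = malloc((size_t)lda * m * sizeof *a);  /* transa='C': A is lda x m */
        b = malloc((size_t)ldb * n * sizeof *b);  /* transb='N': B is ldb x n */
        c = malloc((size_t)ldc * n * sizeof *c);
        for (i = 0; i < lda * m; i++) a[i] = 1.0;
        for (i = 0; i < ldb * n; i++) b[i] = 1.0;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < iters; i++)
            dgemm_("C", "N", &m, &n, &k, &alpha, a, &lda, b, &ldb, &beta, c, &ldc);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        secs = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        printf("%d calls in %.3f s (%.1f us/call), c[0]=%f\n",
               iters, secs, 1e6 * secs / iters, c[0]);
        free(a); free(b); free(c);
        return 0;
    }

Running something like this bound to one NUMA node (e.g. under numactl) and under perf record -g, once with transparent hugepages enabled and once with them disabled, gives the two profiles shown above.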
Now, some applications may actually benefit from using transparent hugepages, so it does not seem like a clear-cut case to me whether to simply turn them off system-wide (the system is a compute cluster used by hundreds of different users).

Best regards,
Mattias

        • Re: ACML 5.1.0 and transparent hugepages
          mattiass

Sigh! The original message was posted as plain text, but it seems to have been HTML-ized into unreadability...

The original message is attached as a text file.

          • Re: ACML 5.1.0 and transparent hugepages
            chipf

What a good way to start a week!  This is not a known issue; it will take some investigation to see what's going on.

             

I can't estimate a time for a resolution on this; it will depend on when we are able to look into the problem.  We will try to spend time on this soon; this looks like a problem affecting our latest library on our latest processors.

             

Have you tried linking against the fma4 version of the library? It will improve performance on Bulldozer processors when the gemms are heavily used.  Having said that, I do not expect it to resolve the hugepages issue you are reporting.

              • Re: ACML 5.1.0 and transparent hugepages
                mattiass

                Thanks for the reply.

                 

(quoting chipf:)

[...] We will try to spend time on this soon; this looks like a problem affecting our latest library on our latest processors.

Have you tried linking against the fma4 version of the library? It will improve performance on Bulldozer processors when the gemms are heavily used. Having said that, I do not expect it to resolve the hugepages issue you are reporting.

                 

                And let's not forget, on the most popular OS!

                 

Regarding the fma4 version of ACML 5.1.0, that's the one I have been using.

I can confirm that I see no difference w.r.t. transparent hugepages behavior between the fma4 and non-fma4 versions.

                 

                /m

                • Re: ACML 5.1.0 and transparent hugepages
                  mattiass

                  The transparent hugepage (trhp) thing seems to affect different parts of ACML differently.

I did a simple library-call timing using ltrace (VASP, running single-ranked/single-threaded, dynamically linked with ACML) for a specific test case, and summed up the first 1,000,000 calls to ACML-provided routines.

I compared ACML 5.1.0-fma4 with and without trhp enabled, ACML 5.1.0 (no fma4) without trhp, and ACML 4.4.0. The main program was built with PGI 12.3.
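For what it's worth, the same kind of per-call accounting can also be done without ltrace, for example with a small LD_PRELOAD shim that wraps dgemm_ and accumulates the time spent in it. The sketch below is hypothetical (not what I actually used), only covers dgemm_, and assumes a single-threaded run dynamically linked against libacml:

    /* timedgemm.c -- LD_PRELOAD shim that accumulates time spent in dgemm_.
       Sketch only:  gcc -shared -fPIC timedgemm.c -o libtimedgemm.so -ldl -lrt
       then run:     LD_PRELOAD=./libtimedgemm.so ./application
       Not thread-safe; intended for single-threaded runs. */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    typedef void (*dgemm_fn)(const char*, const char*, const int*, const int*,
                             const int*, const double*, const double*, const int*,
                             const double*, const int*, const double*, double*,
                             const int*);

    static dgemm_fn real_dgemm;
    static double total_s;
    static long ncalls;

    static void report(void)
    {
        fprintf(stderr, "dgemm_: %ld calls, %.3f s total\n", ncalls, total_s);
    }

    void dgemm_(const char *ta, const char *tb, const int *m, const int *n,
                const int *k, const double *alpha, const double *a, const int *lda,
                const double *b, const int *ldb, const double *beta,
                double *c, const int *ldc)
    {
        struct timespec t0, t1;

        if (!real_dgemm) {
            /* Look up the next dgemm_ in the link chain, i.e. the real ACML one. */
            *(void **)(&real_dgemm) = dlsym(RTLD_NEXT, "dgemm_");
            atexit(report);
        }
        clock_gettime(CLOCK_MONOTONIC, &t0);
        real_dgemm(ta, tb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        total_s += (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        ncalls++;
    }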

                   

See the attachment for numbers and a few comments.

                  /m