
ACML 5.1.0 and transparent hugepages

Question asked by mattiass on May 21, 2012
Latest reply on May 22, 2012 by mattiass

Hi!

In short: On RHEL 6.2 (CentOS) with transparent hugepages enabled, I get bad performance from (e.g.) dgemm or zgemm called with certain matrix sizes when using ACML 5.1.0.

Q: Is this a known issue, and if so, are there any general recommendations, and will it be addressed in the next upcoming release of ACML?

Longer description and background: I have a system consisting of dual AMD Opteron 6220 processors, running CentOS 6.2, where I have noticed unexpectedly bad performance for a specific quantum chemistry code: a commonly used plane-wave DFT code that relies to a large extent on BLAS routines. It mainly boiled down to most of the ACML 5.1.0 routines performing substantially slower than e.g. their ACML 4.4.0 counterparts (the application became some 20-40% slower when linked with 5.1.0 instead of 4.4.0). The zgemm behavior, for example, seemed to contribute significantly to the loss of application performance.

Many matrix sizes seemed to affect zgemm (and also dgemm) particularly badly. One such size is

   transa=C transb=N m=6 n=6 k=356 lda=356 ldb=356 ldc=40

(a minimal reproducer along these lines is sketched further down).

Repeatedly calling dgemm with the above input sizes, running a single thread bound to a single NUMA node, rendered the following call graph (using perf record -g ; perf report -n -g) [trimmed]:

# Overhead  Samples    Command                Shared Object
    79.53%        909  dgemm-510  [kernel.kallsyms]            [k] clear_page_c
                |
                --- clear_page_c
                   |
                   |--99.56%-- do_huge_pmd_anonymous_page
                   |          [...]
                   |          |
                   |          |--51.49%-- dmmkernbd_
                   |          |
                   |           --48.51%-- dmmavxblkb_
                    --0.44%-- [...]

     2.80%         32  dgemm-510  [kernel.kallsyms]            [k] clear_huge_page
                |
                --- clear_huge_page
                    do_huge_pmd_anonymous_page
                    [...]

Whereas if I disable transparent hugepages on the machine, calling the routine takes something like 1/5th of the time and the excessive time spent on clearing pages in the kernel has disappeared (compare the "Samples" column):

# Overhead  Samples    Command                Shared Object
     5.54%          7  dgemm-510  libacml.so                   [.] dmmavxblkta_
     4.79%          7  dgemm-510  [kernel.kallsyms]            [k] unmap_vmas
     4.00%          5  dgemm-510  [kernel.kallsyms]            [k] io_serial_in
     3.98%          5  dgemm-510  [kernel.kallsyms]            [k] page_fault
     3.92%          5  dgemm-510  [kernel.kallsyms]            [k] clear_page_c
     3.19%          4  dgemm-510  [kernel.kallsyms]            [k] __inc_zone_state
[...]

This rather extreme difference in behavior with and without transparent hugepages is for one specific problem size, but the problem really affects the overall performance of the application. The problem seems strictly related to using ACML version 5.1.0 (I have tried other compilers: ifort, gfortran, pgf90; I have tried linking dynamically and statically; I have tried fiddling with the ACML_FAST_MALLOC variable; ...).
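For reference, here is a minimal reproducer along the lines of what produced the profiles above, repeatedly calling dgemm with the quoted sizes. This is just a sketch: the iteration count and matrix contents are arbitrary, and it assumes linking against ACML's Fortran BLAS entry point dgemm_ (e.g. gcc repro_dgemm.c -L<acml>/lib -lacml -lm).

  /* repro_dgemm.c -- minimal sketch of the problem case described above. */
  #include <stdio.h>
  #include <stdlib.h>

  /* Fortran calling convention: everything passed by reference.  Hidden
   * string-length arguments are omitted; BLAS only inspects the first
   * character, which works with the usual x86-64 conventions. */
  extern void dgemm_(const char *transa, const char *transb,
                     const int *m, const int *n, const int *k,
                     const double *alpha, const double *a, const int *lda,
                     const double *b, const int *ldb,
                     const double *beta, double *c, const int *ldc);

  int main(void)
  {
      const int m = 6, n = 6, k = 356;
      const int lda = 356, ldb = 356, ldc = 40;
      const double alpha = 1.0, beta = 0.0;
      double *a, *b, *c;
      int i, iter;

      /* With transa='C' A is k-by-m, with transb='N' B is k-by-n;
       * C is m-by-n but stored with leading dimension ldc=40. */
      a = malloc((size_t)lda * m * sizeof *a);
      b = malloc((size_t)ldb * n * sizeof *b);
      c = malloc((size_t)ldc * n * sizeof *c);
      if (!a || !b || !c)
          return 1;

      for (i = 0; i < lda * m; i++) a[i] = 1.0 / (i + 1);
      for (i = 0; i < ldb * n; i++) b[i] = 1.0 / (i + 2);
      for (i = 0; i < ldc * n; i++) c[i] = 0.0;

      for (iter = 0; iter < 100000; iter++)
          dgemm_("C", "N", &m, &n, &k, &alpha, a, &lda,
                 b, &ldb, &beta, c, &ldc);

      printf("c[0] = %g\n", c[0]);
      free(a); free(b); free(c);
      return 0;
  }

I ran this kind of loop single-threaded, pinned to one NUMA node (e.g. under numactl --cpunodebind=0 --membind=0), with transparent hugepages enabled and disabled respectively.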
Now, some applications may actually benefit from using transparent hugepages, so to me it doesn't seem like a clear-cut case whether to simply turn them off system-wide (the system is a compute cluster used by hundreds of different users).

Best regards,
Mattias
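PS: A possible middle ground I have been wondering about (just a sketch, not something I have tried on this system) would be to leave the system-wide THP setting alone and instead have the affected application opt its own allocations out of transparent hugepages with madvise(MADV_NOHUGEPAGE). The buffer size below is only an example, and the MADV_NOHUGEPAGE fallback definition is there in case the libc headers on the machine predate THP.

  /* Sketch only: allocate a page-aligned buffer and ask the kernel not to
   * back it with transparent hugepages. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/mman.h>

  #ifndef MADV_NOHUGEPAGE
  #define MADV_NOHUGEPAGE 15  /* value from the Linux kernel headers */
  #endif

  static void *alloc_without_thp(size_t bytes)
  {
      void *p = NULL;
      long pagesize = sysconf(_SC_PAGESIZE);

      /* madvise needs a page-aligned address, so use posix_memalign. */
      if (posix_memalign(&p, (size_t)pagesize, bytes) != 0)
          return NULL;

      /* Opt this region out of transparent hugepages.  If the call fails
       * (e.g. no THP support), the memory is still usable as-is. */
      if (madvise(p, bytes, MADV_NOHUGEPAGE) != 0)
          perror("madvise(MADV_NOHUGEPAGE)");

      return p;
  }

  int main(void)
  {
      /* Example: workspace on the order of the 356x356 operands above. */
      double *work = alloc_without_thp(3 * 356 * 356 * sizeof(double));
      if (work == NULL)
          return 1;
      /* ... hand `work` to the BLAS calls ... */
      free(work);
      return 0;
  }

Whether patching allocations like this is practical for a third-party code is another question, of course, which is partly why I am asking whether this is a known issue in ACML 5.1.0.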
