My C code calls ACML 4.4.0 dsbtrd (a LAPACK routine) with 1-4 OpenMP threads on a quad-core Opteron ("Budapest"). I'm linking against gfortran64_mp/libacml_mp.so. When I replace the dsbtrd call with a dgemm call, I see the strong scaling you'd expect going from 1 to 4 OpenMP threads. But when I call dsbtrd, performance stays essentially constant as I add threads, over all problem sizes I tested (N=10:10000, KD=5:600).
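To make "scaling" concrete, here is a minimal sketch of how speedup and parallel efficiency could be computed from wall-clock timings. The helper names `speedup` and `efficiency` are hypothetical and not part of ACML or OpenMP; the sketch assumes t1 is the measured time on 1 thread and tp the time on p threads for the same problem size.

```c
/* Hypothetical helpers for quantifying strong scaling from measured
   wall-clock times; not part of ACML or OpenMP. */

/* Speedup on p threads relative to the 1-thread run: S(p) = t1 / tp. */
double speedup(double t1, double tp)
{
    return t1 / tp;
}

/* Parallel efficiency: E(p) = S(p) / p.
   1.0 means ideal strong scaling; 1/p means no benefit from extra threads. */
double efficiency(double t1, double tp, int p)
{
    return speedup(t1, tp) / (double)p;
}
```

With these definitions, a routine whose run time does not change with thread count has a speedup of 1 and an efficiency of 1/p, i.e. 0.25 on 4 threads.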
The ACML release notes claim that dsbtrd was parallelized with OpenMP in version 3.6.0. I confirmed that the library's dsbtrd does contain OpenMP code by deliberately breaking my link line (removing -fopenmp) and getting link errors ("GOMP_parallel_for", etc.) coming from dsbtrd within libacml_mp.so.
Has anyone seen ACML's dsbtrd actually get faster (scale) as the number of OpenMP threads increases?