I am having a problem when using multithreaded ACML routines within an MPI parallel code on an AMD Bulldozer (4x8 cores) shared-memory system. The problem is that all the ACML threads within the various MPI processes are bound to the same core.
Let me illustrate this with an example.
Assume I link my MPI program with the sequential ACML library and launch it with 16 MPI processes. If I run the "top" command, I see 16 processes evenly spread out among the available cores, i.e., each running on a different core.
Then I redo the same experiment but this time I link with the threaded ACML library and run with OMP_NUM_THREADS=1 and still 16 MPI processes. Now in "top" I see 16 processes all running on the same core.
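Since "top" only shows where a process last ran, one way to confirm that the processes are really pinned is to read their affinity masks from outside while the job is running. Below is a minimal sketch, assuming Linux and Python 3; the helper name thread_affinities is my own and is not part of ACML or MPI. It takes the PIDs of the MPI processes (e.g. as shown by "top") and prints the set of CPUs each of their threads is allowed to run on.

```python
import os
import sys


def thread_affinities(pid):
    """Return {tid: sorted list of allowed CPU ids} for every thread of pid."""
    result = {}
    # Each thread of a process appears as a directory under /proc/<pid>/task.
    for tid in os.listdir(f"/proc/{pid}/task"):
        # On Linux, sched_getaffinity accepts a thread id as well as a pid.
        result[int(tid)] = sorted(os.sched_getaffinity(int(tid)))
    return result


if __name__ == "__main__":
    # PIDs of the running MPI processes; falls back to this script's own PID.
    pids = [int(a) for a in sys.argv[1:] if a.isdigit()] or [os.getpid()]
    for pid in pids:
        for tid, cpus in thread_affinities(pid).items():
            print(f"pid {pid} tid {tid} allowed cpus {cpus}")
```

If every thread of every process reports the same single CPU, the processes are genuinely bound to that core; if they report the full range of cores, the clustering seen in "top" comes from scheduling rather than binding.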
I assume this has something to do with the way ACML binds threads to cores. Am I doing anything wrong or is this a problem in ACML?