I am benchmarking a hybrid MPI-OpenMP code we developed, on several platforms and with several compilers. With opencc (4.5.1-1, AMD patched, both built from source and pre-compiled) all the OpenMP threads land on the same core, which obviously leads to poor performance. This happens on a two-socket machine with Opteron 6128 CPUs (16 cores in total), using OpenMPI (versions 1.4.4 and 1.6) on an up-to-date Ubuntu server.
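For reference, this is the kind of check I use to see the per-thread placement from the command line (in addition to htop). The `PSR` column is the core each thread is currently running on; here it is demonstrated on the current shell just so the command is self-contained, and in practice I substitute the PID of the running instance.

```shell
# List each thread (TID) of a process with the core (PSR) it runs on.
# Replace $$ with the PID of the benchmark process to inspect it.
ps -Lo tid,psr,comm -p $$
```

When the problem occurs, every thread of the process shows the same PSR value.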
However, this issue does not occur with any other compiler I tested: GNU gcc 4.6, Intel icc 12.0, and even the community-developed Open64 opencc 5.0.
Attached is a sample code that reproduces the incorrect placement, along with two htop snapshots showing where the instances (wrk) land when this code is run with two threads and two processes.
AMD engineers recommended your compiler to us for best results, and we would obviously love to publish its best numbers.
Do you have any advice?