A couple of things could be going on. First, you may need to set OMP_NUM_THREADS to the number of cores available on the system. If "top" shows all threads busy, then that part is working as expected. But even if it is running with just one thread, it shouldn't run slower than the single-threaded version.
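As a quick sanity check of the first point, here is a minimal sketch that compares OMP_NUM_THREADS with the CPU count the OS reports. The function name check_omp_threads and its return labels are made up for illustration, and os.cpu_count() reports logical CPUs, which may not equal the number of physical cores on every system.

```python
import os


def check_omp_threads(env=None, cores=None):
    """Compare OMP_NUM_THREADS with the number of CPUs the OS reports.

    Hypothetical helper for illustration only. Returns one of:
    "unset", "invalid", "fewer", "more", "match".
    """
    env = os.environ if env is None else env
    # Assumption: logical CPU count is a reasonable stand-in for "cores".
    cores = (os.cpu_count() or 1) if cores is None else cores

    raw = env.get("OMP_NUM_THREADS")
    if raw is None:
        # The OpenMP runtime then chooses its own default thread count.
        return "unset"

    try:
        # OMP_NUM_THREADS may be a comma-separated list for nested
        # parallelism; only the outermost level is checked here.
        requested = int(raw.split(",")[0])
    except ValueError:
        return "invalid"

    if requested < cores:
        return "fewer"
    if requested > cores:
        return "more"
    return "match"


if __name__ == "__main__":
    print("OMP_NUM_THREADS =", os.environ.get("OMP_NUM_THREADS"))
    print("CPUs reported   =", os.cpu_count())
    print("result          =", check_omp_threads())
```

Run it in the same shell (and under the same numactl wrapper, if any) that launches the benchmark, so it sees the same environment.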
If the kernel is not the latest, you might try turning off address space randomization, as shown in this gcc wiki article:
With 5.2.0, you can also try using the "non-fma4" library. The library in ifort4_mp/lib will use FMA4 GEMM kernels if it detects the FMA4 instruction set, which might reduce the need for a Bulldozer-specific build. This by itself won't solve the problem, but it might make configuration a bit easier once the performance issue is resolved.
I tried multiple thread counts (1, 2, 8, and 64 cores) and affinity settings (with numactl). None of that made a difference.
ASR was turned off (set to zero) in sysctl.conf. (We were running a RHEL6.2 2.6.32-23 kernel that works well with ACML5.2/fma4 with HPL.)
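Since sysctl.conf is only applied at boot or when sysctl reloads it, it can be worth confirming the value the running kernel actually uses. Here is a minimal sketch, assuming a Linux system that exposes the setting at /proc/sys/kernel/randomize_va_space; the helper names describe_aslr and read_aslr are made up for illustration.

```python
from pathlib import Path

# Standard Linux location of the address space randomization setting.
ASLR_PATH = Path("/proc/sys/kernel/randomize_va_space")

# Meaning of the values as documented for the Linux kernel.
MEANING = {
    0: "disabled",
    1: "partial (stack, mmap base, VDSO)",
    2: "full (partial plus heap)",
}


def describe_aslr(text):
    """Translate the raw file contents into a short description."""
    value = int(text.strip())
    return MEANING.get(value, "unknown")


def read_aslr(path=ASLR_PATH):
    """Read the live setting; return "unavailable" if it cannot be read."""
    try:
        return describe_aslr(path.read_text())
    except (OSError, ValueError):
        return "unavailable"


if __name__ == "__main__":
    print("address space randomization:", read_aslr())
```

If this prints anything other than "disabled", the sysctl.conf entry has not taken effect on the running system.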
I did not try the non-fma4 library. It seems like you think the non-fma4 library won't cause any change, but I'll try it if you think it is worth it.
Thanks for the response.