I have been getting good results with acml_mp on an 8-core Opteron system, calling dsyev on matrices of roughly 5000 x 5000, and I can see multiple threads running in htop.
I tried switching to ssyev to see whether single precision would speed up the calculation, but the routine now appears to use only a single thread and takes roughly twice as long to run. Is this normal? I can understand that single-precision arithmetic on a 64-bit machine might not run any faster, but surely it should still be able to exploit multiple threads?
I have downloaded the newest version of ACML and double-checked that I am linking against the mp version of the library.
(Checked again, and to clarify: it does spawn multiple threads, but only one of them appears to be doing any actual work.)
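
To put a number on "only one thread is doing any work" rather than eyeballing htop, one option is to compare process CPU time with wall-clock time around the call. The following is only a minimal sketch, assuming a POSIX system that provides clock_gettime. The names cpu_to_wall_ratio and busy_loop are hypothetical helpers made up for illustration, not part of ACML; busy_loop is a single-threaded stand-in for the place where the dsyev or ssyev call would go.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* request clock_gettime from <time.h> */
#endif
#include <time.h>

/* Read the given clock and return its value in seconds. */
static double clock_seconds(clockid_t id)
{
    struct timespec ts;
    clock_gettime(id, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/*
 * Run fn(arg) once and return (process CPU time) / (wall-clock time).
 * CLOCK_PROCESS_CPUTIME_ID sums CPU time over all threads of the process,
 * so a ratio near 1 suggests one busy thread, while a ratio near N
 * suggests N threads were busy for most of the call.
 */
double cpu_to_wall_ratio(void (*fn)(void *), void *arg)
{
    double wall0 = clock_seconds(CLOCK_MONOTONIC);
    double cpu0  = clock_seconds(CLOCK_PROCESS_CPUTIME_ID);

    fn(arg);

    double cpu1  = clock_seconds(CLOCK_PROCESS_CPUTIME_ID);
    double wall1 = clock_seconds(CLOCK_MONOTONIC);

    double wall = wall1 - wall0;
    return wall > 0.0 ? (cpu1 - cpu0) / wall : 0.0;
}

/*
 * Hypothetical stand-in workload: a single-threaded loop.
 * In the real test this would be a wrapper that calls dsyev or ssyev.
 * arg points to a long holding the iteration count.
 */
void busy_loop(void *arg)
{
    long n = *(long *)arg;
    volatile double x = 0.0;   /* volatile keeps the loop from being optimized away */
    for (long i = 0; i < n; i++) {
        x += (double)i * 0.5;
    }
}
```

Wrapping the dsyev call and the ssyev call in turn and comparing the two ratios should show whether the single-precision routine really is running on one core: on the 8-core machine, a well-threaded call would be expected to give a ratio well above 1, and an effectively serial one a ratio close to 1.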