I tried to benchmark the DGEMM and SGEMM routines of the 32-bit ACML OpenMP library (the build compiled with a Fortran compiler) on a machine with two 4-core Intel Xeon 5420 CPUs running 32-bit Windows XP SP3.
Versions 4.2.0 and 4.1.0 did not run at all.
Version 4.0.0 ran, and was surprisingly faster than Intel MKL 10.1.0.018 on larger matrices (256x256 for SGEMM and 2048x2048 for DGEMM). ACML's SGEMM on 4096x4096 matrices was nearly 5 times faster than MKL with 8 threads!
I experimented with different thread counts by calling the omp_set_num_threads function before calling the DGEMM or SGEMM routine, but saw no difference in ACML's performance. Setting the OMP_NUM_THREADS environment variable made no difference either. MKL's performance always changed with the thread count, so my code should be OK.
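A thread-count test of this kind can be sketched in C as follows. This is only a minimal sketch, not the actual benchmark: gemm_stub is a hypothetical stand-in for the library's DGEMM call (a naive matrix multiply), and time_gemm and threads_in_use are illustrative helper names. Only omp_set_num_threads, omp_get_num_threads and omp_get_wtime are real OpenMP calls.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical stand-in for the library DGEMM call: naive C = A * B for
   n x n row-major matrices. A real test would call the ACML or MKL routine. */
void gemm_stub(int n, const double *a, const double *b, double *c)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Request a thread count, then report how many threads a parallel region
   in this program actually gets. Returns 1 when built without OpenMP. */
int threads_in_use(int requested)
{
    int seen = 1;
#ifdef _OPENMP
    omp_set_num_threads(requested);
    #pragma omp parallel
    {
        #pragma omp single
        seen = omp_get_num_threads();
    }
#else
    (void)requested;
#endif
    return seen;
}

/* Wall-clock time with OpenMP; CPU time from clock() as a fallback. */
double now_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Set the thread count, run one multiply, and return the elapsed seconds
   (or -1.0 if memory allocation fails). */
double time_gemm(int n, int threads)
{
    size_t count = (size_t)n * (size_t)n;
    double *a = malloc(count * sizeof *a);
    double *b = malloc(count * sizeof *b);
    double *c = malloc(count * sizeof *c);
    if (!a || !b || !c) {
        free(a); free(b); free(c);
        return -1.0;
    }
    for (size_t i = 0; i < count; ++i) {
        a[i] = 1.0;
        b[i] = 2.0;
    }
#ifdef _OPENMP
    omp_set_num_threads(threads);   /* must come before the timed call */
#else
    (void)threads;
#endif
    double start = now_seconds();
    gemm_stub(n, a, b, c);
    double elapsed = now_seconds() - start;
    free(a); free(b); free(c);
    return elapsed;
}
```

Calling time_gemm(1024, t) for t = 1, 2, 4, 8 and printing the results gives the sweep. Note that threads_in_use only reports what the OpenMP runtime linked into the test program does; if the timings of the real library call stay flat while it reports the requested count, that would suggest the setting is not reaching the library, though this sketch cannot confirm why.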
Any ideas on how to set the number of threads for ACML 4.0.0 on an Intel Xeon 5420?