I have a simple C code that solves the Navier-Stokes equations on a 1x1 grid using finite elements. It uses the LAPACK routine {S,D}GESV to solve a linear system of equations. I compiled it against the ACML library, and I'm finding it to be twice as slow as the reference LAPACK/BLAS implementations.

I've tried both the single- and double-precision functions, as well as the 32-bit ACML (compiled against gcc 4.1 on an Athlon 64 X2 5200+) and the 64-bit ACML (compiled against gcc 3.4 on an Opteron 252), and in every instance the reference implementation is at least twice as fast.

I'm just using ACML as a drop-in replacement. Is there something else I should be doing?

I ran a performance comparison of ACML 3.6.1 (gfortran 4.1.2) against Netlib on a 64-bit Linux Opteron 275 machine and plotted the results. I'll attach the plot if I can figure out how.

The plot shows that ACML is faster for problem sizes N larger than 16, and very much faster for N larger than 512.

That's when run with N=NRHS, i.e. for square problems.

Be careful of link order. DGESV has many dependencies and I found it a minor challenge in my test harness to isolate ACML vs. Netlib (mainly because my test harness has other dependencies on ACML).

Try running the performance examples for DGETRF. They can easily be modified to add a call to DGETRS, or to replace both calls with a single call to DGESV.

If you're still having problems and can extract a small example program that demonstrates the issue, please send it to the tech support email.

Here's the start of the results I got:

Netlib

#    N    Time (secs.)    Mflops
    16    0.00002952       370.0
    32    0.00015070       579.8
    48    0.00040415       729.7
    64    0.00082586       846.5
    80    0.00155701       876.9
    96    0.00256218       920.8
   112    0.00395450       947.4
   500    0.48200178       691.6
  1000    4.57804108       582.5

ACML 3.6.1

#    N    Time (secs.)    Mflops
    16    0.00002398       455.6
    32    0.00010427       838.0
    48    0.00026211      1125.2
    64    0.00048259      1448.5
    80    0.00091302      1495.4
    96    0.00144796      1629.4
   112    0.00204141      1835.2
   500    0.12738490      2616.7
  1000    0.87634659      3042.9