I'm running a simple test that calls dgels to solve a dense matrix problem on an AMD EPYC 7313, Ubuntu 23.04 amd64.
Matrix dimensions
m = 20000, n = 18000, k = 16000
With MKL's dgels the solve takes ~40s
With libflame dgels the solve takes ~1m 45s, with the best configuration I was able to find experimentally:
./configure --enable-dynamic-build --enable-amd-flags --enable-lapack2flame --enable-multithreading=openmp --enable-supermatrix --enable-vector-intrinsics=sse --enable-memory-alignment=8 --enable-ldim-alignment --enable-amd-opt
and FLA_NUM_THREADS=8
I've also tried the pre-compiled libflame from the AMD AOCL downloads page (https://www.amd.com/en/developer/aocl.html), using both the gcc and aocc variants (installed via deb). Both performed slightly worse than the custom-compiled libflame.
A few questions:
- Are the compilation options I used above the most optimized for dgels operations?
- What are the compilation options used for the official releases?
- Would it be more performant to use a libflame function like FLA_LU_piv_solve() rather than using the lapack2flame compatibility layer?
- Related to (2): dgels does not seem to be supported in lapack2flame according to the libflame documentation (https://github.com/amd/libflame/blob/master/docs/libflame/libflame.pdf, page 250). Does that mean the dgels implementation is not optimized? It looks like an f2c port of the netlib implementation: https://github.com/amd/libflame/blob/master/src/map/lapack2flamec/f2c/c/dgels.c
On another note - libblis-mt.so is awesome and runs dgemm (on the above-mentioned matrix sizes) 1 second faster than MKL's BLAS (10s AMD BLIS vs. 11s MKL). libblis.so (non-multithreaded) was so slow that I had to abort the run.