Server Processors

igrqb · ‎06-14-2023

I'm running a simple test that calls dgels to solve a dense matrix problem on an AMD EPYC 7313, Ubuntu 23.04 amd64.

Matrix dimensions

m = 20000, n = 18000, k = 16000

With MKL's dgels the solve takes ~ 40s

With libflame dgels the solve takes ~1m 45s, with the best configuration I was able to find experimentally:

./configure --enable-dynamic-build --enable-amd-flags --enable-lapack2flame --enable-multithreading=openmp --enable-supermatrix --enable-vector-intrinsics=sse --enable-memory-alignment=8 --enable-ldim-alignment --enable-amd-opt

and FLA_NUM_THREADS=8

I've also tried the pre-compiled libflame directly from the AMD AOCL downloads page (https://www.amd.com/en/developer/aocl.html), using both the gcc and aocc variants (installed via deb). Both performed slightly worse than the custom-compiled libflame.

A few questions:

1. Are the compilation options I used above the most optimized for dgels operations?
2. What are the compilation options used for the official releases?
3. Would it be more performant to use a libflame function like FLA_LU_piv_solve() rather than using the lapack2flame compatibility layer?
4. Related to (2) dgels seems to be not supported in lapack2flame according to the libflame documentation - https://github.com/amd/libflame/blob/master/docs/libflame/libflame.pdf page 250. So is the dgels implementation not optimized? It seems it is an f2c port of the netlib implementation? https://github.com/amd/libflame/blob/master/src/map/lapack2flamec/f2c/c/dgels.c

On another note - libblis-mt.so is awesome and runs dgemm (on the above-mentioned matrix sizes) faster than MKL's BLAS by 1 second (10s AMD BLIS, 11s MKL). libblis.so (non-multithreaded) was slow, so I had to abort the run.
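To put the timings quoted above on a common scale, here is a rough sketch that converts them to throughput. It assumes the usual 2·m·n·k operation count for dgemm. For dgels it counts only a Householder QR factorization of an m × n matrix (2·m·n² − (2/3)·n³ operations) and ignores the work on the right-hand sides. Treating the dgels matrix as m × n is an assumption, since the post does not say how m, n and k map onto the dgels arguments. The helper name `gflops` is illustrative.

```python
def gflops(flop_count, seconds):
    """Convert an operation count and a wall-clock time into GFLOP/s."""
    return flop_count / seconds / 1e9


# Dimensions quoted in the post.
m, n, k = 20000, 18000, 16000

# dgemm: C (m x n) += A (m x k) * B (k x n) costs about 2*m*n*k flops.
dgemm_flops = 2 * m * n * k

# Householder QR of an m x n matrix (m >= n): about 2*m*n^2 - (2/3)*n^3 flops.
# ASSUMPTION: the dgels coefficient matrix is m x n; right-hand-side work is ignored.
qr_flops = 2 * m * n**2 - (2 * n**3) / 3

timings = {
    "dgemm, BLIS (10 s)": (dgemm_flops, 10.0),
    "dgemm, MKL (11 s)": (dgemm_flops, 11.0),
    "dgels, MKL (40 s)": (qr_flops, 40.0),
    "dgels, libflame (105 s)": (qr_flops, 105.0),
}

for label, (flops, seconds) in timings.items():
    print(f"{label}: {gflops(flops, seconds):.1f} GFLOP/s")
```

Because the dgels figures leave out the right-hand-side work, they are lower bounds on the rate actually achieved.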

shrjoshi · ‎06-15-2023

Hello @igrqb

Thank you for writing to us.
Please find the replies to your queries below.

1. Are the compilation options I used above the most optimised for dgels operations?
The recommended configuration options are described in the AOCL User Guide. The following is sufficient to enable the optimal flags for AMD CPUs:
$ ./configure --enable-amd-flags

2. What are the compilation options used for the official releases?
The same as the one recommended above.

3. Would it be more performant to use a libflame function like FLA_LU_piv_solve() rather than using the lapack2flame compatibility layer?
The AMD fork of libflame has mostly focused on improvements and optimizations for the standard LAPACK interface, so we recommend using that interface rather than the FLAME interfaces.

4. Related to (2) dgels seems to be not supported in lapack2flame according to the libflame documentation - https://github.com/amd/libflame/blob/master/docs/libflame/libflame.pdf page 250. So is the dgels implementation not optimized? It seems it is an f2c port of the netlib implementation? https://github.com/amd/libflame/blob/master/src/map/lapack2flamec/f2c/c/dgels.c

Yes. DGELS in libflame is an f2c port of the netlib implementation. It has not yet been optimized for AMD CPUs, nor does it have a multithreaded implementation. Hence it is expected to be slower than MKL's multithreaded implementation. We will consider it for optimization in future releases. In the meantime, please have a look at the releases section of the AMD libflame GitHub repository (Releases · amd/libflame · GitHub) for the improvements made to other APIs.
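For context, the reference (netlib) DGELS handles the full-rank m >= n case with a QR factorization followed by a triangular solve. The sketch below illustrates that approach in plain Python on a tiny problem. It is not libflame's or netlib's code, it is unblocked and far too slow for real sizes, and the function name `lstsq_qr` is hypothetical.

```python
import math


def lstsq_qr(A, b):
    """Solve min ||A x - b||_2 for a full-rank m x n matrix (m >= n)
    using unblocked Householder QR followed by back substitution."""
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]  # work on copies; R is overwritten with the R factor
    y = b[:]                   # becomes Q^T b

    for col in range(n):
        # Build the Householder vector v that zeroes R[col+1:, col].
        norm_x = math.sqrt(sum(R[i][col] ** 2 for i in range(col, m)))
        if norm_x == 0.0:
            raise ValueError("matrix is rank deficient")
        alpha = -math.copysign(norm_x, R[col][col])
        v = [R[i][col] for i in range(col, m)]
        v[0] -= alpha
        vtv = sum(vi * vi for vi in v)

        # Apply H = I - 2 v v^T / (v^T v) to the trailing columns of R.
        for j in range(col, n):
            s = 2.0 * sum(v[i] * R[col + i][j] for i in range(m - col)) / vtv
            for i in range(m - col):
                R[col + i][j] -= s * v[i]

        # Apply the same reflection to the right-hand side.
        s = 2.0 * sum(v[i] * y[col + i] for i in range(m - col)) / vtv
        for i in range(m - col):
            y[col + i] -= s * v[i]

    # Back substitution on the leading n x n upper triangle: R x = (Q^T b)[:n].
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        acc = y[i] - sum(R[i][j] * x[j] for j in range(i + 1, n))
        x[i] = acc / R[i][i]
    return x


# Fit a line c0 + c1 * t through the points (0, 1), (1, 2), (2, 2).
coeffs = lstsq_qr([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]], [1.0, 2.0, 2.0])
print(coeffs)
```

Nearly all of the work sits in the loop that applies each reflection to the trailing columns, which is the part optimized LAPACK implementations block and hand to BLAS.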
