Hi,

I built a new quad GPU system with AMD Threadripper 1950x. But to my surprise, the numpy linear algebra performance is extremely slow (2x slow than even i7-6700k ...), not even have to mention i9 series. Not just numpy, PyTorch uses Magma, the SVD operation in Magma uses CPU too. Now this threadripper CPU becomes a huge bottleneck of our server.

I did some benchmark with python2 came with anaconda distribution. With the prebuild numpy (linked to rt_mkl), the performance is shockingly bad as I mentioned. With nomkl package, the openblas performance is slightly better with more thread, but still pretty bad. I have heard a lot of good things about threadripper, but maybe scientific computing is not what it is for.

Since this is only our experimental server with an AMD CPU, we can switch to Intel i9 or dual xeon platform without too much trouble. But please let me know if there is a way to make Threadripper get an okay performance with numpy. As long as it can get 80% of an i9-7900x performance on linear algebra, we will keep this CPU. If anyone can give me some advice, we will appreciate that!

so the linear algebra operation is done on CPU? It's highly likely that numpy invokes MKL, which is not friendly to AMD CPUs.

The solution is to replace MKL with OpenBLAS(if OpenBLAS provides all the functions you need and numpy agrees....), set environment variable OPENBLAS_CORETYPE=ZEN(or EXCAVATOR if ZEN is not supported), OPENBLAS_NUM_THREADS=16(if SMT is disabled).

a properly configured openblas is much faster than MKL on TR 1950x.

The performance of TR 1950x is sensitive to the memory bandwith. 2400MHz*4-chanel at least.