The figure below shows floating-point performance over time: the x axis is time, the y axis is GFLOPS, computed as y = 2*n^3 / (time consumed by the BLAS function dgemm()). I have tested several processors. The performance of the TR 1950X (the three upper curves) fluctuates noticeably, while the others (the other three curves) are relatively stable. What is the reason?
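For reference, this is how I compute each point on the curve (a sketch; the matrix size n=1024 here is hypothetical, the real runs use larger n):

```python
import time
import numpy as np

n = 1024  # hypothetical size for illustration
rng = np.random.default_rng(0)
a = rng.random((n, n))
b = rng.random((n, n))

t0 = time.perf_counter()
c = a @ b  # NumPy dispatches float64 matmul to the BLAS dgemm routine
elapsed = time.perf_counter() - t0

# The y-axis formula from above: 2*n^3 floating-point ops per dgemm call
gflops = 2 * n**3 / elapsed / 1e9
print(f"{gflops:.1f} GFLOPS")
```

Each measured call gives one sample; the curves plot these samples over wall-clock time.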
Another question. The peak double-precision GFLOPS of the TR 1950X is 480 GFLOPS (per AIDA64): 3.75 GHz * 16 cores * 8 FLOPs/cycle = 480 GFLOPS. My best dgemm() result is 380 GFLOPS (average), 414 GFLOPS (max). Ryzen can only do one 256-bit FMA per cycle (throughput), which is half of Intel's solution. So why: cost, power consumption, or heat?
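To spell out the peak figure (assuming the 3.75 GHz clock reported by AIDA64 and Zen 1's single 256-bit FMA pipe):

```python
# Theoretical double-precision peak for the TR 1950X (assumed figures)
clock_ghz = 3.75        # boost clock used in the question
cores = 16
# One 256-bit FMA per cycle = 4 doubles * 2 ops (multiply + add) = 8 FLOPs/cycle
flops_per_cycle = 4 * 2

peak_gflops = clock_ghz * cores * flops_per_cycle
print(peak_gflops)  # 480.0
```

So 380-414 GFLOPS in dgemm() is roughly 80-86% of that theoretical peak.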
Windows 10. OpenBLAS v0.2.19, OPENBLAS_CORETYPE=EXCAVATOR.
Maybe it's because of 4K aliasing (in Intel's terminology).
On Windows, setting LDC (the leading dimension of matrix C) to 4096+64 rather than 4096 makes the performance of the TR 1950X much more stable.
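One way to reproduce this padding trick from NumPy (a sketch: for a row-major array the row stride plays the role of the leading dimension that BLAS sees; the pad value 64 matches the one above):

```python
import numpy as np

n = 4096
pad = 64  # extra columns so consecutive rows are not 4 KiB-aligned apart

# Allocate a buffer whose rows are n+pad doubles apart; use only the first n columns.
buf = np.zeros((n, n + pad))
C = buf[:, :n]  # effective leading dimension is n+pad, not n

# Row stride is (n+pad)*8 = 33280 bytes, which is NOT a multiple of 4096,
# so successive rows no longer collide at the same 4 KiB page offset.
print(C.strides)
```

With LDC a multiple of 4096, elements in the same column of adjacent rows map to the same low address bits, which is exactly the pattern 4K aliasing punishes.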
But on Linux (Kali), the performance is always stable regardless of whether LDC is divisible by 4K, except for sharp drops every few minutes.
I'm a little confused now...