What is the highest theoretical FLOP/(cycle*core) rate on EPYC processors when using neither vectorization nor FMA, for example with x87 instructions or with SSE instructions that operate only on the first value?
Hi richardw,
On EPYC processors, scalar non-FMA code (SSE or AVX) can do 2 multiplies + 2 adds per cycle, for a total of 4 FLOPs per cycle per core.
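As a rough illustration of the arithmetic, here is a minimal sketch that turns the per-cycle figure into a peak rate. The function names, the 2.0 GHz clock, and the 32-core count are hypothetical values chosen for the example, not figures from this thread; substitute the numbers for your own part.

```python
def scalar_flops_per_cycle(muls_per_cycle=2, adds_per_cycle=2):
    """Peak scalar, non-FMA FLOPs per cycle per core (2 mul + 2 add as stated above)."""
    return muls_per_cycle + adds_per_cycle


def peak_gflops(flops_per_cycle, clock_ghz, cores):
    """Theoretical peak in GFLOP/s: FLOPs/cycle/core * GHz * number of cores."""
    return flops_per_cycle * clock_ghz * cores


per_cycle = scalar_flops_per_cycle()
# Hypothetical example values: 2.0 GHz sustained clock, 32 cores.
print(per_cycle, "FLOP/(cycle*core)")
print(peak_gflops(per_cycle, clock_ghz=2.0, cores=32), "GFLOP/s peak (assumed clock and core count)")
```

Reaching this peak would require an even mix of independent multiplies and adds in the instruction stream; code dominated by one operation type would top out at roughly half of it.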