I have written a small assembly program that compares different implementations of the scalar product using the FPU, SSE and AVX instruction sets.
Basically, I fill two arrays of floats, x and y, and I compute the sum of x[i] * y[i].
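
To make the computation concrete, here is a minimal sketch in C of the sum being measured. The actual program is written in assembly; the function name `dot_scalar` and the parameter `n` are only illustrative and do not come from that code.

```c
#include <stddef.h>

/* Scalar product of two float arrays: the sum of x[i] * y[i] for i in [0, n).
 * dot_scalar is a hypothetical name used only for this illustration. */
float dot_scalar(const float *x, const float *y, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum += x[i] * y[i];   /* one multiply and one add per element */
    }
    return sum;
}
```

Every addition depends on the previous value of `sum`, so the loop forms a single chain of dependent adds.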
When I use the FPU instructions (FLD, FMUL, FADD), my program runs for 16 seconds. On other architectures (Intel) it generally takes 10 seconds.
When I use the SSE registers working with vectors of 4 floats, it takes only 2 seconds (I use MOVDQA, MULPS, ADDPS).
So, to be sure of what is happening, I decided to use the SSE registers computing one element at a time (MOVSS, MULSS, ADDSS), and it executes in 10 seconds.
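
The two SSE variants can be sketched with C intrinsics as below. This is only an approximation of the assembly, under some assumptions: the function names are hypothetical, `n` is assumed to be a multiple of 4 in the packed version, and unaligned loads (`_mm_loadu_ps`) are used instead of the aligned move from the assembly version so that the sketch works on any buffer. It needs an x86 compiler with SSE support.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Packed version: 4 floats per iteration (corresponds to MULPS / ADDPS).
 * Assumes n is a multiple of 4; trailing elements are ignored otherwise. */
float dot_sse_packed(const float *x, const float *y, size_t n)
{
    __m128 acc = _mm_setzero_ps();           /* four partial sums */
    for (size_t i = 0; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);     /* unaligned load of 4 floats */
        __m128 vy = _mm_loadu_ps(y + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(vx, vy));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);               /* reduce the four partial sums */
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

/* Scalar version: one float per iteration (corresponds to MOVSS / MULSS / ADDSS). */
float dot_sse_single(const float *x, const float *y, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i++) {
        __m128 vx = _mm_load_ss(x + i);      /* load one float into the low lane */
        __m128 vy = _mm_load_ss(y + i);
        acc = _mm_add_ss(acc, _mm_mul_ss(vx, vy));
    }
    return _mm_cvtss_f32(acc);               /* extract the low lane */
}
```

The packed version keeps four independent partial sums, while the single-element version, like the FPU loop, has one chain of dependent adds.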
So my analysis (maybe I am wrong) is that the FPU is relatively slow compared to the SSE circuitry.
Does anybody have any idea why?