Perhaps some of them already exist in K10-45nm...
Well, of course, I'm not one of those guys who architect chips. But reading the K10 optimization manual (#40546), especially Appendix C "Instruction Latencies", I got the impression that several instructions could gain throughput if AMD (slightly?) improved the FSTORE unit:
1) Almost all the "MOVxxx xmmreg1, xmmreg2" forms, such as MOVSS/D, MOVLHPS, MOVHLPS, MOVSLDUP, MOVSHDUP. The most important ones here are MOVSS/D, which show up frequently in MSVC-generated code.
2) The next target is the data shuffling instructions (PACKxxxx, UNPCKxxxx, xSHUFxxx). I'm not quite sure how difficult it would be to implement these in the FSTORE, but I think it would be somewhat easier than implementing a whole FADD.
3) The last target is the 128-bit logical operations (xANDx, xORx, etc). The arguments are the same as in 2).
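
To make the three groups concrete, here is a minimal C sketch with one or two SSE intrinsics per group. The function names (move_low, interleave_low, abs_lanes and so on) are made up for illustration. The instruction named in each comment is the one the intrinsic normally maps to, but the compiler is free to pick something else. The code only shows what the operations do to the data; it says nothing about which unit executes them or how fast.

```c
#include <assert.h>
#include <xmmintrin.h> /* SSE intrinsics; assumes an x86 target with SSE enabled */

/* Group 1: register-to-register moves. */
static __m128 move_low(__m128 a, __m128 b) {
    return _mm_move_ss(a, b);       /* MOVSS xmm, xmm: {b0, a1, a2, a3} */
}
static __m128 move_low_to_high(__m128 a, __m128 b) {
    return _mm_movelh_ps(a, b);     /* MOVLHPS: {a0, a1, b0, b1} */
}

/* Group 2: data shuffling. */
static __m128 interleave_low(__m128 a, __m128 b) {
    return _mm_unpacklo_ps(a, b);   /* UNPCKLPS: {a0, b0, a1, b1} */
}
static __m128 reverse_lanes(__m128 a) {
    /* SHUFPS with both sources equal: {a3, a2, a1, a0} */
    return _mm_shuffle_ps(a, a, _MM_SHUFFLE(0, 1, 2, 3));
}

/* Group 3: 128-bit logical operations. */
static __m128 abs_lanes(__m128 a) {
    /* ANDNPS: (~sign_mask) & a clears the sign bit of every lane */
    return _mm_andnot_ps(_mm_set1_ps(-0.0f), a);
}

/* Hypothetical helper: compares the four lanes of v with expected values. */
static int lanes_equal(__m128 v, float e0, float e1, float e2, float e3) {
    float out[4];
    _mm_storeu_ps(out, v);
    return out[0] == e0 && out[1] == e1 && out[2] == e2 && out[3] == e3;
}

/* Runs each example on small exact values and checks the lane layout. */
static void check_examples(void) {
    __m128 a = _mm_setr_ps(0.0f, 1.0f, 2.0f, 3.0f);     /* lanes a0..a3 */
    __m128 b = _mm_setr_ps(10.0f, 11.0f, 12.0f, 13.0f); /* lanes b0..b3 */
    __m128 s = _mm_setr_ps(-1.0f, 2.0f, -3.0f, 4.0f);

    assert(lanes_equal(move_low(a, b),         10.0f, 1.0f,  2.0f,  3.0f));
    assert(lanes_equal(move_low_to_high(a, b), 0.0f,  1.0f,  10.0f, 11.0f));
    assert(lanes_equal(interleave_low(a, b),   0.0f,  10.0f, 1.0f,  11.0f));
    assert(lanes_equal(reverse_lanes(a),       3.0f,  2.0f,  1.0f,  0.0f));
    assert(lanes_equal(abs_lanes(s),           1.0f,  2.0f,  3.0f,  4.0f));
}
```

None of these does any floating-point arithmetic: each one only copies, rearranges, or masks bits, which is why they look like candidates for a simpler unit than FADD. To see what your compiler really emits, call check_examples() from main and look at the disassembly.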