I found that AMD's K15 Optimization Guide lacks information on the vbroadcastXX instructions. Could you provide information (pipes, decode type, latency) for these instructions?
Besides that, it is unclear how microcoded instructions are handled on K15. Do they cause a decoder stall (and if so, for how many cycles), a pipeline serialization, or a pipeline flush? Also, on page 111 the optimization guide advises to "Avoid the use of a microcoded 256-bit store by using vextractf128 to store the upper half of the result operand." Which 256-bit store instructions (VMOVAPS/D, VMOVUPS/D, or VMOVNTPS/D) are microcoded?
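For reference, this is how I read the quoted advice, as a minimal sketch using GCC/Clang AVX intrinsics. The helper name store256_split is my own, and the instruction names in the comments are what I would expect the compiler to emit, not something I have verified. Whether this is actually needed for all three store forms is what I am asking.

```c
#include <immintrin.h>

/* Hypothetical helper (the name is mine, not from the guide):
   copy 8 floats, writing the 256-bit value as two 128-bit stores
   instead of a single 256-bit store. The target attribute lets this
   compile with GCC/Clang even without -mavx on the command line. */
__attribute__((target("avx")))
void store256_split(float *dst, const float *src)
{
    /* Unaligned 256-bit load of the source data. */
    __m256 v = _mm256_loadu_ps(src);

    /* Lower half: reinterpret the low 128 bits and store them
       (expected to become a 128-bit vmovups). */
    _mm_storeu_ps(dst, _mm256_castps256_ps128(v));

    /* Upper half: extract bits 255:128 and store them
       (expected to become vextractf128 with imm8 = 1). */
    _mm_storeu_ps(dst + 4, _mm256_extractf128_ps(v, 1));
}
```

The compiler is free to pick different instructions than the ones named in the comments, so the generated assembly should be checked before drawing any conclusions from timing it.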