The Phenom architecture manual says that:
xor eax, eax
mov al, foo
is worse than
movzx eax, foo
because of partial-register merging penalties.
But what about setCC? What kind of merging penalties exist for the setCC instructions, which can only output to 8-bit registers, and don't zero the high bits? Should I do:
xor eax, eax
setne al
or should I do
setne al
movzx eax, al
The former is faster on all the Intel chips I've tested, since the xor can be executed well in advance, but I don't know how Phenom merging penalties affect this.
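To make the comparison concrete, here is a sketch of both sequences with the flag-setting instruction included. The `cmp ecx, edx` is only a placeholder I'm assuming for illustration; any instruction that sets the flags would do. The detail that matters is that in the first variant the `xor` has to be placed before the compare, because `xor` itself overwrites the flags that `setne` reads.

```asm
; Variant 1: zero the full register first, then write the low byte.
xor   eax, eax      ; must come before cmp: xor clobbers the flags
cmp   ecx, edx      ; placeholder compare (assumed for illustration)
setne al            ; eax = 1 if ecx != edx, else 0

; Variant 2: write the low byte, then zero-extend it.
cmp   ecx, edx      ; same placeholder compare
setne al            ; only al is written; the upper bits of eax are stale
movzx eax, al       ; zero-extend al into the full register
```

Both variants leave the same value in eax; the question is only which one costs less on Phenom.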