I have been trying to figure out (from studying the AMD64 manuals) how to load 1 to 16 contiguous bytes of memory into an XMM register, starting at the first byte of the register and leaving the remaining bytes unchanged. Out of the plethora of instructions for this architecture, I haven't yet found any that can do this rather straightforward operation. Can anyone help?
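To pin down the behavior being asked for, here is a minimal sketch in C with SSE2 intrinsics. It round-trips through a small stack buffer, so it shows the semantics only and is not a fast path. The helper name `load_low_bytes` is made up for illustration, and the narrow copy followed by a full 16-byte reload may well stall on real hardware.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a real instruction or library call):
   replace the low n bytes (0..16) of reg with bytes read from src and
   keep the upper 16 - n bytes as they were. Only n bytes of src are read. */
static __m128i load_low_bytes(__m128i reg, const void *src, size_t n)
{
    unsigned char tmp[16];

    assert(n <= 16);
    _mm_storeu_si128((__m128i *)tmp, reg);         /* spill current register */
    memcpy(tmp, src, n);                           /* overwrite first n bytes */
    return _mm_loadu_si128((const __m128i *)tmp);  /* reload the merged value */
}
```

Because only `n` bytes of `src` are touched, this never reads past the end of the source buffer, which a full 16-byte load followed by a blend could do.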
Storing the bytes with MASKMOVDQU means first setting up a mask in another XMM register, which is also problematic. The only reason to use the XMM registers at all is speed, but if a simple operation takes this much code, it might not run any faster than scalar code, which defeats the purpose of using vector registers in the first place.
The ideal instruction for building the mask would take a 16-bit register in which each bit corresponds to one byte of the mask register. If each bit of a source GP register (such as BX) were sign-extended to fill the corresponding byte of the mask register, it would be very efficient. I can't find such an instruction, though. SSE doesn't seem to be very well thought out. Something I read compared PPC AltiVec to x86 SSE, and the consensus was that AltiVec was much better planned.
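Lacking a single instruction, the bit-to-byte expansion described above can be put together from a few SSE2 operations. The following is a sketch in C intrinsics, not a claim about the best sequence. The function name `byte_mask_from_bits` is made up for illustration.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: expand a 16-bit value into a 16-byte mask in which
   byte i is 0xFF when bit i of bits is set and 0x00 otherwise. */
static __m128i byte_mask_from_bits(unsigned bits)
{
    assert(bits <= 0xFFFFu);

    /* Bytes 0..7 get the low byte of bits, bytes 8..15 get the high byte. */
    __m128i lo = _mm_set1_epi8((char)(bits & 0xFF));
    __m128i hi = _mm_set1_epi8((char)(bits >> 8));
    __m128i spread = _mm_unpacklo_epi64(lo, hi);

    /* Each byte position tests its own bit within that byte. */
    const __m128i sel = _mm_setr_epi8(1, 2, 4, 8, 16, 32, 64, -128,
                                      1, 2, 4, 8, 16, 32, 64, -128);

    /* A byte becomes all ones exactly when its selected bit is set. */
    return _mm_cmpeq_epi8(_mm_and_si128(spread, sel), sel);
}
```

PMOVMSKB (`_mm_movemask_epi8`) does the reverse conversion, so passing the result back through it should return the original 16-bit value.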
A less-than-ideal workaround would be a 32-byte table in which the first 16 bytes are $FF and the last 16 bytes are $00. Depending on how many contiguous bytes you want to store, you would index into the table at the right place to load the mask. Of course, this means an aligned load such as MOVDQA would be useless for fetching the mask, since the only aligned values would be all 0s or all 1s (depending on the alignment of the table itself). But having to load 16 bytes in order to store 16 bytes leaves me very unimpressed with SSE and the other XMM extensions.
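The table workaround can be sketched in C with SSE2 intrinsics as follows. The names `mask_table`, `leading_byte_mask` and `store_low_bytes` are made up for illustration. The mask is fetched with an unaligned load (MOVDQU) at offset 16 - n, so the first n bytes of the mask come out as $FF.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* 16 bytes of 0xFF followed by 16 bytes of 0x00 (illustrative table). */
static const unsigned char mask_table[32] = {
    0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
    0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
};

/* Mask whose first n bytes (0..16) are 0xFF and whose remaining bytes are 0x00. */
static __m128i leading_byte_mask(size_t n)
{
    assert(n <= 16);
    return _mm_loadu_si128((const __m128i *)(mask_table + 16 - n));
}

/* Store only the first n bytes of data to dst using MASKMOVDQU. */
static void store_low_bytes(void *dst, __m128i data, size_t n)
{
    _mm_maskmoveu_si128(data, leading_byte_mask(n), (char *)dst);
}
```

Note that MASKMOVDQU carries a non-temporal hint, so this store may behave differently from an ordinary one with respect to caching.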