LATER: My mistake. I was under the impression that the vpshaq instruction takes a single 64-bit shift count that applies to both 64-bit portions of the source operand. So at the label big_s64_m63 my code contained only one 64-bit element of -63, like this:
big_s64_m63: .quad -63
My code should have had two values of -63 after the .quad directive above, one for each 64-bit portion of the source xmm0 register. And when we get AVX2 I suppose my code would have needed four values of -63 after the .quad directive. Doh! I will leave this message posted just in case someone else makes this mistake, and this message helps.
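For anyone who wants to see why a single count produces the behavior described below, here is a rough software model of the per-lane shift. It is only a sketch under stated assumptions: that each 64-bit lane uses the signed low byte of the matching lane of the count register, and that vmovsd from memory zero-fills the upper 64 bits of xmm3, so the upper lane sees a count of 0. The names to_signed, shift_lane and model_vpshaq are made up for illustration, and only counts from -63 to 63 are modeled.

```python
MASK64 = (1 << 64) - 1

def to_signed(value, bits):
    """Interpret an unsigned bit pattern as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def shift_lane(value, count_elem):
    """Shift one 64-bit lane by the signed count in the low byte of count_elem.

    Positive counts shift left, negative counts shift right arithmetically.
    Counts outside -63..63 are not modeled (an assumption of this sketch).
    """
    count = to_signed(count_elem & 0xFF, 8)
    if not -63 <= count <= 63:
        raise ValueError("count outside the range this sketch models")
    if count >= 0:
        return (value << count) & MASK64
    return (to_signed(value, 64) >> -count) & MASK64

def model_vpshaq(src_lanes, count_lanes):
    """Apply shift_lane to each (source, count) pair, low lane first."""
    return [shift_lane(v, c) for v, c in zip(src_lanes, count_lanes)]

src = [0x5555555555555555, 0x5555555555555555]

# One count of -63 in the low lane, upper lane assumed zeroed by vmovsd.
one_count = [(-63) & MASK64, 0]
# Two counts of -63, one per lane, as the corrected .quad data would supply.
two_counts = [(-63) & MASK64, (-63) & MASK64]

print([hex(x) for x in model_vpshaq(src, one_count)])
print([hex(x) for x in model_vpshaq(src, two_counts)])
```

With one count, the model leaves the upper lane untouched, which matches what I saw; with two counts, both lanes are shifted. Note that the full 128-bit count would also have to be loaded with a 128-bit move rather than vmovsd.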
I am writing a function library in 64-bit assembly language, and the vpshaq instruction does not appear to be working correctly. Either that or the documentation is wrong, which seems unlikely unless this instruction works differently from similar "packed" instructions. The documentation clearly states that both 64-bit elements of the source register should be shifted (which is what "packed" means). And as far as I can tell, there is no scalar version of this instruction that might accidentally be executed.
What happens is that 128 bits from the source register are written to the low 128 bits of the destination register, but only the low 64 bits of the source register are shifted. The upper 64 bits of the source register are not shifted, but are simply passed through to the destination register unmodified. The name of this instruction in the documentation is "packed shift arithmetic quadwords", and "packed" normally means that more than the low 64-bit element is operated upon.
vmovsd big_s64_m63, %xmm3 # xmm03.0 = -63 (or other shift count)
vpshaq %xmm3, %xmm0, %xmm1 # xmm01.01 = msbit of arg1
In a typical test, the first instruction loads 0x55555555555555555555555555555555 into register xmm0. That's two 64-bit values, each of which == 0x5555555555555555. The second instruction loads a shift count into register xmm3. My original purpose was to right shift by 63 bits to fill each 64-bit element of the destination register with all 0 bits or all 1 bits, depending on the most-significant bit of the corresponding 64-bit source value. But I've tried many other shift counts from -1 to -63, and some positive shift counts too (which left shift instead of right shift).
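To make the intended result concrete, here is a small sketch of what an arithmetic right shift by 63 should produce for a single 64-bit element. The helper name sign_fill64 is hypothetical, and this models only the arithmetic, not the instruction itself.

```python
MASK64 = (1 << 64) - 1

def sign_fill64(value):
    """Return all ones if bit 63 of value is set, otherwise zero."""
    # Reinterpret the 64-bit pattern as signed so that >> sign-extends.
    signed = value - (1 << 64) if value & (1 << 63) else value
    return (signed >> 63) & MASK64

for sample in (0x5555555555555555, 0xAAAAAAAAAAAAAAAA):
    print(hex(sample), "->", hex(sign_fill64(sample)))
```

The test value 0x5555555555555555 has its top bit clear, so the expected result is all 0 bits; a value such as 0xAAAAAAAAAAAAAAAA has its top bit set and should give all 1 bits.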
In every case, when my code executes the vpshaq instruction, the low 64 bits of the source register (xmm0) end up in the low 64 bits of the destination register (xmm1) shifted as expected, but the next higher 64 bits of the destination register (xmm1) end up containing the original unshifted contents of the source register (xmm0).
This is not what the documentation says, and not what would normally be expected.
I thought maybe the gcc compiler/assembler might be assembling the instruction wrong, and I suppose that is still a possibility. However, offhand I do not know of any other instruction that shifts right the correct number of places (specified in the count register) and performs the sign extension that does in fact occur (on the low 64 bits). So... it appears more like the instruction isn't working properly.
Can anyone verify this for me?
Who do I need to report this to, and how would I go about that?
I am compiling on 64-bit Ubuntu Linux with up-to-date gcc tools. I develop and debug my code with the Code::Blocks IDE (which invokes standard tools like gdb).
For anyone familiar with Intel syntax, note that the operand order in this assembler is reversed, so the source registers come first and the destination register comes last on each line of assembly language.
My CPU is an FX-8150 (Bulldozer).
I have tried many other values in the lower and upper 64-bit portions of the source xmm register (as well as different shift counts), but always only the low 64 bits are shifted and the upper 64 bits are unmodified.
I am familiar with the SIMD instruction set, and have written many functions with AVX, FMA and other advanced instruction sets that work with the xmm and ymm registers, so while I may be doing something stupid here, I would normally be able to recognize it myself. This is, however, the first time I put the vpshaq instruction in my code.