This is not really a suitable task for parallelization.
@Version 1: it's only memory access and no calculations.
@Version 2: Will not work as you will overwrite your values!
buf = buf;
buf = buf; // whoops, buf was already overwritten with buf
so you would need a temp variable and synchronization, which will be slow.
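To make the overwrite problem concrete, here is a minimal sketch in plain C (not OpenCL), assuming the operation is a rotate-left-by-one on a four-element buffer, which is what [buf = buf.yzwx;] would do. The function names and the int element type are made up for illustration. It runs sequentially, so the wrong result is at least reproducible; with one work item per element and no synchronization, the order of reads and writes is not defined at all.

```c
#include <stddef.h>
#include <string.h>

#define N 4

/* Hypothetical helper: naive in-place rotation. Each element reads its
   right-hand neighbour, so the last assignment reads buf[0] after it
   has already been overwritten with the old buf[1]. */
void rotate_in_place_naive(int buf[N]) {
    for (size_t i = 0; i < N; ++i) {
        buf[i] = buf[(i + 1) % N];
    }
}

/* Hypothetical helper: rotation through a temporary copy. All reads come
   from the snapshot, so no write can disturb a later read. */
void rotate_with_temp(int buf[N]) {
    int tmp[N];
    memcpy(tmp, buf, sizeof tmp);
    for (size_t i = 0; i < N; ++i) {
        buf[i] = tmp[(i + 1) % N];
    }
}
```

Starting from {1, 2, 3, 4}, the naive version ends up with {2, 3, 4, 2}, while the version with the temporary gives the intended {2, 3, 4, 1}.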
I think it is a task very suitable for parallelization. Setting many vector elements in the same instruction is better than setting them one at a time. And yes, one might need another buffer to avoid overwriting values.
My question still remains: how many instructions will [buf = buf.yzwx;] take? And will that be faster than executing four [buf[ID] = buf[otherID];]?
What about [buf = buf.s12306745b89acdef;] when buf is a char16? How many instructions is that?
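For reference, this is what such a swizzle means element by element, written as a rough sketch in plain C. It says nothing about how many instructions a given device needs; it only spells out the permutation. The names hex_index and apply_swizzle are hypothetical, and the sketch assumes src and dst do not overlap.

```c
#include <stddef.h>

/* Map one hex digit of an .sXXXX-style swizzle to a component index,
   or return -1 if the character is not a hex digit. */
static int hex_index(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: dst[i] = src[index named by the i-th digit].
   pattern is the digit string after the leading 's' and must hold
   exactly n digits. Returns 0 on success, -1 on a malformed pattern
   (dst may be partially written in that case). */
int apply_swizzle(const char *pattern, const unsigned char *src,
                  unsigned char *dst, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        int idx = hex_index(pattern[i]); /* also catches an early '\0' */
        if (idx < 0 || (size_t)idx >= n) return -1;
        dst[i] = src[idx];
    }
    return pattern[n] == '\0' ? 0 : -1;
}
```

With src holding 0 through 15 and the pattern "12306745b89acdef", dst becomes 1, 2, 3, 0, 6, 7, 4, 5, 11, 8, 9, 10, 12, 13, 14, 15. Because the result is written to a separate dst, the overwrite problem from the in-place version does not arise.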
On a CPU it will use a single core. It's not going to break up the work item across multiple cores. It may be able to use SSE.
It won't break a single work item on a GPU across multiple cores either, but if you use multiple work items, it will use the SIMD engine to execute them across multiple SIMD lanes. On the other hand, in a single work item it will execute on a single SIMD lane's five VLIW ALUs as a set of 4 VLIW instructions in a single instruction packet (plus possible overhead for using chars).