I want to make sure I understand vector types and how they execute.
Assume a, b, c are int4.
If I write
c = a + b;
then all four components are added pairwise simultaneously in a single thread processor, in a single instruction, using the four "normal" stream cores.
If, on the other hand, I declare ax, ay, az, aw, bx, ... as int and write
cx = ax + bx;
cy = ay + by;
cz = az + bz;
cw = aw + bw;
then in theory the compiler could optimize this by working out that the eight scalars can be stored the same way as two int4 values and added the way it adds an int4. But that's a hell of an optimization to count on, especially when you can ensure the vector add just by using int4.
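To make the comparison concrete, here is a minimal host-side C sketch of the two forms. It assumes GCC or Clang: the int4 typedef below is only a stand-in built on their vector_size extension, not the OpenCL built-in, and the function names are made up for illustration. It only shows that both forms compute the same four sums; it says nothing about how either one gets scheduled on the GPU.

```c
#include <assert.h>

/* Stand-in for OpenCL's int4, using the GCC/Clang vector_size extension.
   This is an assumption for host-side illustration, not the OpenCL type. */
typedef int int4 __attribute__((vector_size(16)));

/* Vector form: a single expression adds all four components. */
static int4 add_vector(int4 a, int4 b) {
    return a + b;
}

/* Scalar form: four independent adds, written out one component at a time. */
static void add_scalar(const int a[4], const int b[4], int c[4]) {
    c[0] = a[0] + b[0];  /* cx = ax + bx */
    c[1] = a[1] + b[1];  /* cy = ay + by */
    c[2] = a[2] + b[2];  /* cz = az + bz */
    c[3] = a[3] + b[3];  /* cw = aw + bw */
}

/* Hypothetical self-check: both forms must produce identical components. */
static void check_forms_agree(void) {
    int4 va = {1, 2, 3, 4};
    int4 vb = {10, 20, 30, 40};
    int4 vc = add_vector(va, vb);

    int sa[4] = {1, 2, 3, 4};
    int sb[4] = {10, 20, 30, 40};
    int sc[4];
    add_scalar(sa, sb, sc);

    for (int i = 0; i < 4; i++) {
        assert(vc[i] == sc[i]);
    }
}
```

Calling check_forms_agree() passes silently, so the results are the same either way. What I'm asking about is only whether the hardware issues them the same way.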
Is this correct?