In my kernel I have vector and scalar computation intermixed like this:
...
vector operations on whole vectors (e.g. int4)
scalar operations on individual vector elements (int4.x, int4.y, etc.)
vector operations on whole vectors
scalar operations on individual vector elements
vector operations on whole vectors
...
I wonder whether there is any performance penalty for doing scalar operations on vector elements, compared with using only scalar variables. Does it take any time to extract a vector element? Does it help if I copy vector elements to scalar variables first?
On GCN the physical vector type is not int4, it's effectively int64: each vector register holds one 32-bit value for each of the 64 work-items in a wavefront.
Scalar instructions don't operate on individual vector elements. They have their own separate 64-bit register space, and they run separately from (and in parallel with) the vector ALU.
There are instructions to extract a specific element from a vector register into a scalar reg: v_readlane_b32, v_readfirstlane_b32. They eat 1 cycle.
"Does it help if I copy vector elements to scalar variables first?"
Why? The vector ALU does 64x as many operations as the scalar ALU. The scalar unit is there for program control, address calculation, computing temporary results that are common to the whole 64-lane wavefront, and some miscellaneous things.
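To make the layout concrete, here is a rough host-side model in plain C. It is only a sketch of my understanding, not real GCN code or any AMD API: the names vgpr_t, wave_int4_t, add_x_y, readlane and readfirstlane are made up for illustration. The assumption it illustrates is that an int4 in your kernel ends up as four separate 64-lane vector registers, so int4.x is just one of those registers and an operation on it is an ordinary vector instruction with nothing to extract.

```c
#include <assert.h>
#include <stdint.h>

#define WAVE_SIZE 64 /* lanes (work-items) per GCN wavefront */

/* Model of one vector register: one 32-bit value per lane. */
typedef struct { uint32_t lane[WAVE_SIZE]; } vgpr_t;

/* Model of an OpenCL int4 private variable: four independent vector
   registers. Accessing .x or .y just names one of them (assumption). */
typedef struct { vgpr_t x, y, z, w; } wave_int4_t;

/* "v.x + v.y" in the kernel: one vector add over all active lanes.
   exec is the 64-bit mask of active lanes. Inactive lanes are left at 0
   here, which is a simplification of this model. */
static vgpr_t add_x_y(const wave_int4_t *v, uint64_t exec)
{
    vgpr_t r = {{0}};
    for (int i = 0; i < WAVE_SIZE; ++i)
        if ((exec >> i) & 1u)
            r.lane[i] = v->x.lane[i] + v->y.lane[i];
    return r;
}

/* Rough model of v_readlane_b32: copy one chosen lane to a scalar. */
static uint32_t readlane(const vgpr_t *v, unsigned lane)
{
    assert(lane < WAVE_SIZE);
    return v->lane[lane];
}

/* Rough model of v_readfirstlane_b32: copy the lowest active lane to a
   scalar. Falling back to lane 0 when exec is 0 is assumed here. */
static uint32_t readfirstlane(const vgpr_t *v, uint64_t exec)
{
    for (unsigned i = 0; i < WAVE_SIZE; ++i)
        if ((exec >> i) & 1u)
            return v->lane[i];
    return v->lane[0];
}
```

In this model the "scalar operation on int4.x" from the question is add_x_y: it still runs once per wavefront across all 64 lanes. Only readlane and readfirstlane actually move a value from the vector side to the scalar side.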
"There are instructions to extract a specific element from a vector register into a scalar reg: v_readlane_b32, v_readfirstlane_b32. They eat 1 cycle."
That's exactly what I've been looking for, thank you.