# Archives Discussions

## how to dynamically index to vector component

I have a piece of code as below:

```c
...

int4 vec;

...

if (condition == 0)
    vec.x = 100;
else if (condition == 1)
    vec.y++;
else if (condition == 2)
    vec.z++;
else
    vec.w++;
```

Is it possible to index into a component dynamically, to avoid the branching? Ideally, I could write:

```c
vec.condition++;
```

But the above code does not work.

1 Solution
Staff

No, vectors cannot be dynamically indexed. Your best bet is to push the vector into a private array, index into that, and hope the compiler can optimize it into register-based indexing.

10 Replies

Thank you, Micah!

Challenger

That code will probably be compiled to branchless code anyway, since it isn't accessing memory.

e.g. into something like:

```c
vec.x = condition == 0 ? 100 : vec.x;
vec.y = condition == 1 ? vec.y + 1 : vec.y;
```

etc.

You could probably try rewriting it in a more vectorised way, but that would probably only be worth it on a vector processor.

e.g. something such as (assuming condition is limited to 0, 1, 2, or 3):

```c
vec = select(vec, (int4)(100), (int4)(0, -1, -1, -1) == condition);
vec = select(vec, vec + 1, (int4)(-1, 1, 2, 3) == condition);
```

Hi Notzed,

Your answer is of great interest. I have some new question from your answer:

(1) Does the ternary operator ?: translate to a single instruction without branching?

(2) Is a vector operation with some lanes masked out more efficient than a scalar operation? More specifically, is

```c
vec = select(vec, (int4)(100), (int4)(0, -1, -1, -1) == condition);
```

more efficient than

```c
vec.x = select(vec.x, (int)100, 0 == condition);
```

?

I think those who understand IL Assembly can help.

Thank you very much!

Challenger

1) It will be at least 2 instructions: one for the condition test and one for the select. In this case it also needs a third to prepare the 'true' value, and it may often need more, e.g. to calculate each alternative or to calculate the condition. On a scalar processor it will be whatever this 'n' is, multiplied by the vector width.

2) Maybe in some cases, but only on vector processors. It was just a contrived example; I was just showing it fully 'vectorised'. On a vector processor (i.e. SIMD) it won't be any less efficient, though, as the hardware always operates at some fixed vector width and the compiler will need to mask out unused elements anyway; that might take more operations, since the compiler doesn't know 'condition' has a fixed range.

The AMD GPUs prior to GCN are 4- or 5-way VLIW, not SIMD, so strict vectorisation isn't very important and might even make things slower. The hardware can already pack different instructions into the same word and can do other work rather than no-op on unchanged elements. And GCN is scalar, the same as Fermi (I believe).

Hi Notzed,

Thank you very much for your comments. They are very helpful.

I am currently working on a 5870, which is not based on the GCN architecture. As I understand it, within a compute unit (sometimes simply called a SIMD core), all processing elements work in SIMD fashion. That means all work items in a wavefront share a single instruction counter. Therefore, the VLIWs on all the 4-way ALUs have to be the same.

Is this understanding correct?

Thank you again!

Each slot of the VLIW can execute a different instruction as long as it follows the rules of the hardware (register porting, instruction slot requirements, etc.).

Challenger

The whole processor core is SIMD, but that is SIMD across all work items, and it is something inaccessible to the programmer. This is basically how desktop GPUs work, and it is why branching is expensive. However, each SIMD 'channel' executes the same VLIW instruction within each work item.

It seems a bit confusing at first, but it's fairly straightforward. It's been discussed before in these forums, and the info is available in articles on Tom's Hardware, AnandTech, and so on.

Micah just said the VLIWs can be different. I hope that each work item works independently and branching is never something to worry about.

Challenger

You don't seem to understand: each work item is part of the same SIMD unit, i.e. all work items in the work-group run in lock-step and all execute the same instruction at the same time. The independent VLIW bit applies only within a single instruction, and it's not something you can directly affect; it's just a tool for the compiler.

Just ignore the terms 'SIMD' and 'VLIW': just remember that all items in the work-group execute in parallel based on the hardware wavefront size.

So branching is something you absolutely must worry about ... well, if you care about GPU performance at any rate.