How do branches affect the work-items in one wavefront?

In OpenCL, a wavefront of 64 work-items is scheduled at a time. Because all work-items run in lock-step, if even one work-item is delayed (by a cache miss, for example), all the others have to wait for it. What confuses me is this: in the actual scheduling process, a quarter of the wavefront (i.e. 16 work-items) is issued to the GPU cores in one cycle, so the whole wavefront executes over 4 consecutive cycles.

1) If one work-item in the first quarter is delayed, will the other three quarters be delayed as well?

2) If only one work-item in the second quarter is delayed, is the first quarter unaffected while the 3rd and 4th quarters are delayed?

Is that true on AMD GPUs?

2 Replies

Hi Acekiller,

    How quarter wavefronts are actually scheduled is not known, but what is true is that one wavefront (of 64 threads) always executes in lock-step, as if it had an implicit barrier. So if any quarter of the wavefront is delayed, the whole wavefront is delayed.
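
As a rough illustration of the lock-step rule, here is a toy Python model. It is not a description of the real scheduling hardware; the function name and the latency numbers are made up for the example. It simply treats the wavefront as finishing an instruction only when its slowest work-item does.

```python
WAVEFRONT_SIZE = 64  # work-items per wavefront, as stated in the thread


def wavefront_latency(lane_latencies):
    """Toy model: the wavefront advances only when its slowest work-item is done."""
    if len(lane_latencies) != WAVEFRONT_SIZE:
        raise ValueError("expected one latency per work-item")
    return max(lane_latencies)


# Hypothetical numbers: every work-item takes 4 cycles, except work-item 20
# (in the second quarter), which takes 300 cycles because of a cache miss.
latencies = [4] * WAVEFRONT_SIZE
latencies[20] = 300
print(wavefront_latency(latencies))  # 300
```

In this model it makes no difference which quarter the slow work-item belongs to; the whole wavefront sees the 300-cycle delay.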

    Please let us know if this answers your question.


AMD Support



On GCN there are 3 kinds of tasks to wait for:

- Vector <- memory

- Export (memory or GDS)

- LDS, GDS, Constant reads (and Messages)

Every category has a counter.

When you issue a command that belongs to one of the 3 categories above, it is put in a queue and its counter is increased.

For example when you're exporting something, you must not touch the registers holding the outgoing data.

When a queued task is complete it will decrease its counter.

You have to use the scalar instruction s_waitcnt to tell the processor to wait until a specified counter reaches a specified value. For example, when you tell it to wait until expcnt = 0, your program will only continue once the export queue is empty.
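
The counter mechanism can be sketched with a small Python model. This is only a toy under stated assumptions, not how the hardware is implemented: the class name `WaitCounter`, its methods, the in-order completion, and the 300-cycle latency are all hypothetical and chosen for illustration.

```python
from collections import deque


class WaitCounter:
    """Toy model of one outstanding-operation counter (hypothetical, not real hardware)."""

    def __init__(self):
        self.count = 0
        self.pending = deque()  # completion times of queued operations, in issue order

    def issue(self, now, latency):
        # Issuing a queued operation increases the counter.
        self.pending.append(now + latency)
        self.count += 1

    def wait_until(self, now, target):
        """Model of a wait: return the time at which count <= target."""
        while self.count > target:
            done = self.pending.popleft()  # assume operations complete in issue order
            now = max(now, done)           # stall only if the operation is not done yet
            self.count -= 1                # a completed operation decreases the counter
        return now


read_counter = WaitCounter()
t = 0
read_counter.issue(t, latency=300)  # start a memory read (made-up 300-cycle latency)
t += 40                             # 40 cycles of independent ALU work
t = read_counter.wait_until(t, 0)   # wait until no reads are outstanding
print(t)  # 300
```

The 40 cycles of ALU work are hidden behind the read: the program resumes at cycle 300, not 340. If the ALU work had taken longer than the read, the wait would return immediately.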

If you read something from memory, just issue the read, then do some vector or scalar math, and only then issue the s_waitcnt; hopefully the memory read has already completed by that point.

"1-4 cycle, quarter wavefronts" -> That's only an internal arrangement in the ALU to support the 4-stage pipeline. You don't have to think about it. You'll always have 64-element wavefronts, with basically 2 instruction lengths: 1 cycle and 4 cycles (double). If you sum up the vector ALU instructions this way, you get a good estimate in GPU core clocks. But don't forget that on GCN the minimum work-item count is 4x the number of stream processors. And if you want latency hiding too, you have to give it 8x or 10x (max).

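The rule of thumb above can be written down as a short Python sketch. The function names and the example figures (100 ordinary ops, 10 double ops, a GPU with 2048 stream processors) are made up for illustration; the 1-cycle/4-cycle weights and the 4x, 8x and 10x factors are taken from the paragraph above, not from any official documentation.

```python
def estimate_valu_cycles(ordinary_ops, double_ops):
    """Rule of thumb from the post: 1 cycle per ordinary vector ALU op, 4 per double op."""
    return ordinary_ops * 1 + double_ops * 4


def work_items_needed(stream_processors, factor=4):
    """factor=4 is the minimum named in the post; 8 or 10 (max) adds latency hiding."""
    return stream_processors * factor


# Hypothetical kernel: 100 ordinary vector ALU instructions and 10 double ones.
print(estimate_valu_cycles(100, 10))  # 140

# Hypothetical GPU with 2048 stream processors.
print(work_items_needed(2048))        # 8192
print(work_items_needed(2048, 10))    # 20480
```
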
1) 2) -> As far as I know, everything in the scalar and vector ALUs runs in constant time. Everything with an unpredictable duration is handled by units other than the ALUs, and the scalar ALU can wait for them with s_waitcnt. The CU can execute 5 things at any given time: vector ALU, scalar ALU, [vector <- memory], Export, [LDS, GDS, Const]