Hi,
On GCN there are 3 kinds of tasks to wait for:
- Vector <- memory: vmcnt
- Export (memory or GDS): expcnt
- LDS, GDS, Constant reads (and Messages): lgkmcnt
Every category has a counter.
When you issue a command that belongs to one of the 3 categories above, it is put in a queue and its counter is increased.
For example, when you're exporting something, you must not touch the registers holding the outgoing data until the export has completed.
When a queued task is complete it will decrease its counter.
You have to use the scalar instruction s_waitcnt to instruct the processor to wait until a specified counter reaches a specified value. When you tell it to wait until expcnt = 0, your program will only continue when the export queue is empty.
If you read something from memory, just issue the command for it, then do some vector or scalar math, and finally issue the s_waitcnt command; hopefully the memory read has already completed by then.
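To make the counter idea more concrete, here is a rough toy model in Python. It is only a sketch: the names (WaitCounter, run), the 10-cycle read latency and the 1-cycle issue cost are made up for illustration and are not real hardware numbers. It just shows that putting the wait right after the read stalls, while putting math in between hides the latency.

```python
class WaitCounter:
    """Toy model of one wait counter (hypothetical, not real hardware behaviour)."""

    def __init__(self):
        self.done_at = []  # cycle at which each queued task completes

    def issue(self, now, latency):
        # Queuing a task increases the counter.
        self.done_at.append(now + latency)

    def value(self, now):
        # Completed tasks no longer count, so the counter decreases over time.
        return sum(1 for t in self.done_at if t > now)


def run(program, latency):
    """Run a list of (op, arg) pairs; return (total_cycles, stalled_cycles)."""
    counter = WaitCounter()
    now = 0
    stalled = 0
    for op, arg in program:
        if op == "load":        # issue a memory read (assumed to cost 1 cycle)
            counter.issue(now, latency)
            now += 1
        elif op == "alu":       # arg = cycles of vector/scalar math
            now += arg
        elif op == "waitcnt":   # block until the counter is <= arg
            while counter.value(now) > arg:
                now += 1
                stalled += 1
    return now, stalled


# Wait immediately after the read: the program stalls for most of the latency.
eager = [("load", None), ("waitcnt", 0), ("alu", 20)]
# Do the math first, wait afterwards: the read finishes in the background.
overlapped = [("load", None), ("alu", 20), ("waitcnt", 0)]

print(run(eager, latency=10))       # (30, 9)
print(run(overlapped, latency=10))  # (21, 0)
```

Both programs do the same work, but the second one never stalls because the read had 20 cycles of math to hide behind.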
"1-4 cycle, quarter wavefronts" -> That's only an internal configuration int the ALU to support the 4 stage pipeline. You don't have to think about it. You'll always have 64 element wavefronts where there are basically 2 instruction lengths: 1 cycles and 4 cycles (double). If you sum up the vector alu instructions this way, it will give you a good estimate in GPU core clocks. But dont forget that on a GCN the minimum workitem count is 4x the stream count of the processor. And if you want latency hiding too, you have to give it 8x or 10x(max).
1) 2) -> As far as I know, everything runs in constant time in the scalar and vector ALUs. All the things with unpredictable duration are handled by units other than the ALUs, and the scalar ALU can wait for them with s_waitcnt. The CU can execute 5 things at any given time: vector ALU, scalar ALU, [vector <- memory], Export, [LDS, GDS, Const]