In section 6.6.1 of the APP guide, "Hiding ALU and Memory Latency", I read:
The read-after-write latency for most arithmetic operations (a floating-point add, for example) is only four cycles.
Read-after-write... Since SI devices take 4 cycles to execute an instruction, my understanding is that they WRITE the result after four clocks, that is, the first 16-WI slice of a result. The register is somehow marked "being written" so that it cannot be read in the meantime.
In other words, I cannot use a value immediately after computing it, and I must put at least one other instruction in between.
Is this correct?
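To make the arithmetic I have in mind concrete, here is a minimal sketch in Python. It only models how a wavefront would be split into 16-WI slices, one slice per cycle. The 64-WI wavefront size and the 16-lane SIMD width are my assumptions about SI hardware, and the function name `wavefront_slices` is made up for illustration. It says nothing about when the result register actually becomes readable, which is exactly the part I am asking about.

```python
# Toy model: how a wavefront is issued in slices over a narrower SIMD.
# Both constants are assumptions about SI/GCN, not values from the guide's text.
WAVEFRONT_SIZE = 64  # assumed work-items (WIs) per wavefront
SIMD_WIDTH = 16      # assumed lanes per SIMD unit


def wavefront_slices(wavefront_size=WAVEFRONT_SIZE, simd_width=SIMD_WIDTH):
    """Return a list of (cycle, first_wi, last_wi), one slice per cycle."""
    slices = []
    cycle = 0
    for first_wi in range(0, wavefront_size, simd_width):
        # The last slice may be partial if the sizes do not divide evenly.
        last_wi = min(first_wi + simd_width, wavefront_size) - 1
        slices.append((cycle, first_wi, last_wi))
        cycle += 1
    return slices


if __name__ == "__main__":
    for cycle, first_wi, last_wi in wavefront_slices():
        print(f"cycle {cycle}: WIs {first_wi}..{last_wi}")
```

Under these assumptions one instruction occupies the SIMD for 64 / 16 = 4 cycles, and the "first 16-WI slice" I mention above is WIs 0..15 in cycle 0.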
I am not very familiar with the GCN ISA, but I had some kernels that seemed to use wait instructions for no apparent reason.
I've also measured some performance increase in a case where I manually merged two WIs... but again I have no real evidence.
I'll have to write a few things next week, so I figured it was a good time to ask. Thank you for any input.