There is a lot of information about ALU and TEX performance, but I can't find any information about the dispatch processor.
So my question is: how many control flow instructions can be executed per cycle?
I think that 40 cycles is the CF instruction latency, not the throughput. I asked about throughput. In other words, how high can the CF/ALU ratio go before we become CF bound?
A control flow instruction just switches to a different clause of instructions. It takes about 40 cycles before the new ALU clause can start. So you should be able to hide this latency by having a high ALU/CF ratio or by having a large number of wavefronts in a compute unit.
But the CF latency is much higher when the CF instruction diverges, and is thus very difficult to hide.
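To put rough numbers on the latency-hiding argument, here is a back-of-envelope sketch in Python. It is only a toy model, not a description of how the dispatch processor actually schedules work. It assumes every wavefront alternates between an ALU clause of fixed length and a fixed 40-cycle clause switch, that wavefronts are interleaved perfectly, and that nothing diverges. The function names and the `alu_cycles_per_clause` parameter are made up for illustration. It says nothing about CF throughput, which is the original question.

```python
import math

CF_LATENCY = 40  # cycles before a new ALU clause can start (figure quoted in this thread)


def alu_utilization(alu_cycles_per_clause, wavefronts, cf_latency=CF_LATENCY):
    """Fraction of cycles the ALUs stay busy in the toy model.

    Each wavefront repeats: run an ALU clause, then wait cf_latency cycles.
    With perfect interleaving, the ALUs are busy for
    wavefronts * alu_cycles_per_clause cycles out of every period, capped at 1.
    """
    period = alu_cycles_per_clause + cf_latency
    return min(1.0, wavefronts * alu_cycles_per_clause / period)


def wavefronts_to_hide(alu_cycles_per_clause, cf_latency=CF_LATENCY):
    """Smallest wavefront count that fully covers the clause-switch latency.

    While one wavefront waits, the others must supply at least
    cf_latency cycles of ALU work between them.
    """
    return 1 + math.ceil(cf_latency / alu_cycles_per_clause)


if __name__ == "__main__":
    for alu_cycles in (8, 16, 40, 80):
        needed = wavefronts_to_hide(alu_cycles)
        print(f"ALU clause of {alu_cycles} cycles: {needed} wavefronts to hide the switch")
```

In this model a longer ALU clause (a higher ALU/CF ratio) or more resident wavefronts both push utilization toward 1, which matches the advice above. A divergent CF instruction would correspond to a much larger `cf_latency`, which is why it becomes so hard to cover.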