sh,
A control flow (CF) instruction just switches from one instruction clause to the next. It takes about 40 cycles before the new ALU clause can start, so you should be able to hide this latency with a high ALU:CF ratio or with a large number of wavefronts resident on a compute unit.
However, when a CF instruction diverges, the CF latency is much higher and therefore very difficult to hide.
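
To get a rough feel for the non-divergent case, here is a back-of-the-envelope sketch in Python. It is only a toy model, not how the hardware scheduler actually works: it assumes the ~40 cycle figure above, assumes every ALU clause takes the same number of cycles, and assumes a clause switch is fully hidden as long as the other resident wavefronts have enough ALU work to cover it. The function name and the model itself are my own assumptions, and memory latency is ignored completely.

```python
import math

# Approximate clause-switch latency quoted above (assumption: treated as a constant).
CF_LATENCY_CYCLES = 40

def wavefronts_needed(alu_clause_cycles, cf_latency=CF_LATENCY_CYCLES):
    """Toy estimate of how many wavefronts hide one clause switch.

    While one wavefront waits cf_latency cycles, the other (W - 1)
    wavefronts each run one ALU clause of alu_clause_cycles cycles.
    The switch is considered hidden when (W - 1) * alu_clause_cycles
    is at least cf_latency.
    """
    if alu_clause_cycles <= 0:
        raise ValueError("alu_clause_cycles must be positive")
    return 1 + math.ceil(cf_latency / alu_clause_cycles)

if __name__ == "__main__":
    # Longer ALU clauses (higher ALU:CF ratio) need fewer wavefronts.
    for cycles in (4, 10, 20, 40):
        print(cycles, "ALU cycles per clause ->", wavefronts_needed(cycles), "wavefronts")
```

The point is just the trade-off: in this model, the longer each ALU clause runs between CF instructions, the fewer wavefronts you need to keep the ALUs busy. It says nothing about the divergent case, where the latency is much higher.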