If you're looking for loop overheads:
On GCN chips (HD77xx+) the overhead is small: 1 to 4 cycles. If the loop runs the same way for all the work-items in a wavefront, it can be handled in 1 cycle by the Scalar ALU.
On older VLIW chips it costs a clause switch, which takes longer: 10-40 cycles, I guess.
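
To put those numbers in perspective, here is a rough back-of-the-envelope sketch. It assumes the overhead is paid once per iteration and that the loop body takes a fixed number of cycles; the function name and the 20-cycle body are made up for illustration, not measured on any hardware.

```python
def loop_overhead_fraction(body_cycles, overhead_cycles):
    """Share of each iteration spent on loop control instead of the body.

    Assumes the overhead is paid once per iteration (an assumption).
    """
    return overhead_cycles / (body_cycles + overhead_cycles)


# Hypothetical loop body of 20 cycles per iteration (an assumption).
BODY_CYCLES = 20

# Overhead figures taken from the estimates above.
cases = [
    ("GCN, scalar ALU", 1),
    ("GCN, upper end", 4),
    ("VLIW, low guess", 10),
    ("VLIW, high guess", 40),
]

for label, overhead in cases:
    share = loop_overhead_fraction(BODY_CYCLES, overhead)
    print(f"{label}: {share:.0%} of each iteration is overhead")
```

The shorter the loop body, the bigger the share the overhead takes, so plug in your own body length to see whether it matters for your kernel.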