Has anyone else noticed a huge branching penalty?

For a kernel about 500 VLIW instructions long, adding a single branch (taken with a probability of roughly 1 in 400,000,000, so practically all threads follow the same path almost all of the time) reduces speed by around 10-20%. Has anyone else seen this?

That is quite unusual coming from CUDA, where branches are almost free as long as all threads follow the same code path.
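
To make the question concrete, here is a rough sketch of the kernel shape I mean, written as plain C rather than actual GPU code. The names `mix`, `work_item`, `target` and `found` are made up for illustration, and the short loop only stands in for the real ~500-instruction body: a long straight-line computation followed by one data-dependent branch that is almost never taken.

```c
#include <stdint.h>

/* Hypothetical stand-in for the long straight-line arithmetic body.
 * The real kernel is far longer; this is only a placeholder. */
static uint32_t mix(uint32_t x) {
    for (int i = 0; i < 8; ++i) {
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
    }
    return x;
}

/* One work item: does the long computation, then hits the single branch.
 * Returns 1 if the rare branch was taken, 0 otherwise. */
int work_item(uint32_t input, uint32_t target, uint32_t *found) {
    uint32_t h = mix(input);

    /* The only branch: taken very rarely, so in practice
     * every thread falls through to the same path. */
    if (h == target) {
        *found = input;
        return 1;
    }
    return 0;
}
```

The slowdown shows up when the `if` is present, even though essentially no thread ever enters it.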