As far as I know it should work, although the code you wrote does not do what you described in the text.
If it doesn't, you can definitely rearrange your code so that all work-items hit the same barrier:
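while( i < longestRun ) {
    if( I_am_thread_to_do_work ) { /* do this work-item's share of the work */ }
    barrier( CLK_LOCAL_MEM_FENCE );  // outside the if, so every work-item reaches it
    i += whatever;
}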
Hope that helps.
Yeah, that's pretty much what I ended up doing. Unfortunately, my code has tons of barriers in the main loop, because I have to compute a scan in shared memory each iteration and update some indices. I can get rid of a bunch of them if I use a private register instead of a local memory variable. But this has the unfortunate side effect of hurting memory bandwidth when finding the pivot point. I guess I'll have to experiment to see which is the lesser evil.
If you use local memory as a private array, so that each work-item only reads and writes its own slots, then you don't need a barrier.
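A small sketch of what that might look like, again using POSIX threads on the CPU as a stand-in for work-items rather than real OpenCL. The names NUM_ITEMS, SLOTS_PER_ITEM, scratch and run_private_slices are hypothetical. Each thread touches only its own slice of the shared array, so no thread ever depends on another thread's writes and no barrier is needed between the write and the read.

```c
#include <pthread.h>
#include <stdlib.h>

#define NUM_ITEMS 4       /* hypothetical number of work-items */
#define SLOTS_PER_ITEM 3  /* hypothetical scratch slots per work-item */

static int scratch[NUM_ITEMS * SLOTS_PER_ITEM];  /* plays the role of a __local array */
static int result[NUM_ITEMS];

static void *private_slice_item(void *arg) {
    int id = *(int *)arg;
    int *mine = &scratch[id * SLOTS_PER_ITEM];  /* this thread's own slice */

    /* Write own slots, then read them back: no other thread is involved. */
    for (int k = 0; k < SLOTS_PER_ITEM; ++k) mine[k] = id + k;
    int sum = 0;
    for (int k = 0; k < SLOTS_PER_ITEM; ++k) sum += mine[k];

    result[id] = sum;
    return NULL;
}

/* Runs all emulated work-items and returns the sum of their results. */
int run_private_slices(void) {
    pthread_t threads[NUM_ITEMS];
    int ids[NUM_ITEMS];

    for (int t = 0; t < NUM_ITEMS; ++t) {
        ids[t] = t;
        if (pthread_create(&threads[t], NULL, private_slice_item, &ids[t]) != 0) abort();
    }
    for (int t = 0; t < NUM_ITEMS; ++t) pthread_join(threads[t], NULL);

    int total = 0;
    for (int t = 0; t < NUM_ITEMS; ++t) total += result[t];
    return total;
}
```

The barrier only becomes necessary again once one work-item reads a slot that a different work-item wrote.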