That's very strange. I haven't run any experiments on this myself. However, the documentation says the branch granularity is the same as the wavefront size (64 threads).
Could you post some sample code that shows this behavior?
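To make the documented figure concrete, here is a rough sketch of what a branch granularity of 64 threads would imply. It is only a host-side model in Python, not device code, and it assumes that a wavefront whose threads disagree on a condition has to run both sides of the branch. The function name `divergent_wavefronts` and the 256-thread layout are made up for illustration.

```python
WAVEFRONT_SIZE = 64  # figure quoted from the documentation above


def divergent_wavefronts(takes_branch, wavefront_size=WAVEFRONT_SIZE):
    """Count wavefronts whose threads disagree on a branch condition.

    takes_branch is a list of booleans, one per thread, in thread-id order.
    Assumption: a wavefront with mixed outcomes executes both branch paths.
    """
    count = 0
    for start in range(0, len(takes_branch), wavefront_size):
        group = takes_branch[start:start + wavefront_size]
        # Mixed outcome: at least one thread takes the branch, but not all do.
        if any(group) and not all(group):
            count += 1
    return count


# 256 threads, i.e. 4 wavefronts of 64 threads each.
# Condition aligned to a wavefront boundary: only threads 0..63 take the branch.
aligned = [tid < 64 for tid in range(256)]
# Condition alternating per thread: every wavefront has mixed outcomes.
interleaved = [tid % 2 == 0 for tid in range(256)]

print(divergent_wavefronts(aligned))      # 0
print(divergent_wavefronts(interleaved))  # 4
```

If the hardware really behaves as documented, a condition that is uniform across each group of 64 threads should cost about the same as no divergence at all, so a measurement that contradicts this would be worth posting along with the kernel.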
Can you post a code example that shows the behavior?
See the Stream Computing User Guide, section 1.3. Do the examples help explain the behavior?