The reason is that the work items are not in any way independent execution paths in the hardware. What the hardware actually does is more like a sophisticated version of SSE: it executes one (VLIW) instruction over 64 lanes of SIMD in one cycle. The programming model, for historical and probably also beneficial reasons, exposes this on a per-lane basis, but that risks giving the wrong idea of how the hardware works. It is not combining separate threads for efficiency; it is executing one 64-wide thread that you can program lane by lane, because that is easier than writing 64-wide SSE-style gather and scatter instructions.
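To make the mental model concrete, here is a minimal Python sketch (not real GPU code; the 64-lane width and function names are illustrative) of what "one wide thread" means: a single operation is applied across all 64 lanes in lockstep, exactly like a very wide SSE instruction, even though you could describe it lane by lane.

```python
WAVE_WIDTH = 64  # one wavefront = 64 SIMD lanes (illustrative constant)

def wave_add(a, b):
    # One "instruction": the add happens across all 64 lanes together.
    # There are never 64 independent threads here, just 64 lanes.
    return [a[i] + b[i] for i in range(WAVE_WIDTH)]

a = list(range(WAVE_WIDTH))   # lane i holds the value i
b = [10] * WAVE_WIDTH         # every lane holds 10
result = wave_add(a, b)       # all lanes advance in lockstep
```

The per-lane view (`a[i] + b[i]`) is just a convenient way to describe the single wide operation; nothing in the hardware executes the lanes independently.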
I don't know if I can give you an optimal group size. Based on past experience, though, I find it hard to believe that the second kernel really has an optimal size of 2. Presumably you have more than one group of size 2, yes? Can you not combine them into a single group? That way you could have, say, the first 512 work items run one kernel and the next 64 run the other, rather than branching within the same wave. There are cases where that doesn't work, but usually you just need to rethink slightly how you are mapping the problem to the hardware.
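A sketch of that idea, again as an illustrative Python simulation rather than real kernel code: if both work ranges are multiples of the 64-lane wave width, the branch that selects the code path is uniform across each wavefront, so no wave ever has to execute both paths. The sizes 512 and 64 follow the example above; the kernel bodies are placeholders.

```python
WAVE_WIDTH = 64
N_A, N_B = 512, 64  # both wave-aligned, so no wave straddles the boundary

def kernel_a(i):
    return i * 2        # placeholder work for the first kernel

def kernel_b(i):
    return i + 100      # placeholder work for the second kernel

def dispatch(total):
    out = []
    for wave_start in range(0, total, WAVE_WIDTH):
        wave_ids = range(wave_start, wave_start + WAVE_WIDTH)
        # This branch is taken once per wave, not once per lane: because
        # N_A is a multiple of WAVE_WIDTH, every lane in a given wave
        # falls on the same side, so there is no intra-wave divergence.
        if wave_start < N_A:
            out.extend(kernel_a(i) for i in wave_ids)
        else:
            out.extend(kernel_b(i - N_A) for i in wave_ids)
    return out

results = dispatch(N_A + N_B)
```

The point is the alignment: if the boundary between the two workloads fell mid-wave, some waves would have to run both branches with lanes masked off, which is the cost you are trying to avoid.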
If I were you, I'd forget "threads", work items and work groups completely and redesign the entire algorithm in terms of wavefronts. That's how the hardware works, so that's how I would program it. Don't ask "how many work items do I need to achieve this?"; ask "how many wavefronts do I need?", and design your data layouts, execution patterns and so on from that assumption, the way you would if you were developing for SSE.
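In practice, "think in wavefronts" mostly means sizing and padding everything to whole waves. A small sketch, assuming the 64-lane wave width from above (the helper names are mine, not any real API):

```python
WAVE_WIDTH = 64

def waves_needed(n_elements):
    # Ceiling division: round the element count up to whole wavefronts.
    return (n_elements + WAVE_WIDTH - 1) // WAVE_WIDTH

def padded_size(n_elements):
    # Lay buffers out in wave-sized chunks so the last wave isn't ragged
    # and per-lane bounds checks can be kept trivial or dropped.
    return waves_needed(n_elements) * WAVE_WIDTH
```

So for 100 elements you would plan for 2 wavefronts and a 128-element layout, rather than asking for "100 work items" and letting the runtime decide how that maps to the hardware.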
Thank you very much for the reply! I'm new to GPU programming; could you please recommend some resources on it?