wavefront = warp in CUDA
work-group = thread block in CUDA
LDS (the equivalent of shared memory in CUDA) shares data within a group, which may contain several wavefronts. However, I suspect that fencing LDS accesses will be slow when one group uses multiple wavefronts. You can see a loop used to synchronize LDS accesses in the disassembly.
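To make the multi-wavefront case concrete, here is a minimal sketch in CUDA terms, where `__shared__` memory plays the role of LDS and `__syncthreads()` is the group-wide barrier. The kernel name `reverse_in_block`, the helper `run_reverse_check`, and the block size of 128 threads (4 warps) are my own choices for illustration, not taken from your code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kThreadsPerBlock = 128;  // assumed size: 4 warps of 32 threads

// Each thread stores one value in shared memory (LDS on AMD hardware), then
// reads a value that was written by a thread in a different warp.
__global__ void reverse_in_block(const int* in, int* out) {
    __shared__ int tile[kThreadsPerBlock];
    int t = threadIdx.x;
    tile[t] = in[t];
    // Barrier across all warps in the block; without it, a warp could read
    // a slot that another warp has not written yet.
    __syncthreads();
    out[t] = tile[kThreadsPerBlock - 1 - t];
}

// Runs the kernel on one block and returns the number of mismatches,
// or -1 if a CUDA call fails.
int run_reverse_check() {
    int h_in[kThreadsPerBlock], h_out[kThreadsPerBlock];
    for (int i = 0; i < kThreadsPerBlock; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    size_t bytes = sizeof(h_in);
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1; }

    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    reverse_in_block<<<1, kThreadsPerBlock>>>(d_in, d_out);
    // This blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < kThreadsPerBlock; ++i)
        if (h_out[i] != kThreadsPerBlock - 1 - i) ++mismatches;
    return mismatches;
}

int main() {
    int result = run_reverse_check();
    std::printf("mismatches: %d\n", result);
    return result == 0 ? 0 : 1;
}
```

With a single warp per block the barrier is much cheaper, which is the trade-off I am guessing at above; the actual cost on your hardware would need to be measured.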
What problem are you running into?