Be realistic. Work items aren't threads. The hardware runs wavefront-wide threads, so at the very least it always has to run a multiple of the wavefront size; nothing else is possible. The hardware also dispatches these threads in groups, for efficiency and LDS allocation reasons, and not doing so would add considerable overhead. What you gain from this design is an efficient execution model.
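To put a number on "a multiple of that", here is a minimal host-side sketch of the rounding. The width of 64 and the names WAVEFRONT_SIZE, round_up and wavefronts_needed are my own assumptions for illustration; check what your device actually reports rather than trusting the constant.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed wavefront width. 64 is common on AMD hardware, but it is an
   assumption here, not something to hard-code without checking. */
#define WAVEFRONT_SIZE 64u

/* Round n up to the next multiple of `multiple` (which must be nonzero). */
static size_t round_up(size_t n, size_t multiple)
{
    assert(multiple != 0);
    return ((n + multiple - 1) / multiple) * multiple;
}

/* Number of whole wavefronts needed to cover n work items. */
static size_t wavefronts_needed(size_t n)
{
    return round_up(n, WAVEFRONT_SIZE) / WAVEFRONT_SIZE;
}
```

With these assumptions, 1000 work items round up to 1024, so 16 wavefronts run whether or not the last 24 lanes have anything useful to do.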
Think of your data in terms of that reality. You can, if you like, have an if that masks out the work items you don't have valid data for; that's one approach. The alternative, knowing how the hardware really works, is simply to lay your data out appropriately. Make sure there is data there for every work item, even if some of it is junk. Let the hardware tick along processing that data and spitting out equally junk results. That way you can drop the if test that every work item would otherwise have to execute, even though it only matters for the very last wavefront or two.
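A rough sketch of the padded layout, written as plain host-side C so it can be run anywhere. Everything here is hypothetical: GROUP_SIZE, make_padded, square_item and run_padded are names I made up, the loop only stands in for the dispatch, and squaring is a placeholder for whatever your kernel does.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumed work-group size, chosen as a multiple of the wavefront width. */
#define GROUP_SIZE 64u

/* Copy n valid elements into a buffer padded up to a multiple of GROUP_SIZE.
   The tail is zero filler: junk as far as the results go, but safe to read. */
static float *make_padded(const float *src, size_t n, size_t *padded_n)
{
    size_t total = ((n + GROUP_SIZE - 1) / GROUP_SIZE) * GROUP_SIZE;
    float *buf = calloc(total ? total : 1, sizeof *buf);
    if (buf == NULL)
        return NULL;
    if (n > 0)
        memcpy(buf, src, n * sizeof *buf);
    *padded_n = total;
    return buf;
}

/* Stand-in for the kernel body: one call per work item, no bounds check. */
static void square_item(const float *in, float *out, size_t gid)
{
    out[gid] = in[gid] * in[gid];
}

/* Stand-in for the dispatch: every padded work item runs, junk included. */
static void run_padded(const float *in, float *out, size_t padded_n)
{
    assert(padded_n % GROUP_SIZE == 0);
    for (size_t gid = 0; gid < padded_n; ++gid)
        square_item(in, out, gid);
}
```

Note that the output buffer has to be padded to the same size, since the junk work items write their junk results somewhere. The host then just reads back the first n results and ignores the rest.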