The global size must be divisible by the local size. From the OpenCL 1.1 specification, page 132:
If local_work_size is specified, the values specified in global_work_size[0], … global_work_size[work_dim - 1] must be evenly divisible by the corresponding values specified in local_work_size[0], … local_work_size[work_dim - 1].
So if your problem size isn't a convenient multiple of the local size, the solution is to round the global size up and launch a few extra work-items that either do nothing or whose results you ignore afterwards.
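As a rough sketch of that approach (not the only way to do it): the host rounds the global size up to the next multiple of the local size, and the kernel checks its index against the real element count. The helper name round_up_global_size, the kernel name scale, and its parameters are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round n up to the next multiple of local_size,
 * so the result can be passed as global_work_size. */
static size_t round_up_global_size(size_t n, size_t local_size)
{
    assert(local_size > 0);
    size_t remainder = n % local_size;
    return remainder == 0 ? n : n + (local_size - remainder);
}

/* Illustrative OpenCL C kernel source (the kernel and its arguments are
 * assumptions). The real element count n is passed in, and work-items
 * with an index at or beyond n simply do nothing. */
const char *const kernel_src =
    "__kernel void scale(__global float *data, const uint n)\n"
    "{\n"
    "    size_t gid = get_global_id(0);\n"
    "    if (gid < n)\n"
    "        data[gid] *= 2.0f;\n"
    "}\n";
```

With 1000 elements and a local size of 64, for example, you would enqueue a global size of round_up_global_size(1000, 64) and pass 1000 as n.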
Be realistic. Work-items aren't threads. The hardware runs wavefront-wide threads, so at the very least it always has to run a multiple of the wavefront size; nothing else is possible. The hardware dispatches these threads in groups for efficiency and for LDS (local data share) allocation reasons, and not doing so would add considerable overhead. What you gain from this design is an efficient execution model.
Think of your data in terms of that reality. One approach is to add an if that masks out the work-items you don't have valid data for. The alternative, knowing how the hardware really works, is to lay your data out appropriately: make sure there is data for every work-item, even if some of it is junk. Let the hardware tick along processing that data and spitting out equally junk results, which you then ignore. That way you can drop the if test that would otherwise run in every work-item, even though it only matters for the very last wavefront or two.
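A minimal host-side sketch of the padding approach, assuming a 1-D array of floats: the input is copied into a buffer whose length is a multiple of the local size, so every work-item has something to read and the kernel needs no bounds check. The helper name pad_input is hypothetical, and filling the tail with zeros is just one choice of "junk" that keeps the extra results well-defined.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy n floats from src into a newly allocated,
 * zero-filled buffer whose length is the next multiple of local_size.
 * Stores the padded length in *padded_n. Returns NULL on allocation
 * failure; the caller frees the result. */
static float *pad_input(const float *src, size_t n,
                        size_t local_size, size_t *padded_n)
{
    assert(local_size > 0 && padded_n != NULL);

    size_t remainder = n % local_size;
    size_t total = remainder == 0 ? n : n + (local_size - remainder);

    /* calloc zero-fills, so the padding elements hold harmless values. */
    float *dst = calloc(total > 0 ? total : 1, sizeof *dst);
    if (dst == NULL)
        return NULL;

    if (n > 0)
        memcpy(dst, src, n * sizeof *dst);

    *padded_n = total;
    return dst;
}
```

The padded length is then used both for the device buffer size and as the global work size. After reading the results back, only the first n elements are meaningful.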