1) Pass the actual size as a kernel parameter and compare get_global_id(0) against it at the start of the kernel.
2) It is actually the most effective way to use the GPU. The number of wavefronts required to execute the kernel matters much more than the global work size.
3) Read the AMD OpenCL Programming Guide.
In answer to your specific questions:
1) Yes, it's beyond the range of the data, but as the example probably shows (and as maximmoroz pointed out) you can pass the actual size in and catch out-of-range work-items with a conditional.
2) This is why defining "thread" sensibly is important. You're not wasting wavefronts, and the wavefront is the real execution multiple of the machine (a wavefront is what a thread is, if you think in CPU terms). You may waste a few lanes of the last one, but who cares? The machine will be running maybe 150 wavefronts concurrently, and thousands over a dispatch; if a quarter of one of those thousands of wavefronts goes unused, you won't even notice the difference.
The correct answer to the general question of whether you want to round your launch up, though, is "maybe".
It depends on whether the if test at the beginning of the kernel incurs more overhead than padding your data does. If you can lay your data out with ignorable junk at the end, up to the next wavefront boundary, then on any reasonably sized dataset the extra memory reads will be negligible; but if the kernel is short, the branch will be comparatively costly.
On the other hand, rearranging your data that way might be difficult in some circumstances, and doing so might incur the overhead of a copy.
In general, for peak performance you should lay your data out to match the fact that you're running on a vector processor when you use the GPU, and that includes padding so the kernel can skip the range test.