Or just place this check in the kernel so you don't get an access violation:
if(get_global_id(0) < width && get_global_id(1) < height)
Note that you don't get an error when you access outside the allocated buffer on the GPU, but it can silently mess up memory somewhere else, so it is even worse than on the CPU.
I am not aware of a case where the global size would not be divisible by the local size. Are you trying to combine the work of many pixels inside one kernel?
Can you please tell us your global and local sizes?
As nou mentioned, you don't need to pad your image, you just need to pad the dimensions of your threads. For example, say you want a group size of 8x8 but your image is 10x10. Simply pad the dimensions for the clEnqueueNDRangeKernel() to 16x16 but then clamp within the kernel itself.
You will lose some efficiency because some wavefronts will only be partially utilized, but for large images it should be marginal.
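To make the padding concrete, here is a rough host-side sketch in plain C. The helper names (round_up, in_bounds, count_active) are made up for illustration and are not part of the OpenCL API. round_up gives the padded value you would pass as the global size, and in_bounds mirrors the guard each work-item would evaluate with get_global_id(0) and get_global_id(1).

```c
#include <stddef.h>

/* Hypothetical helper: round value up to the next multiple of `multiple`.
   `multiple` must be greater than zero. */
static size_t round_up(size_t value, size_t multiple)
{
    return ((value + multiple - 1) / multiple) * multiple;
}

/* Host-side stand-in for the in-kernel guard: a work-item only touches
   the image when its global id lies inside the real image bounds. */
static int in_bounds(size_t gx, size_t gy, size_t width, size_t height)
{
    return gx < width && gy < height;
}

/* Count how many work-items of the padded range pass the guard,
   assuming a square work-group of local x local. */
static size_t count_active(size_t width, size_t height, size_t local)
{
    size_t padded_w = round_up(width, local);
    size_t padded_h = round_up(height, local);
    size_t active = 0;

    for (size_t y = 0; y < padded_h; ++y)
        for (size_t x = 0; x < padded_w; ++x)
            if (in_bounds(x, y, width, height))
                ++active;

    return active;
}
```

For the 10x10 image with 8x8 groups, round_up(10, 8) is 16, so the range is 16x16 = 256 work-items, of which 100 pass the guard. The idle fraction is large here only because the image is tiny; it shrinks as the image grows relative to the group size.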
Thank you a lot for the answers, and sorry for replying so late. I had totally forgotten that I hadn't answered yet.
I did just what Jeff Golds and nou suggested.
Also, one of the parameters for clEnqueueWriteImage etc. is the copy region, i.e. the size in pixels that should be copied. This enables you to safely move data between the host and device buffer without worrying about overruns.