I am currently working on an OpenCL implementation of an algorithm that operates on images. Since it works pixel by pixel, I want to use one work-item per pixel and therefore set the two-dimensional global work size to (image_width, image_height). As far as I have found out, the global work size must be evenly divisible by the local work size (the work-group size) in each dimension. That is of course not possible for every image size, and I am not sure how to handle the case where it does not divide evenly (AFAIK the SDK samples do not cover it). At the moment I round the global work size up to the next multiple of the local work size. On the CPU, however, this would cause a segfault, because the extra work-items access positions in the buffer that lie beyond the bounds of the image array.
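To make the rounding concrete, here is a minimal host-side sketch in C of rounding a global size up to the next multiple of the local size. The helper names `round_up` and `padded_global_size` are hypothetical and exist only for this illustration; they are not part of the OpenCL API, and the sketch makes no OpenCL calls.

```c
#include <assert.h>
#include <stddef.h>

/* Round `value` up to the next multiple of `multiple`.
   `multiple` must be greater than zero. */
size_t round_up(size_t value, size_t multiple)
{
    assert(multiple > 0);
    size_t remainder = value % multiple;
    return remainder == 0 ? value : value + (multiple - remainder);
}

/* Hypothetical helper: computes a 2D global size that is evenly
   divisible by the given local size in both dimensions. */
void padded_global_size(size_t image_width, size_t image_height,
                        const size_t local_size[2], size_t global_size[2])
{
    global_size[0] = round_up(image_width, local_size[0]);
    global_size[1] = round_up(image_height, local_size[1]);
}
```

For a 1000 x 750 image and a 16 x 16 local size this gives a global size of 1008 x 752, so 8 extra columns and 2 extra rows of work-items fall outside the image.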
What is the common solution for this? Do I have to adjust the buffer size in the same way and transfer unnecessary padding data to the GPU?
Thanks in advance,