We need to choose the local work-group size to be on the order of the warp size for optimal performance.
Does that also apply to the global work size? I've seen the following code in one of the Nvidia OpenCL slides:
size_t localWorkSize = 256; // or 64
// round up to the next multiple of localWorkSize
size_t numberWorkGroups = (N + localWorkSize - 1) / localWorkSize;
size_t globalWorkSize = numberWorkGroups * localWorkSize;
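To make the concern concrete, here is a minimal sketch of the same rounding arithmetic as plain C helpers. The function names are my own (hypothetical, not from the slides or the OpenCL API), and N = 1000 with a local size of 256 is just a value I picked for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of work-groups needed to cover n items
   (ceiling division, same formula as in the slide). */
size_t num_work_groups(size_t n, size_t local_size)
{
    assert(local_size > 0); /* a zero local size would divide by zero */
    return (n + local_size - 1) / local_size;
}

/* Hypothetical helper: global work size rounded up to a multiple of local_size. */
size_t rounded_global_size(size_t n, size_t local_size)
{
    return num_work_groups(n, local_size) * local_size;
}

/* Hypothetical helper: how many work-items end up with a global id >= n. */
size_t surplus_work_items(size_t n, size_t local_size)
{
    return rounded_global_size(n, local_size) - n;
}
```

With N = 1000 and a local size of 256, this gives 4 work-groups, a global work size of 1024, and 24 work-items (ids 1000 to 1023) that have no element to process.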
But rounding up and multiplying makes the global work size larger than N.
1) We would have global ids beyond what is required; how do we take care of that?
2) It can waste OpenCL threads. Is there any advantage in doing it?