The global work size must be divisible by the local work-group size in each and every dimension.
Are you making sure of this?
Also, 16 is not a good work-group size on a GPU. Use multiples of 64 so that the hardware is used effectively; otherwise you waste hardware cycles on work-items that do not exist.
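A common way to satisfy the divisibility rule is to round the global size up to the next multiple of the local size in every dimension. The following is only a minimal sketch of that idea: the helper names `round_up_global` and `pad_ndrange` are made up for illustration and are not part of the OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round global_size up to the next multiple of
 * local_size, so that global_size % local_size == 0 afterwards. */
static size_t round_up_global(size_t global_size, size_t local_size)
{
    assert(local_size > 0);
    size_t remainder = global_size % local_size;
    return remainder == 0 ? global_size
                          : global_size + (local_size - remainder);
}

/* Hypothetical helper: pad every dimension of the problem size.
 * problem[d] is the number of work-items actually needed in dimension d,
 * local[d] is the chosen work-group size in that dimension. */
static void pad_ndrange(size_t work_dim, const size_t *problem,
                        const size_t *local, size_t *global_out)
{
    for (size_t d = 0; d < work_dim; ++d)
        global_out[d] = round_up_global(problem[d], local[d]);
}
```

The padded array would then be passed as the global_work_size argument of clEnqueueNDRangeKernel. Since padding creates extra work-items, the kernel should most likely return early for any index beyond the real problem size, for example by comparing get_global_id(0) against the element count passed in as a kernel argument.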
Yes, thank you! You are right... that was the problem: the global work size has to be divisible by the local work size in every dimension!