Interesting presentation on getting higher performance by reducing occupancy:
Vasily Volkov's "Better Performance at Lower Occupancy" (GTC 2010).
It is written for CUDA, but some of the concepts should apply to OpenCL as well.
http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf