It seems you have run into the same bug I did, described here: "Poor workgroup reduction function performance (OpenCL 2.0)".
I still find workgroup reductions to be quite slow; they produce a huge stream of instructions.
What type of GPU are you using?
I'm using a FirePro W7100.
Given how long SDK v3.0 has been in beta, I wonder what kind of testing was done that let performance be neglected to this degree.
Essentially, it's like getting an emulation of the OpenCL 2.0 features without being able to use them in a production environment.
Tomer Gal, CTO at OpTeamizer
Hmm... It's sad that even the W7100 exhibits such low performance. I was hoping that the Tonga-based W7100, which features the most recent instruction set (Volcanic Islands), would perform better, but that seems not to be the case. This effectively eliminates the benefit of the workgroup reduction functions and forces programmers to fall back on handcrafted implementations. GPU programming is about performance, after all.
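For anyone wondering what a handcrafted implementation usually looks like: the common pattern is a tree reduction in local memory. Below is only a minimal sketch of that pattern, emulated in plain C on the host so it can be run anywhere. The function name `emulate_workgroup_reduce_add` is made up for illustration, and it assumes the work-group size is a power of two and that the values have already been copied into the local scratch buffer. In a real kernel the inner loop disappears (each work-item handles its own `lid`) and a `barrier(CLK_LOCAL_MEM_FENCE)` separates the steps.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of any SDK): emulates one work-group
 * summing wg_size values that already sit in local memory (scratch).
 * Assumption: wg_size is a power of two. */
int emulate_workgroup_reduce_add(int *scratch, size_t wg_size)
{
    assert(wg_size > 0 && (wg_size & (wg_size - 1)) == 0);

    /* Each step halves the number of active work-items. */
    for (size_t stride = wg_size / 2; stride > 0; stride /= 2) {
        /* In a kernel, this loop is the set of work-items with
         * get_local_id(0) < stride, all running in parallel. */
        for (size_t lid = 0; lid < stride; ++lid) {
            scratch[lid] += scratch[lid + stride];
        }
        /* In a kernel, barrier(CLK_LOCAL_MEM_FENCE) would go here so
         * every partial sum is visible before the next step. */
    }

    /* Work-item 0 ends up holding the total for the whole group. */
    return scratch[0];
}
```

The result lands in `scratch[0]`, which is what work-item 0 would write out to global memory. Whether this beats the built-in on a given driver is something you would have to measure yourself.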