I reported this previously, but here I provide a test case for examining the workgroup reduction function. The kernels perform a workgroup reduction in 3 ways:
1) The classical one with shared memory (OpenCL 1.2)
2) Shared memory plus sub-group reduction function on the final stage
3) Workgroup reduction function (no shared memory at all)
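For reference, the access pattern of the first variant can be sketched on the host in plain C. This is only an illustration under stated assumptions, not the kernel from the repository: the function name `tree_reduce_add` and the `scratch` array (standing in for `__local` memory) are hypothetical, the reduction operator is assumed to be addition, and the workgroup size is assumed to be a power of two. The second and third variants would replace the tail or the whole of this loop with built-ins, presumably along the lines of `sub_group_reduce_add` and `work_group_reduce_add`.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side emulation of a classical local-memory tree reduction.
 * scratch : stands in for the __local buffer, one element per work-item
 * wg_size : emulated workgroup size, assumed to be a power of two
 * Returns the sum of all elements; scratch is overwritten in the process. */
int tree_reduce_add(int *scratch, size_t wg_size)
{
    /* The halving scheme below only covers every element for powers of two. */
    assert(wg_size > 0 && (wg_size & (wg_size - 1)) == 0);

    /* Each pass of the outer loop is one reduction step; in a real kernel
     * a barrier(CLK_LOCAL_MEM_FENCE) would separate consecutive steps. */
    for (size_t stride = wg_size / 2; stride > 0; stride /= 2) {
        /* The inner loop plays the role of the work-items with lid < stride,
         * which on a GPU would execute this addition in parallel. */
        for (size_t lid = 0; lid < stride; ++lid) {
            scratch[lid] += scratch[lid + stride];
        }
    }
    return scratch[0];
}
```

With a workgroup of 256 work-items this takes 8 barrier-separated steps, which is the baseline the other two variants are measured against.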
I tested it on an R7-260X, and the latter two kernels prove to be significantly slower than the reduction in shared memory. The last one in particular is more than 5 times slower than using pure shared memory. This eliminates the value of these new functions in OpenCL 2.0. AFAIK, GCN GPUs feature swizzle operations which could potentially make workgroup functions quite efficient. However, this is not the case.
In addition, the CodeXL 1.6 static kernel analyser does not support OpenCL 2.0 kernels, so I cannot investigate the disassembled kernel code.
Code on Github: https://github.com/ekondis/cl2-reduce-bench
More details on blog: http://parallelplusplus.blogspot.gr/2014/12/workgroup-reduction-function-evaluation.html