I thought I would try out some of the new OpenCL 2.0 work-group functions.
Comparing the performance of work_group_scan_inclusive_add against my home-grown prefix scan, I found that
work_group_scan_inclusive_add led to less work-item divergence, but used up 10 more VGPRs.
My own scan, using local memory, led to more divergence but no increase in VGPR usage.
Overall, work_group_scan_inclusive_add was faster. But is there a way for the built-in
to reuse existing registers rather than increase register pressure?
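
For concreteness, here is a minimal sketch of the two kinds of kernel being compared. It is an illustration under assumptions, not the exact code that was measured: the kernel names, the WG_SIZE value, the int element type, and the choice of a Hillis-Steele scan for the local-memory variant are all hypothetical. It also assumes the global size is a multiple of WG_SIZE and that the program is built with -cl-std=CL2.0.

```opencl
// Sketch only: names, WG_SIZE and element type are assumptions.
#define WG_SIZE 256

// Variant 1: OpenCL 2.0 built-in work-group scan.
__kernel __attribute__((reqd_work_group_size(WG_SIZE, 1, 1)))
void scan_builtin(__global const int *in, __global int *out)
{
    size_t gid = get_global_id(0);
    // Every work-item in the work-group must reach this call.
    out[gid] = work_group_scan_inclusive_add(in[gid]);
}

// Variant 2: hand-written inclusive scan (Hillis-Steele) in local memory.
__kernel __attribute__((reqd_work_group_size(WG_SIZE, 1, 1)))
void scan_local(__global const int *in, __global int *out)
{
    __local int tmp[WG_SIZE];
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    tmp[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    for (uint offset = 1; offset < WG_SIZE; offset <<= 1) {
        // Work-items with lid < offset add nothing; this per-item
        // condition is one place where divergence can show up.
        int addend = (lid >= offset) ? tmp[lid - offset] : 0;
        barrier(CLK_LOCAL_MEM_FENCE);   // all reads done before any write
        tmp[lid] += addend;
        barrier(CLK_LOCAL_MEM_FENCE);   // all writes done before next read
    }

    out[gid] = tmp[lid];
}
```

In the second variant the partial sums live in local memory between steps, whereas the built-in is free to keep its intermediate values in registers, which would be consistent with the VGPR difference described above. How the built-in is actually lowered depends on the compiler and device.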