Thank you for your suggestion.
AFAIK, the latest OpenCL compiler can generate cross-lane instructions for the subgroup functions (cl_khr_subgroups). I'm not sure about this particular test case, though.
Anyway, I'll share your suggestions with the appropriate team. Hopefully more optimizations in this regard will be added in the future.
Yes indeed, cl_khr_subgroups enables some functionality, like the sub_group_reduce function I mentioned.
The generated ISA does use some of the cross-lane operations described on the page, but in a suboptimal fashion: manual code using LDS is faster, whereas the generated ISA should be much more efficient if it used the method described at the bottom of the GPUOpen page.
However, cl_khr_subgroups doesn't enable reduce operations at a level smaller than the subgroup, which is what I would need.
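
To make the request concrete, here is a minimal host-side sketch in plain C of the kind of reduction I mean. It is only a model, not OpenCL kernel code and not the method from the GPUOpen page: the name cluster_reduce_add, the WAVE_SIZE constant (assumed to be 64) and the XOR butterfly exchange are my own choices for illustration. Each "lane" ends up holding the sum of its own cluster of lanes, where the cluster is a power of two smaller than (or equal to) the subgroup.

```c
#include <assert.h>
#include <string.h>

#define WAVE_SIZE 64 /* assumed subgroup (wavefront) width, for illustration */

/*
 * Host-side model of a reduction over clusters smaller than the subgroup.
 * `in` and `out` each hold WAVE_SIZE values, one per lane.
 * `cluster` must be a power of two in [1, WAVE_SIZE].
 * After the call, out[lane] is the sum of all inputs in that lane's cluster.
 */
static void cluster_reduce_add(const int *in, int *out, int cluster)
{
    int cur[WAVE_SIZE];
    int next[WAVE_SIZE];

    assert(cluster > 0 && cluster <= WAVE_SIZE);
    assert((cluster & (cluster - 1)) == 0); /* power of two */

    memcpy(cur, in, sizeof cur);

    /* XOR butterfly: log2(cluster) steps, each lane adds its partner's value.
     * Because offset < cluster, lane ^ offset never leaves the lane's cluster. */
    for (int offset = 1; offset < cluster; offset <<= 1) {
        for (int lane = 0; lane < WAVE_SIZE; ++lane)
            next[lane] = cur[lane] + cur[lane ^ offset];
        memcpy(cur, next, sizeof cur);
    }

    memcpy(out, cur, sizeof cur);
}
```

With cluster = 16, for example, lanes 0 to 15 all receive the sum of inputs 0 to 15, lanes 16 to 31 the sum of inputs 16 to 31, and so on. On the GPU, each butterfly step is a lane-to-lane exchange, which is what I would hope the compiler could map to cross-lane instructions rather than LDS traffic.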