According to https://gpuopen.com/amd-gcn-assembly-cross-lane-operations/, the hardware is able to do refined reduce operations.
By 'refined', I mean doing an add/min/max among neighboring work-items 0-3, 4-7, etc. or 0-7, 8-15, etc. in a wavefront.
In my use case, my kernels would benefit from having reduce operations available among power-of-two groups of neighboring work-items. It would be even better if the reduce operations allowed non-power-of-two patterns, but that is likely harder to implement. For my part, power-of-two groups would already be very useful.
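To make the request concrete, here is a minimal sketch in plain C of the semantics I have in mind, emulated on the CPU. The function name `clustered_reduce_add_ref` and the `cluster` parameter are hypothetical, only for illustration; they are not part of any existing OpenCL API.

```c
#include <assert.h>

#define WAVE_SIZE 64 /* width of a GCN wavefront */

/* Hypothetical reference semantics (CPU emulation, not real device code):
   every lane receives the sum over the group of `cluster` neighboring
   lanes it belongs to, e.g. cluster = 4 gives groups 0-3, 4-7, ... */
void clustered_reduce_add_ref(const int in[WAVE_SIZE], int out[WAVE_SIZE],
                              unsigned cluster)
{
    /* Only power-of-two group sizes up to the wavefront width. */
    assert(cluster != 0 && (cluster & (cluster - 1)) == 0);
    assert(cluster <= WAVE_SIZE);

    for (unsigned lane = 0; lane < WAVE_SIZE; ++lane) {
        unsigned first = lane & ~(cluster - 1); /* first lane of this group */
        int sum = 0;
        for (unsigned i = first; i < first + cluster; ++i)
            sum += in[i];
        out[lane] = sum;
    }
}
```

The min and max variants would have the same shape with the `+=` replaced by the corresponding comparison.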
Currently the OpenCL spec allows reduce operations over a whole sub-group or work-group (the sub_group_reduce_* and work_group_reduce_* functions).
However, the generated ISA (on RX480) is very disappointing and is slower than my manual code using LDS.
An add reduction at the wavefront level should be implemented like the last example on the gpuopen web page (replacing the NOPs with independent instructions), thus using only 7 add instructions.
I expect the new extension for the refined add/min/max reduce would need only a few add instructions, by adapting that same example.
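As a rough illustration of why only a few adds should be needed, here is a CPU emulation in plain C of a log-step reduction within power-of-two groups. Note the assumptions: this uses an XOR butterfly exchange, which is not the exact DPP sequence from the gpuopen example, and the function name is hypothetical. It only shows that the number of cross-lane add steps grows with log2 of the group size (2 steps for groups of 4, 3 steps for groups of 8).

```c
#include <assert.h>
#include <string.h>

#define WAVE_SIZE 64 /* width of a GCN wavefront */

/* CPU emulation of an XOR-butterfly add within groups of `cluster` lanes.
   Each outer iteration stands for one cross-lane add executed by the whole
   wavefront. Returns the number of such add steps. */
unsigned clustered_reduce_add_butterfly(int v[WAVE_SIZE], unsigned cluster)
{
    assert(cluster != 0 && (cluster & (cluster - 1)) == 0);
    assert(cluster <= WAVE_SIZE);

    unsigned steps = 0;
    for (unsigned offset = 1; offset < cluster; offset <<= 1) {
        int prev[WAVE_SIZE];
        memcpy(prev, v, sizeof prev); /* all lanes read before any lane writes */
        for (unsigned lane = 0; lane < WAVE_SIZE; ++lane)
            v[lane] = prev[lane] + prev[lane ^ offset]; /* partner stays in group */
        ++steps;
    }
    return steps;
}
```

Because `offset` is always smaller than `cluster`, `lane ^ offset` never leaves the lane's own group, so every lane ends up holding the sum of its group.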
To my knowledge, it should already be possible to implement the feature with inline assembly. However, I would like an OpenCL extension instead because:
. The compiler can adapt to the requirements of each AMD hardware generation (number of NOPs, etc.). I don't want to target only one generation or update the code for every generation.
. The compiler can replace the NOPs with independent operations.
. The code is more portable.
. If AMD implements new hardware features in the future that help the extension's performance, AMD can use them for the extension.