OpenCL

mannerov · ‎09-21-2018

Hi,

According to https://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ the hardware is able to do refined reduce operations.

By 'refined', I mean doing an add/min/max among neighboring work-items 0-3, 4-7, etc., or 0-7, 8-15, etc., within a wavefront.

In my use case, my kernels would benefit from reduce operations over power-of-two groups of neighboring work-items. Allowing non-power-of-two patterns would be even better, but that is likely harder to implement. For my part, power of two would already be very useful.
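To make the requested semantics concrete, here is a minimal host-side sketch in plain C. It is only a reference model, not AMD's implementation and not an existing OpenCL API: the function name `segmented_reduce_add`, the 64-lane `WAVE_SIZE`, and the choice of an XOR butterfly are all assumptions made for illustration. Each lane ends up holding the sum of its own group of `width` neighboring lanes.

```c
#include <assert.h>

#define WAVE_SIZE 64u /* assumption: 64-lane wavefront, as on GCN hardware */

/* Hypothetical reference model of a "refined" reduce add.
 * in/out hold one value per lane. After the call, out[lane] is the sum of
 * the `width` neighboring lanes that form the group containing `lane`.
 * width must be a power of two between 1 and WAVE_SIZE. */
static void segmented_reduce_add(const int *in, int *out, unsigned width)
{
    int cur[WAVE_SIZE];
    int next[WAVE_SIZE];

    /* Reject zero, non-power-of-two, and oversized widths. */
    assert(width >= 1 && width <= WAVE_SIZE && (width & (width - 1)) == 0);

    for (unsigned lane = 0; lane < WAVE_SIZE; ++lane)
        cur[lane] = in[lane];

    /* XOR butterfly: log2(width) combine steps. Because offset < width and
     * groups are aligned to width, lane ^ offset never leaves the group. */
    for (unsigned offset = 1; offset < width; offset <<= 1) {
        for (unsigned lane = 0; lane < WAVE_SIZE; ++lane)
            next[lane] = cur[lane] + cur[lane ^ offset];
        for (unsigned lane = 0; lane < WAVE_SIZE; ++lane)
            cur[lane] = next[lane];
    }

    for (unsigned lane = 0; lane < WAVE_SIZE; ++lane)
        out[lane] = cur[lane];
}
```

With `in[lane] = lane`, a width of 4 leaves 6 in lanes 0-3 and 22 in lanes 4-7, and a width of 8 leaves 28 in lanes 0-7. Replacing `+` with a min or max gives the other two variants.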

Currently the OpenCL spec only provides reduce operations over a whole sub-group or a whole work-group (the sub_group_reduce_<op> and work_group_reduce_<op> functions).

However, the ISA generated for them (on an RX 480) is very disappointing: it is slower than my manual code using LDS.

A wavefront-level reduce add should be implemented like the last example on the GPUOpen page (with the NOPs replaced by independent instructions), thus using only 7 add instructions.

I expect the new extension for the refined add/min/max reduce would use only a few add instructions by adapting the last example of the gpuopen webpage.

To my knowledge, it should already be possible to implement this feature with inline assembly. However, I would prefer an OpenCL extension because:

. The compiler can adapt to the requirements of each AMD hardware generation (number of NOPs, etc.). I don't want to target only one generation or update the code for every generation.

. The compiler can replace the NOPs with independent operations.

. The code is more portable

. If AMD implements new hardware features in the future that help the extension's performance, AMD can use them for the extension.

Thanks !

dipak · ‎09-21-2018

Thank you for your suggestion.

AFAIK, the latest OpenCL compiler can generate the cross-lane instructions for the subgroup functions (cl_khr_subgroups). I'm not sure about this particular test case, though.

Anyway, I'll share your suggestions with the appropriate team. I hope more optimizations in this regard will be added in the future.

Thanks.

mannerov · ‎09-21-2018

Yes indeed, cl_khr_subgroups enables some functionality, such as the sub_group_reduce function I mentioned.

The generated ISA does use some of the cross-lane operations described on the page, but in a suboptimal fashion: my manual code using LDS is faster, whereas the generated ISA should be much more efficient if it used the method described at the bottom of the GPUOpen page.

However, cl_khr_subgroups doesn't provide reduce operations at a level smaller than the sub-group, which is what I need.

Please add new extension for refined reduce in wavefront