Back in the early days of OpenCL, AMD added the famous cl_amd_media_ops (2) to expose hardware features to programmers. Sadly, this is not the case for some of their more recent or more hidden hardware features, like GDS or the cross-lane operations. In fact, with the amdgpu-pro drivers or Windows Adrenalin it is almost impossible to use these features without an external disassembler / assembler, which makes them very painful to use, especially in quickly changing products or long programs.
Thus I want to propose two new extensions to be implemented: one that could be named cl_amd_gds and one named cl_amd_cross_lane_ops.
For the GDS, I am aware that virtualization is an issue, especially since its contents remain valid across kernels. So I would propose a dedicated address space qualifier __gds (similar to __global) that also needs to be initialized like global memory: special host functions would do the virtualization in software, and the memory would only be available as a kernel argument and could not be allocated or initialized within the kernel. Access and barriers would work similarly to those for __global.
For the cross-lane operations, it would be nice to have at least
gentypen amd_ds_bpermute(gentypen sourceRegister, uint lane), where lane is taken modulo the wavefront size (32 or 64), which can be queried via CL_DEVICE_WAVEFRONT_WIDTH_AMD
gentypen amd_ds_permute(gentypen sourceRegister, uint lane)
and maybe some broadcast operation based on swizzle.
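To make the intended semantics of the two proposed functions concrete, here is a minimal host-side C sketch that emulates them for one wavefront. It is only an illustration under assumptions: the names emu_bpermute, emu_permute and WAVE_SIZE are hypothetical, each lane holds a single 32-bit value, the execution mask is ignored, and for the forward permute I assume that lanes nobody writes to read back 0 and that, when several lanes target the same destination, the highest source lane wins.

```c
#include <assert.h>
#include <stdint.h>

#define WAVE_SIZE 64 /* assumed wavefront width; would be 32 on some devices */

/* Backward permute ("pull"): lane i reads the value held by lane[i] % WAVE_SIZE.
   Hypothetical emulation of the proposed amd_ds_bpermute. */
static void emu_bpermute(const uint32_t *src, const uint32_t *lane, uint32_t *dst)
{
    assert(src != dst); /* in-place use would read already overwritten lanes */
    for (int i = 0; i < WAVE_SIZE; i++)
        dst[i] = src[lane[i] % WAVE_SIZE];
}

/* Forward permute ("push"): lane i writes its value to lane[i] % WAVE_SIZE.
   Hypothetical emulation of the proposed amd_ds_permute. */
static void emu_permute(const uint32_t *src, const uint32_t *lane, uint32_t *dst)
{
    assert(src != dst);
    for (int i = 0; i < WAVE_SIZE; i++)
        dst[i] = 0; /* assumption: unwritten lanes read back as 0 */
    for (int i = 0; i < WAVE_SIZE; i++)
        dst[lane[i] % WAVE_SIZE] = src[i]; /* assumption: highest source lane wins */
}
```

For example, calling both with lane[i] = i + 1 rotates the values by one lane: the pull variant gives every lane the value of its upper neighbour, while the push variant hands every lane's value to its upper neighbour, with the last lane wrapping around to lane 0 because of the modulo.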
I think I would not be the only one who would love to have easy access to these great hardware features.