I had honestly forgotten about it, but a couple of weeks ago realhet told me that shuffle exists on GCN at the ISA level.
It rang a bell, and after a while I remembered a few pictures in "GS-4106 The AMD GCN Architecture - A Crash Course" by Layla Mah.
I see that CL 2.0 exposes broadcast through work_group_broadcast, and I can see why that is easier to specify than the rest, but...
Full 4-lane xbar? Yes.
I would be happy to see this as an extension, so as to bypass the lengthy scrutiny that additions to the core spec go through. Say, a CL_AMD_GCN_SIMD_SHUFFLE, guaranteed to work only with a work group size of 64. It would be enough for me. Is there any hope?
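To make the difference between the broadcast we already have and the shuffle being asked for concrete, here is a minimal host-side sketch in plain C. It only simulates the data movement across a 64-lane wavefront; the names wave_broadcast and wave_shuffle are hypothetical and do not correspond to any existing OpenCL function or AMD extension.

```c
#include <assert.h>
#include <stddef.h>

#define WAVE_SIZE 64 /* GCN wavefront width, also the work group size proposed above */

/* Broadcast: every lane receives the value held by ONE source lane.
   (Hypothetical helper; dst and src must not alias.) */
void wave_broadcast(int dst[WAVE_SIZE], const int src[WAVE_SIZE], size_t src_lane)
{
    assert(src_lane < WAVE_SIZE);
    for (size_t lane = 0; lane < WAVE_SIZE; ++lane)
        dst[lane] = src[src_lane];
}

/* Shuffle: each lane names its OWN source lane through idx[lane].
   (Hypothetical helper; dst and src must not alias.) */
void wave_shuffle(int dst[WAVE_SIZE], const int src[WAVE_SIZE],
                  const size_t idx[WAVE_SIZE])
{
    for (size_t lane = 0; lane < WAVE_SIZE; ++lane) {
        assert(idx[lane] < WAVE_SIZE);
        dst[lane] = src[idx[lane]];
    }
}
```

Broadcast is the special case of shuffle where every entry of idx is the same lane, which is probably part of why it is the easier one to specify.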
These should be implicitly exposed via the subgroup functions in CL 2.0. However, evaluating their performance has not shown any benefit so far. I have provided a test case on another thread: http://devgurus.amd.com/thread/169868
I usually don't trust any measurement taking less than 4-5 ms. On my system, those are normal fluctuations! I assume using profile mode takes care of that...
... although I'm pretty sure CL profile mode was invented to be used, I don't trust it. Most APIs have historically been lax about profile data, so I wouldn't take for granted that the results are coherent...
besides, I cannot rule out that my app had a small overhead with profiling on.
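One way around both the 4-5 ms noise floor and the doubts about profile mode is to repeat the work until the total wall-clock time is comfortably above the noise, then divide. Below is a rough sketch of that idea in C11, not a recommended benchmarking harness. The names workload_fn, time_per_run_ms and count_call are made up for illustration, and for a real kernel the callback would have to enqueue the kernel and wait for it to complete (for example with clFinish), which is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef void (*workload_fn)(void *ctx); /* hypothetical callback type */

/* Wall-clock time in milliseconds. TIME_UTC is not monotonic, so a clock
   adjustment during the measurement would skew the result. */
static double now_ms(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec * 1e3 + (double)ts.tv_nsec / 1e6;
}

/* Doubles the batch size until one batch takes at least min_total_ms,
   then returns the average time per run of that batch. */
double time_per_run_ms(workload_fn run, void *ctx, double min_total_ms,
                       unsigned long *runs_out)
{
    assert(run != NULL && min_total_ms > 0.0);
    for (unsigned long runs = 1;; runs *= 2) {
        double start = now_ms();
        for (unsigned long i = 0; i < runs; ++i)
            run(ctx);
        double elapsed = now_ms() - start;
        if (elapsed >= min_total_ms) {
            if (runs_out != NULL)
                *runs_out = runs;
            return elapsed / (double)runs;
        }
    }
}

/* Stand-in workload for demonstration: counts how often it was called.
   The volatile access keeps the compiler from collapsing the loop. */
void count_call(void *ctx)
{
    volatile unsigned long *counter = ctx;
    ++*counter;
}
```

Calling time_per_run_ms(count_call, &counter, 5.0, &runs) keeps doubling the batch until a single batch lasts at least 5 ms, so the reported per-run figure is averaged over a span longer than the fluctuations mentioned above.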
While reductions might be useful in some cases, I don't see value in a kernel doing only this; it's not relevant to me. In general, I've noticed that using advanced functionality makes simple kernels slower (I have been told they take more time to "set up") while usually making complex kernels faster. This kernel is super simple: its only memory access is writing out the result. There's no other workload, and this severely hampers the hardware's ability to do things optimally.
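For reference, this is roughly what a reduction built on lane shuffles does within one wavefront. It is a host-side simulation in C assuming a butterfly (XOR) exchange pattern over 64 lanes. It is not the code from the linked test case, and wave_reduce_add is a made-up name.

```c
#include <assert.h>
#include <stddef.h>

#define WAVE_SIZE 64 /* GCN wavefront width */

/* Butterfly reduction: at each step every lane adds the value held by the
   lane whose index differs in exactly one bit (lane ^ offset). After
   log2(WAVE_SIZE) = 6 steps every lane holds the full sum. */
int wave_reduce_add(const int in[WAVE_SIZE])
{
    int cur[WAVE_SIZE], next[WAVE_SIZE];

    assert((WAVE_SIZE & (WAVE_SIZE - 1)) == 0); /* pattern needs a power of two */

    for (size_t lane = 0; lane < WAVE_SIZE; ++lane)
        cur[lane] = in[lane];

    for (size_t offset = WAVE_SIZE / 2; offset > 0; offset /= 2) {
        /* One simulated shuffle plus add per lane. */
        for (size_t lane = 0; lane < WAVE_SIZE; ++lane)
            next[lane] = cur[lane] + cur[lane ^ offset];
        for (size_t lane = 0; lane < WAVE_SIZE; ++lane)
            cur[lane] = next[lane];
    }
    return cur[0];
}
```

That is six exchange-and-add steps with no memory traffic in between, which fits the point above: with so little work per kernel, any fixed setup cost is likely to dominate what gets measured.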