Thanks for your suggestion. I'll forward it to the team and hope they will consider it in the future.
These should be implicitly exposed via the subgroup functions in CL 2.0. However, when I evaluated their performance, they did not provide any benefit yet. I have provided a test case on another thread: http://devgurus.amd.com/thread/169868
- I don't need them for reduction but rather pure data shuffling 1:1 across work items.
- In case I need reductions, I would accept a minor performance loss if that saves VGPRs or LDS.
- I disagree on the testing methodology.
3. I disagree on the testing methodology.
Could you be more specific about what you think?
I usually don't trust any measurement taking less than 4-5 ms. On my system, fluctuations of that size are normal! I assume using profile mode takes care of that...
... While I'm pretty sure CL profile mode was invented to be used, I don't trust it. As most APIs have historically been easygoing about profile data, I wouldn't take for granted that the results are coherent...
Besides, I cannot rule out that my app had a small overhead with profiling on.
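To make the "nothing under 4-5 ms" rule concrete, here is a minimal sketch of how I would time things on the host instead of relying on profile mode: repeat the work in growing batches until a single batch runs longer than a chosen minimum, then report the average per run. Everything named here is hypothetical. `run_workload` is only a CPU stand-in for "enqueue the kernel and wait for it", and `time_per_run_ms` is a helper made up for illustration, not part of any API.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for the real work. In an OpenCL app this would
 * enqueue the kernel and block until it has finished (assumption). */
static volatile unsigned long sink;
static void run_workload(void)
{
    unsigned long acc = 0;
    for (unsigned long i = 0; i < 100000UL; ++i)
        acc += i;
    sink = acc; /* volatile store keeps the call from being removed entirely */
}

/* Monotonic wall-clock time in milliseconds. */
static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1000.0 + (double)ts.tv_nsec / 1.0e6;
}

/* Runs fn in batches, doubling the batch size until one batch takes at
 * least min_ms. Returns the average time per run of that final batch and
 * reports the batch size and its total duration through the out pointers. */
static double time_per_run_ms(void (*fn)(void), double min_ms,
                              unsigned long *runs_out, double *total_ms_out)
{
    assert(fn != NULL && min_ms > 0.0);
    unsigned long runs = 1;
    for (;;) {
        double start = now_ms();
        for (unsigned long i = 0; i < runs; ++i)
            fn();
        double elapsed = now_ms() - start;
        if (elapsed >= min_ms) {
            if (runs_out) *runs_out = runs;
            if (total_ms_out) *total_ms_out = elapsed;
            return elapsed / (double)runs;
        }
        runs *= 2; /* batch too short to trust, try a longer one */
    }
}
```

Calling `time_per_run_ms(run_workload, 5.0, &runs, &total_ms)` gives a per-run figure derived from a batch that lasted at least 5 ms, so run-to-run jitter of a few milliseconds gets averaged over many launches. This measures host wall-clock time, so it includes launch overhead; that is a deliberate trade-off here, not a claim that it matches what profile mode reports.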
While reductions might be useful in some cases, I don't see value in a kernel doing only this; it's not relevant to me. In general, I've noticed that using advanced functionality makes simple kernels slower (I have been told they take more time to "set up") while usually making complex kernels faster. This kernel is super simple: its only memory access is writing out the result. There's no other workload, and this severely hampers the HW's ability to do things optimally.