Archives Discussions

maxdz8 · ‎01-31-2015

I had honestly forgotten about it, but a couple of weeks ago realhet informed me that shuffle exists on GCN at the ISA level.

I vaguely remembered something about it, and after a while I recalled a few pictures in GS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah.

I see broadcast in CL2 via work_group_broadcast, and I can see why this is easier to specify than the rest, but...

Full 4-lane xbar? Yes.

I would be happy to see this as an extension, so as to bypass the lengthy scrutiny required for the core spec. Say, a CL_AMD_GCN_SIMD_SHUFFLE, guaranteed to work only with a work group size of 64. That would be enough for me. Is there any hope?
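
For readers unfamiliar with the operation being requested, here is a minimal host-side C sketch of what a 1:1 shuffle across a 64-wide wavefront means semantically. It is only a reference model under stated assumptions, not GCN ISA and not any real OpenCL extension: `WAVE_SIZE`, `wave_shuffle` and `wave_broadcast` are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define WAVE_SIZE 64  /* assumed lane count, matching the work group size of 64 above */

/* Hypothetical reference model of a 1:1 shuffle: lane i receives the value
 * currently held by lane src_lane[i]. Every lane reads before any lane
 * writes, which is why `in` and `out` are separate arrays. */
static void wave_shuffle(const uint32_t in[WAVE_SIZE],
                         const uint8_t src_lane[WAVE_SIZE],
                         uint32_t out[WAVE_SIZE])
{
    for (int lane = 0; lane < WAVE_SIZE; ++lane) {
        assert(src_lane[lane] < WAVE_SIZE);  /* source must be a valid lane */
        out[lane] = in[src_lane[lane]];
    }
}

/* Broadcast is the special case where every lane names the same source. */
static void wave_broadcast(const uint32_t in[WAVE_SIZE], uint8_t from_lane,
                           uint32_t out[WAVE_SIZE])
{
    uint8_t src_lane[WAVE_SIZE];
    for (int lane = 0; lane < WAVE_SIZE; ++lane)
        src_lane[lane] = from_lane;
    wave_shuffle(in, src_lane, out);
}
```

Seen this way, broadcast needs only a single source index for the whole group, while a general shuffle needs one source index per lane, which may be part of why the former was easier to specify.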

dipak · ‎02-02-2015

Thanks for your suggestion. I'll forward it to the team and hope they will consider it in the future.

Regards,

ekondis · ‎02-04-2015

These should be implicitly exposed via the subgroup functions in CL 2.0. However, when I evaluated their performance, they did not show any benefit yet. I have provided a test case on another thread: http://devgurus.amd.com/thread/169868

maxdz8 · ‎02-05-2015

1. I don't need them for reduction but rather for pure data shuffling 1:1 across work items.
2. In case I need reductions, I would accept a minor performance loss if that saves VGPRs or LDS.
3. I disagree on the testing methodology.
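
To make the difference between pure shuffling and reductions concrete, here is a minimal host-side C sketch of how a sum over 64 lanes could be expressed purely as lane exchanges (a butterfly pattern). This is an illustration only: `WAVE_SIZE` and `wave_reduce_add` are hypothetical names, and nothing in this thread establishes that the CL 2.0 subgroup functions are compiled this way on GCN.

```c
#include <assert.h>
#include <stdint.h>

#define WAVE_SIZE 64  /* assumed lane count (one GCN wavefront) */

/* Butterfly sum: in each step every lane adds the value of its partner
 * lane (lane XOR offset). After log2(64) = 6 steps, every lane holds
 * the sum of all 64 inputs. */
static void wave_reduce_add(uint32_t v[WAVE_SIZE])
{
    /* The XOR pairing only works for a power-of-two lane count. */
    assert((WAVE_SIZE & (WAVE_SIZE - 1)) == 0);

    for (int offset = WAVE_SIZE / 2; offset > 0; offset /= 2) {
        uint32_t tmp[WAVE_SIZE];
        for (int lane = 0; lane < WAVE_SIZE; ++lane)
            tmp[lane] = v[lane] + v[lane ^ offset];  /* read the partner lane */
        for (int lane = 0; lane < WAVE_SIZE; ++lane)
            v[lane] = tmp[lane];                     /* then all lanes commit */
    }
}
```

In this model the only temporary storage is one extra value per lane per step, which is the kind of trade-off item 2 is about; whether real hardware achieves that saving is a separate question.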

ekondis · ‎02-07-2015

3. I disagree on the testing methodology.

Could you be more specific about what you think?

maxdz8 · ‎02-09-2015

I usually don't trust any measurement taking less than 4-5 ms. On my system, those are normal fluctuations! I assume using profile mode takes care of that...

... Although I'm pretty sure CL profile mode was invented to be used, I don't trust it. Since most APIs have historically been easygoing about profile data, I wouldn't take for granted that the results are coherent...

Besides, I cannot rule out that my app had a small overhead with profiling on.
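
One way to act on the "nothing under 4-5 ms" rule without relying on profile mode is to repeat the work until a single timed batch is long enough, then divide. The following is a minimal C sketch assuming a POSIX system with `clock_gettime`; `avg_ms_per_call` and `busy_work` are hypothetical names, and `busy_work` is only a stand-in for "enqueue the kernel and wait for it to finish".

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Monotonic wall-clock time in milliseconds. */
static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e3 + (double)ts.tv_nsec / 1e6;
}

/* Hypothetical helper: runs work(arg) in batches of 1, 2, 4, ... calls
 * until one batch lasts at least min_batch_ms, then returns the average
 * time per call for that batch. The batch size is reported via calls_out. */
static double avg_ms_per_call(void (*work)(void *), void *arg,
                              double min_batch_ms, unsigned long *calls_out)
{
    assert(work != NULL && min_batch_ms > 0.0);
    for (unsigned long n = 1;; n *= 2) {
        double t0 = now_ms();
        for (unsigned long i = 0; i < n; ++i)
            work(arg);
        double elapsed = now_ms() - t0;
        /* Stop when the batch is long enough (or at a safety cap). */
        if (elapsed >= min_batch_ms || n >= (1UL << 30)) {
            if (calls_out != NULL)
                *calls_out = n;
            return elapsed / (double)n;
        }
    }
}

/* Stand-in workload (an assumption): a real test would launch the kernel
 * and block until completion here. */
static void busy_work(void *arg)
{
    volatile unsigned long *counter = arg;
    for (int i = 0; i < 1000; ++i)
        (*counter)++;
}
```

Calling `avg_ms_per_call(busy_work, &counter, 5.0, &calls)` would time batches of at least 5 ms. This averages out timer granularity and small fluctuations, but it also hides per-launch variance, so it complements profile-mode numbers rather than replacing them.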

While the problem of reductions might be useful for some cases, I don't see value in a kernel doing only this; it's not relevant to me. In general, I've noticed that using advanced functionality makes simple kernels slower (I have been told they take more time to "set up") while usually making complex kernels faster. This kernel is super simple: its only memory access is in writing out the result. There's no other workload, and this severely hampers the HW's ability to do things optimally.

ekondis · ‎02-10-2015

Since the discussion is becoming off-topic, I replied in the relevant thread:

http://devgurus.amd.com/message/1307890#1307890


Are we going to see shuffle in CL?