To better understand your problem, can you share the kernel that has the 12% bank stall? If possible, please also share the host and solution files.
Thanks & Regards,
Hello Suresh, I apologize for the late reply.
I'm not sure I can share the kernel code right now; I'm certain I cannot share the host code. Both are part of performance enhancements to a widely used program that I plan to improve for personal promotion, and releasing the code before the build release would put the effectiveness of this whole effort at risk.
What I can say is that it's a 16-way parallel implementation of the NIST SHA-3 candidate ECHO, in the reduced form used by cryptocurrency miners. I developed it to update my skill set to more modern GPU programming methodologies.
ECHO is based on the Rijndael round, which is often implemented as four lookups into constant LUTs. The reference implementation already generates ~1.6% bank stall. My implementation profiles about 7x faster, so a 12% stall is no surprise there (1.6 × 7 = 11.2).
I honestly believe that specific kernel is performing more or less as expected, and I've since moved on to the next ones.
I have written all the other kernels assuming collisions on WI 0-31 and 32-63, and so far everything has gone smoothly; however, my confusion about those statements remains.
Can a LDS bank service a request per clock or not? If not, why?
I am still confused about the hardware examining "requests generated over two cycles (32 work-items of execution) for bank conflicts" even though we must "ensure, as much as possible, that the memory requests generated from a quarter-wavefront avoid bank conflicts". The former statement seems to suggest that using LDS is somehow a 2-clock operation, while my understanding of the latter is just the opposite. It would seem to me that WI 16-31 could still access banks 0-15 even though they were accessed by WI 0-15, without causing a conflict... or perhaps they do cause a conflict but not a stall?
I am also a bit confused about the "broadcast" feature. The wording seems to suggest this is effectively a multicast rather than a pure broadcast. Is that correct?
A single 64-bit request by a wavefront is treated as two 32-bit bank accesses per work-item. As the doc says, the accesses of two quarter-wavefronts are considered over two cycles. Each cycle, half of the 64 accesses from the two quarter-waves are presented to the LDS. It cannot be assumed that the 32 requests presented in a cycle all come from the same quarter-wave. The doc could really be improved by stating that bank conflicts among all 64 possible bank accesses made by two quarter-waves should be minimized.
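As a sanity check of that accounting, here is a small model of my own (a sketch, not vendor documentation; it assumes the usual GCN mapping of 32 banks of 4-byte words, so the bank pattern repeats every 128 bytes):

```python
BANKS = 32  # assumed GCN LDS layout: 32 banks, each 4 bytes wide

def bank(addr):
    # Bank of a 4-byte-aligned LDS byte address.
    return (addr // 4) % BANKS

def split_b64(addrs64):
    # Each 64-bit request becomes two 32-bit bank accesses per work-item.
    out = []
    for a in addrs64:
        out += [a, a + 4]
    return out

# A quarter-wavefront (16 work-items) each loading one 64-bit value at stride 8:
quarter = [wi * 8 for wi in range(16)]
accesses = split_b64(quarter)             # 32 bank accesses for 16 work-items
print(sorted(bank(a) for a in accesses))  # banks 0..31, each hit exactly once
```

Under this model, a contiguous 64-bit load by one quarter-wave already fills all 32 banks for a cycle, which is why the doc counts 64 possible bank accesses over the two cycles of two quarter-waves.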
I think I sort of understand what's going on here. I would really appreciate it if the paragraph could be reconsidered. Maybe some pictures would help.
If I understand correctly, the bank itself can provide a 32-bit value each clock, but since the arbiter is somehow half-rate, the output of the bank can be routed to only a single SIMD lane register (if not broadcast); so the second request could "in theory" access the bank, but in practice the result would have to go somewhere else, so at the end of the day it must stall.
In my mind, I'm picturing something like a 32-way crossbar switching at half rate... I guess that somehow makes sense. I hope it's also correct.
Thank you very much.
Here's exactly how it works; maybe it helps you understand it and apply it to your problem:
1. There is a ds instruction in the instruction stream that tells the LDS unit to start fetching data in the BACKGROUND. The LDS unit has a queue for this; there can be many requests in it.
2. Meanwhile, the vector SIMD units can process other vector instructions while the LDS unit is busy.
3. Then comes another instruction that queries the queue of the LDS unit. If the data is not ready, it will WAIT until it is.
4. At this point the LDS data is in the destination register(s), so the vector unit can work on it.
So with clever optimization you can insert many vector/scalar instructions between steps 1 and 3, so the vector ALU can keep working while the LDS unit is busy. This hides LDS latency.
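The four steps above might look like this in GCN assembly (a hand-written illustration of the pattern, not compiler output; the register choices are arbitrary):

```
ds_read_b32  v0, v1           ; 1. tell the LDS unit to start fetching (background)
v_mul_f32    v2, v3, v4       ; 2. independent VALU work overlaps the LDS fetch
v_add_f32    v5, v2, v3
s_waitcnt    lgkmcnt(0)       ; 3. wait here until the LDS result has landed in v0
v_add_f32    v6, v0, v5       ; 4. the fetched value can now be consumed
```

The more independent instructions sit between the ds_read and the s_waitcnt, the more of the LDS latency is hidden.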
But when your bottleneck becomes the lds unit itself, then you have to start thinking about bank conflicts:
All you have to know is that every bank handles a particular set of addresses (4-byte aligned!):
bankId*4 + 80h*N  // N can be any integer; it selects the row within the memory module attached to the bank.
If two work-items in a work-group access addresses of the above form where the bankId values are the same but the N values differ, it will cost an extra clock until that bank can handle both requests.
Some examples for 2 work-items:
00h, 80h -> bank 0 reads from 00h and then from 80h: 2 clocks
00h, 00h -> bank 0 reads once and broadcasts: 1 clock
00h, 84h -> bank 0 and bank 1 read in parallel: 1 clock
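The three examples can be checked with a tiny cost model (my own sketch, assuming one unique 32-bit word per bank per clock, with identical addresses broadcast for free):

```python
def lds_clocks(addrs, banks=32):
    # Clocks needed for a set of 32-bit LDS reads under a simple model:
    # each bank serves one unique word per clock; duplicate addresses
    # that hit the same bank are broadcast and count only once.
    per_bank = {}
    for a in addrs:
        per_bank.setdefault((a // 4) % banks, set()).add(a // 4)
    return max(len(words) for words in per_bank.values())

print(lds_clocks([0x00, 0x80]))  # 2: bank 0, two different rows
print(lds_clocks([0x00, 0x00]))  # 1: bank 0 broadcasts one word
print(lds_clocks([0x00, 0x84]))  # 1: banks 0 and 1 work in parallel
```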
(And let me mention another method of reading the LDS: LDS_DIRECT. It's an input operand that can be used in every vector instruction. It reads a single 32-bit value (maybe 64-bit too, I'm not sure) from the LDS and broadcasts it to all the work-items as a parameter to the current vector instruction. The address resides in a scalar register called m0.)
Thank you realhet, I totally forgot that LDS is asynchronous.
Can you say more about the latency?
I picture the LDS "by column", with the requests taking whole columns. What I do is consider the columns/banks busy for two clocks -> 32 WIs.
From what I've seen, this implies my kernels have no conflicts, so I assume the latency is at worst a couple of clocks.
I think... my perception of how it can be used isn't considerably different from where I started, except perhaps for 64-bit values, which I don't really use.
I suggest scrapping the whole discussion of wavefronts here.
By the time LDS is introduced, it is already clear that GCN's 64-way SIMD is in fact a 16-way SIMD pipelined over 4 clocks.
I think there should be a term (perhaps "slice"?) for the part of a wavefront handled in one clock; the wording would come out much simpler.