In chapter 5.2, Local Memory (LDS) Optimizations, of the AMD APP programming guide, I read:
Bank conflicts are determined by what addresses are accessed on each half wavefront boundary. Threads 0 through 31 are checked for conflicts as are threads 32 through 63 within a wavefront.
In a single cycle, local memory can service a request for each bank
The LDS hardware examines the requests generated over two cycles (32 work-items of execution) for bank conflicts. Ensure, as much as possible, that the memory requests generated from a quarter-wavefront avoid bank conflicts by using unique address bits 6:2
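To make the "address bits 6:2" part concrete, here is a minimal sketch of how I read the bank mapping. It assumes 32 banks of 4 bytes each (5 address bits select the bank, the low 2 bits select the byte within a dword); the names `NUM_BANKS` and `bank_of` are mine, not from the guide.

```python
NUM_BANKS = 32  # assumption: bits 6:2 are 5 bits, so 32 banks, 4 bytes wide


def bank_of(byte_address):
    """Return the bank index, taken from address bits 6:2."""
    # Drop the 2 byte-offset bits, then keep the next 5 bits.
    return (byte_address >> 2) & (NUM_BANKS - 1)


# Consecutive 32-bit words land in consecutive banks and wrap every 128 bytes.
banks = [bank_of(4 * i) for i in range(34)]
print(banks)
```

Under this reading, two addresses collide on a bank whenever they differ by a multiple of 128 bytes.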
The guide's example of a 64-bit access pattern seems to agree that this is an optimal pattern. Yet 64 bits per work-item (WI) span 2 banks, so the 32 banks implied by address bits 6:2 cover only 16 WIs, which is exactly one clock of work for the SIMD but only one quarter of a wavefront. The document is very clear in warning about quarter-wavefront conflicts, but why does it use this wording if conflicts are checked over work-items 0-31 and 32-63?
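This is how I am counting it, as a rough sketch rather than a model of the real hardware. It assumes 32 banks selected by address bits 6:2 and that work-item `i` reads 8 consecutive bytes starting at byte `8 * i`; the helper `max_hits_per_bank` is a hypothetical name of mine.

```python
from collections import Counter

NUM_BANKS = 32  # assumption: 32 banks selected by address bits 6:2


def max_hits_per_bank(first_wi, group_size, bytes_per_wi=8):
    """Count dword requests per bank for a group of work-items, return the worst bank."""
    hits = Counter()
    for wi in range(first_wi, first_wi + group_size):
        base = wi * bytes_per_wi
        # Each work-item touches bytes_per_wi / 4 consecutive dwords.
        for offset in range(0, bytes_per_wi, 4):
            hits[((base + offset) >> 2) & (NUM_BANKS - 1)] += 1
    return max(hits.values())


quarter = max_hits_per_bank(0, 16)  # work-items 0-15
half = max_hits_per_bank(0, 32)     # work-items 0-31
print(quarter, half)  # prints: 1 2
```

So a quarter-wavefront hits every bank exactly once, while a half-wavefront hits every bank twice, which is 64 dword requests against 32 banks.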
I have trouble understanding the half-wavefront conflict rule. If each bank services one request per clock, why are conflicts checked among work-items 0-31 and 32-63 instead of 0-15, 16-31, 32-47, and 48-63?
I suppose this would make sense if the LDS had a 1-clock latency, but that doesn't seem to be the case from what I have read.
Can someone explain what is going on here?
I have a kernel that, surprisingly, shows almost 12% bank stalls, so I guess it's time for me to understand the LDS completely.