p. 44:
"In this example:
for (ptr=base; ptr<max; ptr += 16KB)
    R0 = *ptr ;
where the lower bits are all the same, the memory requests all access the same
bank on the same channel and are processed serially.
This is a low-performance pattern to be avoided. When the stride is a power of
2 (and larger than the channel interleave), the loop above only accesses one
channel of memory."
I agree with the reasoning, but disagree with the conclusion and the scenario. I think this is exactly
what we want in a kernel. The code in the loop should run serially within any given kernel (aside from
compiler optimizations, which may parallelize instructions), so that parallel kernels, each with a different
base offset, have the chance to use different channels. By that logic, unit strides, mentioned elsewhere on the same
page, would be the worst possible scenario.
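To make the scenario concrete, here is a rough sketch of how a byte address might map to a channel. The layout is an assumption for illustration only: a 256-byte channel interleave and 8 channels, with the channel taken from the address bits just above the interleave. The names channel_of and channels_touched are hypothetical helpers, not part of any API, and real hardware may map addresses differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout, not taken from any specific GPU. */
enum {
    CHANNEL_INTERLEAVE = 256, /* bytes served by one channel before switching */
    NUM_CHANNELS       = 8    /* number of memory channels (power of two) */
};

/* Channel that serves a byte address under the assumed layout. */
static unsigned channel_of(uint64_t addr)
{
    return (unsigned)((addr / CHANNEL_INTERLEAVE) % NUM_CHANNELS);
}

/* Number of distinct channels touched by `count` loads that start at
 * `base` and advance by `stride` bytes each time. */
static unsigned channels_touched(uint64_t base, uint64_t stride, size_t count)
{
    unsigned seen = 0; /* bitmask with one bit per channel */
    unsigned n = 0;

    assert(stride > 0);
    for (size_t i = 0; i < count; ++i)
        seen |= 1u << channel_of(base + i * stride);
    for (unsigned c = 0; c < NUM_CHANNELS; ++c)
        n += (seen >> c) & 1u;
    return n;
}
```

Under these assumed values, channels_touched(0, 16 * 1024, 64) is 1, because every address in the 16KB-stride loop lands on channel 0. The same loop started at base 256 also touches a single channel, but it is channel 1, which is the base-offset case described above.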
Also, to my understanding, only memory writes can conflict; I see no reason for memory reads to do so.
Am I missing something?