Archives Discussions

nibal · ‎10-12-2015

p 44:

"In this example:

for (ptr=base; ptr<max; ptr += 16KB)
R0 = *ptr ;

where the lower bits are all the same, the memory requests all access the same
bank on the same channel and are processed serially.
This is a low-performance pattern to be avoided. When the stride is a power of
2 (and larger than the channel interleave), the loop above only accesses one
channel of memory."

I agree with the reasoning but disagree with the conclusion and the scenario. I think this is exactly what we want in a kernel. The code in the loop should run serially for any given kernel (aside from compiler optimizations that may parallelize instructions), so that parallel kernels with a base offset have the chance to use different channels. By that logic, unit strides, mentioned elsewhere on the same page, would be the worst possible scenario.

Also, to my understanding, only memory writes can conflict. There is no reason for memory reads to conflict.

Am I missing something?

sandyandr · ‎10-13-2015

They are not talking about different kernels but about the same kernel, which is split "vertically" into work-items that run simultaneously. The loop runs serially within each work-item and nothing can change this behaviour (you can only tell the compiler: "make several iterations look as if the same part was written several times"). In any case, it will run the iterations serially and try to finish the loop at the same time for all work-items (if the number of iterations is the same for each work-item). So, since each work-item does the same job on different data, at any given moment this data shouldn't go to or come from the same bank/channel, in order to avoid conflicts. As I remember, read conflicts are avoided only under certain conditions for read-only memory (images, constants, etc.), so you'd better try to avoid them as well.

nibal · ‎10-13-2015

Thanks for your fast reply,

I really understand that the work items in a synchronized wavefront are all parallel kernels executing the same instruction step on different data. Let's consider 2 of those work items:

char *ptr;

A)

for (ptr = base; ptr < max; ptr += 16384) // 16384 = 2^14

R0 = *ptr;

B)

for (ptr = base + 512; ptr < max; ptr += 16384) // 512 = 2^9

R0 = *ptr;

Bits 8..10 (256 - 1024) are used for different channels.

Case (A) will run always from channel A, since bits 8..10 are the same during the whole iteration.

Case (B) will run always from channel B, since bits 8..10 are the same during the whole iteration.

However, channel B != channel A, since the base offset (512) changes bit 9.

I don't see any conflict.

By the same token, if the stride is 1 instead of 16384, after 256 iterations case (A) will change channel and could conflict.
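The two cases above can be checked with a small host-side sketch. It assumes, as in this post, that address bits 10:8 select one of 8 channels; real GCN channel mappings are more involved. The helper names `channel_of` and `channels_touched` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption taken from the thread: address bits 10:8 select one of
   8 channels. This is a toy model, not the real hardware mapping. */
enum { CHANNEL_SHIFT = 8, CHANNEL_MASK = 0x7 };

static unsigned channel_of(uintptr_t addr)
{
    return (unsigned)((addr >> CHANNEL_SHIFT) & CHANNEL_MASK);
}

/* Walks `steps` accesses starting at `start` with byte stride `stride`
   and returns a bitmask with bit c set for every channel c touched. */
static unsigned channels_touched(uintptr_t start, uintptr_t stride,
                                 unsigned steps)
{
    assert(steps > 0);
    unsigned mask = 0;
    for (unsigned i = 0; i < steps; ++i)
        mask |= 1u << channel_of(start + (uintptr_t)i * stride);
    return mask;
}
```

Under this model, a walk from 16384 with stride 16384 yields the mask 0x1 (channel 0 only), and the same walk from 16384 + 512 yields 0x4 (channel 2 only). A stride-1 walk over 2048 bytes yields 0xFF, i.e. it passes through all eight channels.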

sandyandr · ‎10-13-2015

> By the same token, if the stride is 1 instead of 16384, after 256 iterations case (A) will change channel and could conflict.

case (B) will change bit 8 at the same time. Bit 9 should still be opposite to (A).

nibal · ‎10-13-2015

> case (B) will change bit 8 at the same time. Bit 9 should still be opposite to (A).

Not exactly. That depends on the offset value: depending on it, channel B could change earlier or later. The point is that at low strides both stride and offset affect channel switching, and therefore possible conflicts, whereas at high strides the two are decoupled and the offset alone dictates the channel. In case (B), with an offset of 512 and a stride of 1, the work-item will change channel with every step 😞

If you know your memory topology, a large stride lets you control which channel each work-item runs on. Of course, you have to have the memory to support big strides...

sandyandr · ‎10-13-2015

I think you're wrong:

B)

for (ptr = base + 512; ptr < max; ptr++)

R0 = *ptr;

This work-item will change ptr in each iteration in the same way and at the same moment as work-item (A) increments its ptr (in each iteration the lower bits (8:0) of A's and B's ptrs will be equal). That is OK and can't lead to a channel conflict, since all differences start from bit 9 anyway. The problem is that these small regions will very soon overlap ones already processed by the neighboring work-item - that's why they said 16K, I guess.

nibal · ‎10-13-2015

Let me give you an example:

char *ptr = (char *) 16384; // or any other high memory with first 10 bits set to 0

A)

for (ptr; ptr < max; ptr++)

R0 = *ptr;

This will change channel after 256 iterations

B)

for (ptr += 512; ptr < max; ptr++)

R0 = *ptr;

This will change channel immediately in iterations: +512, +513, +514...+1024

sandyandr · ‎10-13-2015

If channel is defined by bits 10:8, then

A - will change the channel from "000" to "001" in the 256th iteration, to "010" in the 512th iteration, to "011" in the 768th iteration, to "100" in the 1024th iteration and so on.

B - will change the channel from "010" to "011" in the 256th iteration, to "100" in the 512th iteration, to "101" in the 768th iteration, to "110" in the 1024th iteration and so on.

As you can see, bit 8 is the same for A and B all the time, bit 9 is opposite all the time, and bit 10 can vary. Actually, this rule holds for any starting address (your "char *ptr = (char *) 16384;" is not necessary here).
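The bit-8/bit-9 rule can be verified by brute force. This is a minimal sketch that again assumes bits 10:8 of the byte address pick the channel; `channel_bits` and `bit9_always_opposite` are hypothetical helper names. Work-items A and B advance in lock-step with stride 1, B starting 512 bytes after A.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption from the thread: bits 10:8 of the byte address pick the
   channel. Real hardware mappings differ. */
static unsigned channel_bits(uintptr_t addr)
{
    return (unsigned)((addr >> 8) & 0x7);
}

/* Returns 1 if, in every one of `iters` lock-step iterations, work-item A
   (starting at base) and work-item B (starting at base + 512) agree on
   address bit 8 and differ on address bit 9. Returns 0 otherwise. */
static int bit9_always_opposite(uintptr_t base, unsigned iters)
{
    assert(iters > 0);
    for (unsigned i = 0; i < iters; ++i) {
        uintptr_t a = base + i;
        uintptr_t b = base + 512 + i;
        /* Bit 0 of diff is address bit 8, bit 1 of diff is address bit 9. */
        unsigned diff = channel_bits(a) ^ channel_bits(b);
        if ((diff & 0x1) != 0 || (diff & 0x2) == 0)
            return 0;
    }
    return 1;
}
```

The check passes for any base, aligned or not, because adding 512 always flips bit 9 and leaves bits 8:0 untouched. In this model the two work-items therefore never land on the same channel in the same iteration.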

nibal · ‎10-13-2015

16384 was just an example to illustrate a point. Any address could have been used.

You are right, stride 1 doesn't hurt, although the work-items start changing channels quickly, in a synchronized manner.

Large strides stay put on their initial channels.

Thanks for clarifying the stride 1 case.

Still, I can't see the point of the guide's advice that large power-of-2 strides are to be avoided...

sandyandr · ‎10-13-2015

My guess is that the Guide is trying to say that the worst scenario is when the bank and channel bits, plus the lower bits, are the same for all work-items - that's why a large power of two is a bad stride. The best scenario is when adjacent work-items read (or write) adjacent memory addresses while all channels/banks are equally utilized by the CUs, though there will certainly still be a lot of bank/channel conflicts. In any case, you need to select X and Y in "for (ptr = base + X; ptr < max; ptr += Y)" and the work-group size in such a way that you can process the whole amount of data while utilizing all channels/banks uniformly.

nibal · ‎10-13-2015

As I said in my initial post, I understand what the guide is saying, but disagree with its example and conclusion. Large strides, powers of 2, are perfectly fine and even "neater" than single strides.

sandyandr · ‎10-14-2015

"The important concept is memory stride: the increment in memory address, measured in elements, between successive elements fetched or stored by consecutive work-items in a kernel."

for (ptr=base; ptr<max; ptr += 16KB)

R0 = *ptr ;

Maybe I don't understand something here, but I don't know why they call 16KB a "stride" - it's an increment, sure, but only for the next loop iteration, not between "consecutive work-items". Anyway, for me the main rule is this: at each moment, the work-items in each work-group should access adjacent addresses (sometimes from a single channel - that's OK), while different work-groups access different channels (and it is much better if these work-groups occupy different CUs, of course).

nibal · ‎10-14-2015

> I don't know why they call 16KB here as a "stride" - indeed, it's an increment, sure, but just for the next cycle - not between "consecutive work-items". May be I don't understand something here right.

This is indeed called the "stride" or "step" of the loop. The different bases between work-items in loops are called "offset".

> Anyway, as for me, the main rule is this: in each moment work-items (in each work group) should access adjacent addresses (from a single channel sometimes - it's OK and not a conflict), while different workgroups access different channels (and it would be much better if these workgroups occupy different CUs, of course).

Work-items are part of wavefronts, which act in a synchronous mode. In each cycle, 1/4 of a wavefront checks for memory. Adjacent addresses, especially from a single channel and from the same wavefront, are a recipe for conflicts. The wavefront and all of its work-items will stall until all conflicts are resolved. Of course, sometimes it cannot be avoided 😞

sandyandr · ‎10-14-2015

> Work-items are part of wavefronts, which act in a synchronous mode. In each cycle, 1/4 of a wavefront checks for memory. Adjacent addresses, especially from a single channel, are a recipe for conflicts. The wavefront and all of its work-items will stall until all conflicts are resolved. Of course, sometimes it cannot be avoided 😞

I can't agree here. Each cycle reads/writes a lot of bytes at once - they all have to be processed. By the way, if each work-item in a wavefront accesses a different channel, there will be conflicts anyway: there are only 12 channels (on the 7970).

Guide: "An inefficient access pattern is if each wavefront accesses all the channels. This is likely to happen if consecutive work-items access data that has a large power of two strides."
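How the stride between consecutive work-items spreads one wavefront over channels can be illustrated with a toy model. The sketch below assumes a 64-wide wavefront and the simplified 8-channel mapping used earlier in this thread (address bits 10:8), not the 12 channels of the 7970. The function name `distinct_channels` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: 64 work-items per wavefront, 8 channels selected by address
   bits 10:8. Both numbers are assumptions for illustration only. */
enum { WAVEFRONT_SIZE = 64, NUM_CHANNELS = 8 };

/* Counts how many of the modelled channels one wavefront touches when
   work-item i reads the byte at base + i * stride_bytes. */
static unsigned distinct_channels(uintptr_t base, uintptr_t stride_bytes)
{
    assert(stride_bytes > 0);
    unsigned seen[NUM_CHANNELS] = {0};
    unsigned count = 0;
    for (unsigned i = 0; i < WAVEFRONT_SIZE; ++i) {
        uintptr_t addr = base + (uintptr_t)i * stride_bytes;
        unsigned ch = (unsigned)((addr >> 8) & (NUM_CHANNELS - 1));
        if (!seen[ch]) {
            seen[ch] = 1;
            ++count;
        }
    }
    return count;
}
```

In this model a 4-byte stride keeps the whole wavefront inside one 256-byte channel window, a 256-byte stride spreads it over all 8 channels, and a 16384-byte stride sends every work-item to the same channel. Which pattern costs most on real hardware depends on the actual channel and bank mapping, which this sketch does not capture.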

nibal · ‎10-14-2015

> I can't agree here. Each cycle reads/writes a lot of bytes at once - they all should be processed. By the way, if each of your work-items in a wavefront accesses different channel there will be conflicts anyway. There are 12 channels only (as for 7970).

Read the guide

Optimization Guide: GCN Channel Conflicts