

nibal
Challenger

Optimization Guide: GCN Channel Conflicts

p 44:

"In this example:


for (ptr=base; ptr<max; ptr += 16KB)
     R0 = *ptr ;


where the lower bits are all the same, the memory requests all access the same
bank on the same channel and are processed serially.
This is a low-performance pattern to be avoided. When the stride is a power of
2 (and larger than the channel interleave), the loop above only accesses one
channel of memory."

I agree with the reasoning, but disagree with the conclusion and the scenario. I think this is exactly what we want in a kernel. The code in the loop should run serially for any given kernel (aside from compiler optimizations that may parallelize instructions), so that parallel kernels have the chance to use different channels by means of a base offset. To that effect, unit strides, mentioned elsewhere on the same page, would be the worst possible scenario.

Also, to my understanding only memory writes can conflict. There is no reason for memory reads to conflict.

Am I missing something?

sandyandr
Adept I

Re: Optimization Guide: GCN Channel Conflicts

They are not talking about different kernels, but about the same kernel, which is split "vertically" into different work-items that run simultaneously. The loop runs serially for each work-item and nothing can change this behaviour (you can only tell the compiler: "make several iterations look as if the same part was written several times"). Anyway, it will run them serially and try to finish the loop at the same time for all work-items (if the number of iterations is the same for each work-item). So, as each work-item does the same job on different data, at any given moment this data shouldn't go to/from the same bank/channel, in order to avoid conflicts. As far as I remember, read conflicts are avoided only under certain conditions for read-only memory (images, constants, etc.), so you'd better try to avoid them as well.

nibal
Challenger

Re: Optimization Guide: GCN Channel Conflicts

Thanks for your fast reply,

I really understand that the work items in a synchronized wavefront are all parallel kernels executing the same instruction step on different data. Let's consider 2 of those work items:

char *ptr;

A)

for (ptr = base; ptr < max; ptr += 16384) // 16384 = 2^14

      R0 = *ptr;

B)

for (ptr = base + 512; ptr < max; ptr += 16384) // 512 = 2^9

     R0 = *ptr;

Bits 8..10 (256 - 1024) are used for different channels.

Case (A) will run always from channel A, since bits 8..10 are the same during the whole iteration.

Case (B) will run always from channel B, since bits 8..10 are the same during the whole iteration.

However, channel B != channel A, since the base offset (512) changes bit 9.

I don't see any conflict.

By the same token, if the stride is 1 instead of 16384, after 256 iterations case (A) will change channel and could conflict.

sandyandr
Adept I

Re: Optimization Guide: GCN Channel Conflicts

> By the same token, if the stride is 1 instead of 16384, after 256 iterations case (A) will change channel and could conflict.

Case (B) will change bit 8 at the same time. Bit 9 should still be opposite to (A).

nibal
Challenger

Re: Optimization Guide: GCN Channel Conflicts

> case (B) will change bit 8 at the same time. Bit 9 should still be opposite to (A).

Not exactly. That depends on the offset value: depending on it, channel B could change earlier or later. The point is that at low strides both stride and offset affect channel switching, and therefore possible conflicts, whereas at high strides the two are decoupled and the offset alone dictates the channel. In case (B), with an offset of 512 and a stride of 1, the work item will change channel with every step 😞

If you know your memory topology, then by using a large stride you can control which channel each work item runs in. Of course, you have to have the memory to support big strides...

sandyandr
Adept I

Re: Optimization Guide: GCN Channel Conflicts

I think you're wrong:

B)

for (ptr = base + 512; ptr < max; ptr++)

     R0 = *ptr;

This work-item will change ptr in each iteration in the same way and at the same moment as work-item (A) increments its ptr (in each iteration the lower bits (8:0) of A's and B's ptrs will be equal). That is OK and can't lead to a channel conflict, since all differences start from bit 9 anyway. The problem is that these small regions will too soon overlap ones already processed by the neighboring work-item. That's why they said 16K, I guess.

nibal
Challenger

Re: Optimization Guide: GCN Channel Conflicts

Let me give you an example:

char *ptr = (char *) 16384;           // or any other high memory with first 10 bits set to 0

A)

for (ptr; ptr < max; ptr++)

          R0 = *ptr;

This will change channel after 256 iterations

B)

for (ptr += 512; ptr < max; ptr++)

     R0 = *ptr;

This will change channel immediately in iterations: +512, +513, +514...+1024

sandyandr
Adept I

Re: Optimization Guide: GCN Channel Conflicts

If channel is defined by bits 10:8, then

A - will change the channel from "000" to "001" in the 256th iteration, to "010" in the 512th iteration, to "011" in the 768th iteration, to "100" in the 1024th iteration and so on.

B - will change the channel from "010" to "011" in the 256th iteration, to "100" in the 512th iteration, to "101" in the 768th iteration, to "110" in the 1024th iteration and so on.

As you can see, bit 8 is the same for A and B all the time. Bit 9 is opposite all the time. Bit 10 can vary. Actually, this rule holds for any starting address (your "char *ptr = (char *) 16384;" is not necessary here).

nibal
Challenger

Re: Optimization Guide: GCN Channel Conflicts

16384 was just an example to illustrate a point. Any memory address could have been used.

You are right, stride 1 doesn't hurt, although it starts changing channels quickly, in a synchronized manner.

Large strides stay put at the initial channels.

Thanks for clarifying the stride 1 case.

Still, I can't see the point the guide makes that large power-of-2 strides are to be avoided...

sandyandr
Adept I

Re: Optimization Guide: GCN Channel Conflicts

My guess is that the Guide is trying to say that the worst scenario is when the bank, channel and lower bits are all the same for all work-items; that's why large power-of-two strides are bad. The best scenario is when adjacent work-items read (or write) adjacent memory addresses, while all channels/banks are equally utilized by CUs, though there will certainly be a lot of bank/channel conflicts. Anyway, you obviously need to select X and Y in "for (ptr = base + X; ptr < max; ptr += Y)" and the work-group size in such a way that you can process the whole amount of data while utilizing all channels/banks uniformly.
