Archives Discussions

knightast · 03-12-2015

Hi, all

I'm using a Kaveri A10-7850K. We know each compute unit has 10 wavefronts and switches among them when a stall happens.

I want to know more about it. When a wavefront switches, is the local memory swapped out? Where does it go? Or is there private local memory for each wavefront?

Thanks in advance.

Best,

Heng

acekiller · 03-13-2015

The local memory resides in a compute unit. Each work-group is assigned to a specific compute unit for execution, so all work-items within that work-group see the local memory of that compute unit the whole time. When you launch the NDRange, the resources (registers, local memory) must be sufficient for your work size configuration, otherwise the launch will fail. This means that once your kernel has launched, each wavefront has been assigned a bank of registers belonging only to that wavefront. Context switching or resuming therefore only involves choosing another bank selector, unlike on CPUs, where registers have to be saved to or restored from RAM. That's why the context-switch cost is negligible on GPUs.

The above is based on my understanding; I hope others can verify it.

maxdz8 · 03-13-2015

It's more or less correct. Note, for example, that you can exceed the available VGPRs and still end up with something runnable, because the driver spills the excess to VRAM.
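
As a rough illustration of the register side of this, here is a sketch of how per-work-item VGPR use limits the number of resident wavefronts. It assumes 256 VGPRs per work-item are available on each SIMD and at most 10 wavefronts per SIMD; neither constant is stated in this thread, the function name is hypothetical, and allocation granularity is ignored.

```python
# Assumed GCN figures (not stated in the thread): each SIMD can give a
# work-item up to 256 VGPRs, and holds at most 10 resident wavefronts.
VGPRS_PER_SIMD_LANE = 256
MAX_WAVES_PER_SIMD = 10


def waves_per_simd(vgprs_per_work_item: int) -> int:
    """Wavefronts whose registers fit in one SIMD's register file at once."""
    if vgprs_per_work_item <= 0:
        raise ValueError("a kernel uses at least one VGPR")
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_SIMD_LANE // vgprs_per_work_item)


for vgprs in (24, 64, 128):
    print(vgprs, "VGPRs ->", waves_per_simd(vgprs), "wavefronts per SIMD")
```

Under these assumptions, a kernel that needs more registers simply leaves fewer wavefronts resident to switch between; nothing is swapped out on a switch.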

maxdz8 · 03-13-2015

knightast wrote:

I want to know more about it. When a wavefront switches, is the local memory swapped out?

No. LDS is a bank of memory which is "sliced up" to fit as many wavefronts as possible (up to 10, but usually fewer).

In that regard, LDS behaves like registers, as noted by acekiller.

Note the basic allocation unit of LDS is the workgroup, not the wavefront. You can consider LDS contents to be trashed when a workgroup terminates execution. Because a workgroup can comprise multiple wavefronts, LDS cannot be considered private to a wavefront.
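
The "slicing" can be sketched as simple division. This assumes a 64KiB LDS pool per compute unit, ignores allocation granularity and every other resource limit, and uses a hypothetical function name; it is an illustration, not the hardware's actual allocator.

```python
# Assumption: one compute unit owns a single 64 KiB LDS pool that is
# divided among the workgroups resident on it.
LDS_POOL_BYTES = 64 * 1024


def workgroups_fitting_in_lds(lds_bytes_per_workgroup: int) -> int:
    """How many workgroups' LDS slices fit in the pool at the same time."""
    if lds_bytes_per_workgroup <= 0:
        raise ValueError("only meaningful for kernels that use LDS")
    return LDS_POOL_BYTES // lds_bytes_per_workgroup


for kib in (4, 8, 32):
    count = workgroups_fitting_in_lds(kib * 1024)
    print(f"{kib} KiB per workgroup -> {count} workgroups resident")
```

The more LDS each workgroup asks for, the fewer workgroups can be resident at once, which is why nothing needs to be swapped out when the hardware switches wavefronts.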

knightast · 03-13-2015

Thanks for maxdz8's reply.

If I understand correctly, suppose there are 10 wavefronts in each compute unit and 64KB of LDS for each wavefront; then we get 640KB of LDS in total.

Another question after reading your reply: can one workgroup be scheduled on more than one wavefront?

maxdz8 · 03-14-2015

knightast wrote:

If I understand correctly, suppose there are 10 wavefronts in each compute unit and 64KB of LDS for each wavefront; then we get 640KB of LDS in total.

No! It's the other way around! The hardware gives you a pool of LDS, say 64KiB. If every wavefront consumes 64KiB then you can only run 1! LDS is a hardware construct. It's not like normal memory.

Note you cannot actually allocate 64KiB of LDS. In practice you'll want to stay below 8KiB, in my experience.

Your terminology is inaccurate. An AMD GCN compute unit (CU) is 4 SIMD lanes (strictly, 4 SIMD units of 16 lanes each). Wavefronts get scheduled to the SIMD lanes; each can hold 10, so there's a total of 40 wavefronts at CU level.

knightast wrote:

Another question after reading your reply: can one workgroup be scheduled on more than one wavefront?

Not quite, because workgroups are not "scheduled to wavefronts". The wavefront is the "hardware level" workgroup. When you use a workgroup bigger than the wavefront, the device will build it out of multiple wavefronts. Wavefronts are scheduled to the SIMD lanes of a CU and processed independently of each other.

In other words, a workgroup might be scheduled to multiple SIMD lanes (= wavefront processors). Maybe that was your question.

It is my understanding that once a workgroup gets scheduled to a SIMD it stays there until completed.
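
The numbers above can be put together in a short sketch. It takes the 4 SIMDs per CU and 10 wavefronts per SIMD from this post and assumes the GCN wavefront width of 64 work-items; the function name is hypothetical.

```python
import math

WAVEFRONT_SIZE = 64      # work-items per wavefront (assumed GCN width)
SIMDS_PER_CU = 4         # from the post above
WAVES_PER_SIMD = 10      # from the post above


def wavefronts_per_workgroup(workgroup_size: int) -> int:
    """Number of wavefronts the device builds one workgroup out of."""
    if workgroup_size <= 0:
        raise ValueError("workgroup size must be positive")
    return math.ceil(workgroup_size / WAVEFRONT_SIZE)


print("wavefront slots per CU:", SIMDS_PER_CU * WAVES_PER_SIMD)
for size in (64, 128, 256):
    print(size, "work-items ->", wavefronts_per_workgroup(size), "wavefronts")
```

A 256-item workgroup is therefore four wavefronts, which is how a single workgroup can end up spread over several SIMDs of one CU.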

knightast · 03-15-2015

Thanks for your reply and for the corrections; the LDS is clear to me now.

Regarding SIMD lanes, I still have some questions.

1. A workgroup cannot be scheduled across CUs, so a workgroup is scheduled to at most 4 SIMD lanes. Is this right?

2. Suppose the workgroup size is small, for example 32, and the number of workgroups is large, for example 80, and we use only one CU. How will they be scheduled to SIMD lanes? Since each SIMD lane has 64 threads, will two workgroups be scheduled to one SIMD lane?

maxdz8 · 03-16-2015

knightast wrote:

1. A workgroup cannot be scheduled across CUs, so a workgroup is scheduled to at most 4 SIMD lanes. Is this right?

Yes, this is correct given the current architecture.

knightast wrote:

2. Suppose the workgroup size is small, for example 32, and the number of workgroups is large, for example 80, and we use only one CU. How will they be scheduled to SIMD lanes? Since each SIMD lane has 64 threads, will two workgroups be scheduled to one SIMD lane?

When using workgroups smaller than a wavefront you still burn a full wavefront.

Work-items from different workgroups are not merged (whatever that could mean). Having 80 groups of 32 work-items is almost the same as having 80 groups of 64: both generate 80 wavefronts, but in the former case only the first half of each wavefront does useful work.

In other words, 1 workgroup is always at least 1 wavefront.

So basically that's all wrong.
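
The 80-groups-of-32 case works out as follows. This is a sketch assuming a 64-wide wavefront, with hypothetical function names; it only counts wavefronts and active work-items and says nothing about how the hardware orders them.

```python
import math

WAVEFRONT_SIZE = 64  # work-items per wavefront (assumed GCN width)


def total_wavefronts(num_workgroups: int, workgroup_size: int) -> int:
    """Each workgroup is rounded up to whole wavefronts; groups never merge."""
    return num_workgroups * math.ceil(workgroup_size / WAVEFRONT_SIZE)


def wavefront_utilization(workgroup_size: int) -> float:
    """Fraction of the allocated wavefront width that does useful work."""
    waves = math.ceil(workgroup_size / WAVEFRONT_SIZE)
    return workgroup_size / (waves * WAVEFRONT_SIZE)


for size in (32, 64):
    print(f"80 groups of {size}: {total_wavefronts(80, size)} wavefronts, "
          f"{wavefront_utilization(size):.0%} of each wavefront in use")
```

Both cases launch 80 wavefronts; with 32 work-items per group, half of every wavefront sits idle.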

knightast · 03-16-2015

Got it, thanks for all your replies.

Local memory for wavefront switch