
binghy
Adept II

Work groups per compute unit

Hi everybody,

I would like to ask you about the number of work-groups per compute unit.

I have read many times the sentence: "A processing resource capable of supporting a work-group is called a compute unit. Each work-group executes on a single compute unit, and each compute unit executes only one work-group at a time." So there is absolutely no concurrency of work-groups on the same compute unit at the same time, and nothing is said about concurrency of work-groups across different compute units, right? They could execute concurrently, or they could not.

Then, searching over the forum for explanations, I found this discussion: Re: How do I get the number of work groups?

Here it is written once that "a tahiti card with 32 cores can have 8 workgroups per core due to barrier resources", and later that "So, thinking of Tahiti: 1) You can only have up to 8 workgroups per CU". What does this mean? Is it in contrast with the sentence I quoted before? Or is this a new capability of GCN cards, due to the presence of Asynchronous Compute Engines? (Even if I think that ACEs are just related to different tasks/kernels, or to the same task whose input buffer has been split among the hardware queues, executed concurrently on the device.)

Moreover, I would like to ask you something more about what I read in the previous post.

[..]

So, thinking of Tahiti:

1) You can only have up to 8 workgroups per CU

2) There is 64kBytes of LDS per CU. If each workgroup uses 16kBytes you may have up to 4 workgroups on the CU.

3) There are 256 registers per SIMD unit (1/4 CU). If each wavefront uses 16 registers you may have up to 16 wavefronts on the SIMD unit, or 64 for the CU.

4) There are 32 CUs so you can multiply the per-CU numbers up accordingly.

[...]

1) "If each workgroup uses 16kBytes you may have up to 4 workgroups on the CU"

     a) Does it mean concurrently or scheduled serially?

     b) If scheduled serially, after the 4 groups are processed, is the memory completely freed to allow processing of other groups?

     c) To have up to 8 work-groups per CU, does it mean that a single WG should use at most 8192 bytes of memory? (65536 bytes of LDS / 8)

2) I don't understand how to map registers to the rest. I mean, according to the AMD Accelerated Parallel Processing guide, GCN devices have 4 vector units (SIMD units), each with 16 processing elements, and each processing element maps to one ALU (physically) or one work-item (from the software point of view).

     a) So what does "256 registers per SIMD unit (1/4 CU)" mean?

     b) Are registers mapped to wavefronts ("If each wavefront uses 16 registers you may have up to 16 wavefronts on the SIMD unit")?

     c) From the sentence above, it seems that there would be 64 wavefronts for the CU. I got lost with the numbers and element mapping. Is 64 the number of work-items contained in a single wavefront?

In the end, just a suggestion about the AMD APP guide. I read the 2012 version in the past and it was completely messy in its description of the hardware elements. Now I have downloaded the 2014 version, which I'm still reading. It is better organized and a bit clearer! 😃 Even so, in Appendix D, about pre-GCN devices, there are still sentences like this: "Processing elements, in turn, contain numerous processing elements", which makes no sense. Please, could you write in the near future a very clear, complete and understandable review/topic on the hardware elements and their mapping?

Moreover, I only work with a laptop graphics card. While desktop GPUs are well documented, this is not true for laptop GPUs. I had a lot of difficulty (cross-referencing information across different web sources) working out that my current laptop GPU (AMD Radeon R9 M290X) is, first of all, equivalent to the AMD Radeon HD 8970M, and that it is part of the "Solar System" family (Neptune architecture), which is in turn the mobility version of the "Sea Islands" family of desktop GPUs. I hope my deductions are right, otherwise I'll go crazy once again. Please, could you write a clear review of the laptop/desktop GPU families/architectures, at least highlighting which family each graphics card maps to?

Thank you in advance.

Hope to hear from you soon so I can clear up my doubts!

Marco

0 Likes
1 Solution

4 Replies
maxdz8
Elite

As far as I can tell, the statements you read in the linked discussion are not about the GPU by itself but rather about the GPU architecture when running the specific kernel being discussed. See "Where are you getting these maximum workgroup numbers?" / "These numbers I get while running kernel analyzer".

See section 6.6.2 of the AMD APP guide:


Resource limits on active wavefronts


AMD GPUs have two important global resource constraints that limit the number of in-flight wavefronts:


• Southern Islands devices support a maximum of 16 work-groups per CU if a work-group is larger than one wavefront.


• The maximum number of wavefronts that can be scheduled to a CU is 40, or 10 per Vector Unit.


It works this way: the scheduler will try to pipeline multiple instances of the same kernel on the same SIMD lane.

So, if the limiting factor is VGPR=84 you will get three kernel instances running on each SIMD lane, for that kernel. The number of wavefronts ("hardware-level workgroups") fitting the limiting resource is related to "occupancy", which is often expressed as a percentage of the maximum (I prefer to talk about it in terms of wavefronts scheduled instead).
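As a side note, the arithmetic behind that "three instances" figure can be sketched as follows. This is only a back-of-the-envelope estimate assuming the numbers quoted in this thread (256 VGPRs per SIMD lane, a hardware cap of 10 wavefronts per SIMD, 4 SIMDs per CU); nothing here is queried from the device.

    /* vgpr_occupancy.c - rough VGPR-limited occupancy estimate for one GCN SIMD. */
    #include <stdio.h>

    int main(void)
    {
        const int vgprs_per_lane    = 256; /* addressable VGPRs per work-item on a SIMD */
        const int hw_waves_per_simd = 10;  /* hardware scheduling limit per SIMD */
        const int kernel_vgprs      = 84;  /* example kernel from this discussion */

        int vgpr_limited = vgprs_per_lane / kernel_vgprs;  /* 256 / 84 = 3 */
        int waves = vgpr_limited < hw_waves_per_simd ? vgpr_limited : hw_waves_per_simd;

        printf("wavefronts per SIMD: %d (x4 SIMDs = %d per CU)\n", waves, waves * 4);
        return 0;
    }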

If the limiting factor is LDS, the whole thing is more complicated - somehow, the CodeXL analyzer reports twice the occupancy on GCN1.1 and 1.2.

The wavefronts (parts of workgroups) will get scheduled to each CU and to each SIMD lane. At each moment there is only 1 wavefront (and thus workgroup) active for each SIMD and 1-4 WGs per CU as a WG can take up to 4 wavefronts.

To have some chance at hiding memory latency, you need to keep at least one wavefront ready to go at SIMD level besides the one you're mangling so that's your 8 workgroups (assuming 1 WG == 1 wavefront).

The workgroup is the most fine-grained level of synchronization so if multiple workgroups can be scheduled to a SIMD lane (common when 1 workgroup = 1 wavefront) there will be multiple workgroups active at device level (even for a single SIMD). This is because the sequential mangling happens only at CU level. So they are sequential from a SIMD lane / CU point of view...

... but they are effectively concurrent at device level, as the device "thinks" in longer time terms in some sense and all those workgroups are potentially active. They will be activated in some order depending on memory access. So, for (1a), the answer depends on context.


binghy wrote:



1) "If each workgroup uses 16kBytes you may have up to 4 workgroups on the CU"


     a) Does it mean concurrently or scheduled serially?


     b) If scheduled serially, after the 4 groups are processed, is the memory completely freed to allow processing of other groups?


     c) To have up to 8 work-groups per CU, does it mean that a single WG should use at most 8192 bytes of memory? (65536 bytes of LDS / 8)



(1b)

The resources are statically allocated per-workgroup. They are gone as soon as the kernel exits.

(1c)

Max LDS allocation is 32 KiB in CL (it seems you can use 64 KiB in assembly programs); it still adds up to 8 WG/CU.
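If it helps, the LDS side of this can be queried from the runtime instead of hard-coded. A minimal host-side sketch (the device and kernel objects, and all error checking, are assumed to exist elsewhere; note that CL_DEVICE_LOCAL_MEM_SIZE is the per-workgroup local memory limit the runtime exposes, 32 KiB on GCN, while the physical LDS per CU is the 64 KiB discussed above):

    /* lds_budget.c - query the kernel's LDS usage and estimate how many
       workgroups fit per CU from the LDS constraint alone. */
    #include <CL/cl.h>
    #include <stdio.h>

    void print_lds_budget(cl_device_id device, cl_kernel kernel)
    {
        const cl_ulong physical_lds_per_cu = 64 * 1024; /* GCN figure from this thread */
        cl_ulong cl_local_limit = 0, lds_per_wg = 0;

        clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                        sizeof(cl_local_limit), &cl_local_limit, NULL);
        clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                                 sizeof(lds_per_wg), &lds_per_wg, NULL);

        printf("CL local memory limit per WG: %llu bytes\n",
               (unsigned long long)cl_local_limit);
        printf("LDS used by this kernel per WG: %llu bytes\n",
               (unsigned long long)lds_per_wg);
        if (lds_per_wg > 0)
            printf("LDS-limited workgroups per CU: %llu\n",
                   (unsigned long long)(physical_lds_per_cu / lds_per_wg));
    }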


binghy wrote:



2) I don't understand how to map registers to the rest. I mean, according to the AMD Accelerated Parallel Processing guide, GCN devices have 4 vector units (SIMD units), each with 16 processing elements, and each processing element maps to one ALU (physically) or one work-item (from the software point of view).


     a) So what does "256 registers per SIMD unit (1/4 CU)" mean?


     b) Are registers mapped to wavefronts ("If each wavefront uses 16 registers you may have up to 16 wavefronts on the SIMD unit")?


     c) From the sentence above, it seems that there would be 64 wavefronts for the CU. I got lost with the numbers and element mapping. Is 64 the number of work-items contained in a single wavefront?


(2a)

Think at a SIMD lane as a 64-columns row. Each column counts 256 32-bit registers "in column". If you visualize this that way, using a variable / register instead of another is equivalent to using "a row" in this matrix of registers. All VALU operations must index the same register / row, and bad things will happen if they don't.
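If it helps to see the sizes involved, the figures above work out as follows (taking the 64-lane and 256-registers-per-lane numbers at face value):

    64 lanes x 256 VGPRs x 4 bytes = 65536 bytes = 64 KiB of vector registers per SIMD
    64 KiB per SIMD x 4 SIMDs      = 256 KiB of vector registers per CU

so a kernel that needs N VGPRs occupies N of the 256 "rows" in every lane it runs on.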

(2b)

Yes and no. You could have 16 wavefronts in theory (but if you're thinking "in theory" you're probably not counting registers either). In practice there's a hardware limit of 10 wavefronts per SIMD lane <--> 40 per CU. It is a fairly generous limit.

(2c)

I'm not sure what you are trying to ask here, but the number of WIs in a wavefront is a hardware design decision. For AMD GCN it's 16x4 = 64 elements.


binghy wrote:



Moreover, I only work with a laptop graphics card. While desktop GPUs are well documented, this is not true for laptop GPUs. I had a lot of difficulty (cross-referencing information across different web sources) working out that my current laptop GPU (AMD Radeon R9 M290X) is, first of all, equivalent to the AMD Radeon HD 8970M, and that it is part of the "Solar System" family (Neptune architecture), which is in turn the mobility version of the "Sea Islands" family of desktop GPUs. I hope my deductions are right, otherwise I'll go crazy once again. Please, could you write a clear review of the laptop/desktop GPU families/architectures, at least highlighting which family each graphics card maps to?


Agreed. The latest version of CodeXL has a very nice hierarchical list of all the GPUs produced. It's not a complete list in terms of shelf names, but that's something. They are grouped by architecture revision, so hopefully they are enough for any use... I don't think the M290X is there, unfortunately.

First of all, thank you so much for replying to me so soon. I really appreciate it.

Anyway, even if some points are becoming a bit clearer, some things are still messy in my head.

First of all, I admit I have not read the AMD guide to the end, so I missed the sections about maximum WG scheduling. I am still working through the memory optimization section, which is quite hard. But even though I still miss some chapters, some doubts arise about things in your answer. Please, let's limit the discussion to the GCN device family only.

Just to summarize:

1) Resource limits on active wavefronts

AMD GPUs have two important global resource constraints that limit the number of in-flight wavefronts:

• Southern Islands devices support a maximum of 16 work-groups per CU if a work-group is larger than one wavefront.

• The maximum number of wavefronts that can be scheduled to a CU is 40, or 10 per Vector Unit.

     I did not know about pipelining WG execution. But...

     Example: if a work-group is larger than one wavefront (let's say it's the maximum size, so WG = 256, hence 4 WFs), for a maximum of 16 WGs per CU, I should obtain a maximum number of WFs equal to: 16 WGs * 4 WFs/WG = 64 WFs per CU.

     Why 40?

2) "So, if the limiting factor is VGPR=84 you will get three kernel instances running on each SIMD lane."

     Still have to read and understand about VGPRs. Is VGPRs = 84 an example limit or a physical limit?

3) "At each moment there is only 1 wavefront (and thus workgroup) active for each SIMD and 1-4 WGs per CU as a WG can take up to 4 wavefronts."

     So the sentence written in every book and guide, "Each work-group executes on a single compute unit, and each compute unit executes only one work-group at a time", seems incorrect to me. Or it depends on the WG size.

    

     Moreover, please, let's continue to call things by their own name. A wavefront (WF) is a collection of WIs (64 on newer GPUs, 32 on older GPUs), and a WG is a collection of wavefronts, from a minimum of 1 WF up to a maximum of 4 WFs. I guess a huge number of people use 4 WFs per WG to maximize performance (256 WIs per WG), so 1 WG = 1 WF is just one case.

4) "To have some chance at hiding memory latency, you need to keep at least one wavefront ready to go at SIMD level besides the one you're mangling so that's your 8 workgroups"

     8 WGs? Where does it come from?

5) Got it about sequential/concurrent execution. Sequential on the single CU (SIMD lane / CU level), concurrent among different CUs (device level).

6) "Think at a SIMD lane as a 64-columns row. Each column counts 256 32-bit registers "in column". If you visualize this that way, using a variable / register instead of another is equivalent to using "a row" in this matrix of registers. All VALU operations must index the same register / row, and bad things will happen if they don't."

     1 SIMD lane = 1 VU, or 1 SIMD lane = a collection of 4 VUs (GCN devices)? You wrote "Think at a SIMD lane as a 64-columns row", while from the guide it seems to me that there is a correspondence between 1 VU and 1 SIMD unit. This would agree with the sentence "a wavefront is completed in four clock cycles".

     Moreover, is 1 vector unit (VU) made of 16 processing elements (PEs)? Is 1 PE == 1 WI (for GCN devices)?

     According to the matrix-of-registers example, can each element of the SIMD lane (hence each PE, hence each WI?) access 256 different registers?

I hope I have not stressed you too much, but I need clarifications before moving on.

Best regards,

Marco

0 Likes

Luckily, my testing today requires a few minutes to stabilize results, so here we go again...


binghy wrote:



1) Resource limits on active wavefronts


AMD GPUs have two important global resource constraints that limit the number of in-flight wavefronts:


• Southern Islands devices support a maximum of 16 work-groups per CU if a work-group is larger than one wavefront.


• The maximum number of wavefronts that can be scheduled to a CU is 40, or 10 per Vector Unit.



     I did not know about pipelining WG execution. But...


     Example: if a work-group is larger than one wavefront (let's say it's the maximum size, so WG = 256, hence 4 WFs), for a maximum of 16 WGs per CU, I should obtain a maximum number of WFs equal to: 16 WGs * 4 WFs/WG = 64 WFs per CU.


     Why 40?


Because it's the hardware limit. A WF is a "wide thread". It has an instruction pointer and other state, such as the number of pending I/O operations. This state must go somewhere in high-performance, low-latency memory to be used effectively, so it takes serious die space. The engineers decided 10 would be enough to saturate the ALUs. Besides state, the hardware scheduler might require more datapaths to mangle 16 states instead of 10. You can be fairly certain the details are AMD secret sauce.


2) "So, if the limiting factor is VGPR=84 you will get three kernel instances running on each SIMD lane."



     Still have to read and understand about VGPRs. Is VGPRs = 84 an example limit or a physical limit?


It's a random count I pulled out of thin air. Say we have an example kernel which, compiled with a certain driver, ends up requiring 84 VGPRs.


3) "At each moment there is only 1 wavefront (and thus workgroup) active for each SIMD and 1-4 WGs per CU as a WG can take up to 4 wavefronts."



     So the sentence written in every book and guide, "Each work-group executes on a single compute unit, and each compute unit executes only one work-group at a time", seems incorrect to me. Or it depends on the WG size.


It depends on WG size. In theory workgroups can be of arbitrary dimensions up to a device limit (CL_DEVICE_MAX_WORK_GROUP_SIZE), but in practice they should count 64n WIs on AMD GCN. The biggest workgroup can have up to CL_DEVICE_MAX_WORK_GROUP_SIZE = 256 WIs on GCN, so workgroups always fit a CU and can use LDS without tripping over expensive paths.
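To make the "64n WIs" point concrete, here is a minimal OpenCL C sketch (the kernel itself is hypothetical, only for illustration): the standard reqd_work_group_size attribute pins the workgroup to exactly one 64-WI wavefront, and the runtime will reject any other local size at enqueue time.

    // Hypothetical kernel pinned to a 64-WI (one-wavefront) workgroup.
    __attribute__((reqd_work_group_size(64, 1, 1)))
    __kernel void scale64(__global float *data, float k)
    {
        size_t gid = get_global_id(0);
        data[gid] *= k;
    }

The matching clEnqueueNDRangeKernel call would then pass a local size of {64} and a global size that is a multiple of 64.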


Moreover, please, let's continue to call things by their own name. A wavefront (WF) is a collection of WIs (64 on newer GPUs, 32 on older GPUs), and a WG is a collection of wavefronts, from a minimum of 1 WF up to a maximum of 4 WFs. I guess a huge number of people use 4 WFs per WG to maximize performance (256 WIs per WG), so 1 WG = 1 WF is just one case.


You will find out in the AMD APP guide that 1 WG = 1 WF is just one case, but an important one, as it requires no explicit syncing across WIs in the same WG = WF. That is not true in the other cases.
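For reference, the syncing in question is the local barrier you need as soon as WIs from different wavefronts of the same workgroup share LDS. A minimal, purely illustrative fragment (the kernel name and sizes are my own, not from the guide):

    // Hypothetical fragment: a 256-WI workgroup (4 wavefronts) sharing LDS.
    // The barrier is required because WIs of different wavefronts in the same
    // workgroup do not run in lockstep; with a 64-WI workgroup (1 WG = 1 WF)
    // the whole group advances together and this barrier is essentially free.
    __kernel void sum_pairs(__global const float *in, __global float *out)
    {
        __local float tile[256];
        size_t lid = get_local_id(0);

        tile[lid] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);   // make every WI's write visible to the whole WG

        if (lid < 128)
            out[get_group_id(0) * 128 + lid] = tile[2 * lid] + tile[2 * lid + 1];
    }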


4) "To have some chance at hiding memory latency, you need to keep at least one wavefront ready to go at SIMD level besides the one you're mangling so that's your 8 workgroups"



     8 WGs? Where does it come from?


1 is the WF just mangled this clock. It has stalled, so there's nothing to do for it next clock.

So we keep another one ready to go while the memory controller does its task, for each SIMD in a CU.

(1+1) x 4 SIMDs = 8 WFs. Or 8 WGs, if the workgroup size is 64.

The number originated from the discussion you linked but again, I think the original statement "tahiti card with 32 cores can have 8 workgroups per core due to barrier resources" might relate to a constrained scenario, as AMD APP reports twice the amount. I am a bit confused myself when it comes to that count.


5) Got it about sequential/concurrent execution. Sequential on the single CU (SIMD lane / CU level), concurrent among different CUs (device level).


Careful with "concurrent among different CUs": a single WG is never split across CUs; as far as I recall, 1 WG always goes to 1 CU. The CUs have internal, finer-grained schedulers. The big device-level scheduler assigns WGs to CUs, and CUs only report when a WG is completed so they can get more work. Because they timeslice WFs/WGs internally, the device-level scheduler just sees multiple WGs assigned to a CU; I don't think it even knows whether they're active or not.

Reading again my previous statement, I think it's prone to misinterpretation.

The CU knows it must time-slice the various WFs so they run in some sequence, as there are more WFs than SIMD lanes. However, they are not run in the "expected" sequence. I don't think the SIMD lane runs them in the "right" sequence either, but they have to be timesliced. I don't think the SIMD lanes schedule themselves... perhaps they do, and in that case even the CU would have to consider them fully concurrent. For sure the SIMDs are independent, so at a high level they have to be considered concurrent. But at a low level the scheduler knows they run in some sequence, and it is my understanding this happens at CU level so that multiple SIMDs can be synchronized when needed.


6) "Think at a SIMD lane as a 64-columns row. Each column counts 256 32-bit registers "in column". If you visualize this that way, using a variable / register instead of another is equivalent to using "a row" in this matrix of registers. All VALU operations must index the same register / row, and bad things will happen if they don't."



     1 SIMD lane = 1 VU, or 1 SIMD lane = a collection of 4 VUs (GCN devices)? You wrote "Think at a SIMD lane as a 64-columns row", while from the guide it seems to me that there is a correspondence between 1 VU and 1 SIMD unit. This would agree with the sentence "a wavefront is completed in four clock cycles".


You're on the right path.

The catch is that the SIMD lanes are really 16 elements wide. See pages 28-29 of this presentation. This is also why I previously wrote that a WG must be 16x4 WIs. As you go further down the optimization guide you'll see it's sometimes worth thinking of WFs as 32x2 elements instead.
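To tie this back to the "four clock cycles" sentence from the guide, the relation (as I understand it) is simply:

    64 WIs per wavefront / 16 lanes per SIMD = 4 clocks to issue one wavefront instruction

so each 16-wide SIMD needs 4 clocks to push a whole 64-WI wavefront through one instruction.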

0 Likes

Thank you very, very much for your detailed explanation.

Since explanations and statements sometimes seem to contradict each other across the AMD APP guide, presentations and books, I'm trying to build a complete picture by collecting all the correct information.

Thank you again, very helpful.

Good luck for your testing!

Best regards,

Marco

0 Likes