
Work groups per compute unit

Question asked by binghy on Jan 27, 2015
Latest reply on Jan 30, 2015 by binghy

Hi everybody,

I would like to ask about the number of work-groups per compute unit.

I have read this sentence many times: "a processing resource capable of supporting a work-group is called a compute unit. Each work-group executes on a single compute unit, and each compute unit executes only one work-group at a time". So work-groups never run concurrently on the same compute unit, and nothing is guaranteed about concurrency of work-groups across different compute units, right? They may execute concurrently, or they may not.

Then, searching the forum for explanations, I found this discussion: Re: How do I get the number of work groups?

There it is stated that "a tahiti card with 32 cores can have 8 workgroups per core due to barrier resources", and later: "So, thinking of Tahiti: 1) You can only have up to 8 workgroups per CU". What does this mean? Does it contradict the sentence I quoted above? Or is it a new capability of GCN cards, related to the Asynchronous Compute Engines? (Although I think ACEs only concern different tasks/kernels, or a single task whose input buffer has been split among the hardware queues and executed concurrently on the device.)

I would also like to ask a few more questions about this part of that post:


So, thinking of Tahiti:

1) You can only have up to 8 workgroups per CU

2) There is 64kBytes of LDS per CU. If each workgroup uses 16kBytes you may have up to 4 workgroups on the CU.

3) There are 256 registers per SIMD unit (1/4 CU). If each wavefront uses 16 registers you may have to 16 wavefronts on the SIMD unit, or 64 for the CU.

4) There are 32 CUs so you can multiply the per-CU numbers up accordingly.
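
To make the arithmetic in the quoted list concrete, here is a minimal sketch in Python. All constants are the figures quoted in the post for Tahiti and are taken as given, not verified. The function names are hypothetical, and the sketch ignores any other hardware limit (for example, a cap on resident wavefronts per SIMD).

```python
# Figures quoted in the post for Tahiti; assumptions, not verified values.
LDS_BYTES_PER_CU = 64 * 1024      # "64kBytes of LDS per CU"
MAX_WORKGROUPS_PER_CU = 8         # "up to 8 workgroups per CU"
REGISTERS_PER_SIMD = 256          # "256 registers per SIMD unit"
SIMDS_PER_CU = 4                  # a SIMD unit is "1/4 CU"
NUM_CUS = 32                      # "There are 32 CUs"


def workgroups_per_cu(lds_bytes_per_workgroup):
    """Work-groups that fit on one CU, limited by LDS use and the quoted cap."""
    if lds_bytes_per_workgroup == 0:
        # No LDS used: only the quoted per-CU cap applies.
        return MAX_WORKGROUPS_PER_CU
    by_lds = LDS_BYTES_PER_CU // lds_bytes_per_workgroup
    return min(by_lds, MAX_WORKGROUPS_PER_CU)


def wavefronts_per_cu(registers_per_wavefront):
    """Wavefronts that fit on one CU, limited only by register use."""
    per_simd = REGISTERS_PER_SIMD // registers_per_wavefront
    return per_simd * SIMDS_PER_CU


print(workgroups_per_cu(16 * 1024))            # 4 work-groups per CU
print(wavefronts_per_cu(16))                   # 64 wavefronts per CU
print(workgroups_per_cu(16 * 1024) * NUM_CUS)  # 128 work-groups on the device
```

The effective limit is the smallest of the individual limits, which is why the LDS-based count is clamped with `min` against the quoted cap of 8.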



1) "If each workgroup uses 16kBytes you may have up to 4 workgroups on the CU"

     a) Does it mean concurrently or scheduled serially?

     b) If scheduled serially, is the memory completely freed after the 4 groups have been processed, so that other groups can be processed?

     c) To reach 8 work-groups per CU, does each work-group have to use at most 8192 bytes of LDS? (65536 bytes of LDS / 8)
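
The division in question c) can be written out as a small sketch. It assumes the 64 KB LDS figure quoted above is the only constraint, and the function name is hypothetical.

```python
LDS_BYTES_PER_CU = 64 * 1024  # figure quoted in the post, not verified here


def lds_budget_per_workgroup(target_workgroups_per_cu):
    """Largest LDS size per work-group that still lets the target count fit."""
    return LDS_BYTES_PER_CU // target_workgroups_per_cu


print(lds_budget_per_workgroup(8))  # 8192 bytes
print(lds_budget_per_workgroup(4))  # 16384 bytes
```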


2) I don't understand how registers map to the rest. According to the AMD Accelerated Parallel Processing guide, GCN devices have 4 vector units (SIMD units), each with 16 processing elements, and each processing element maps to one ALU (physically) or one work-item (from the software point of view).

     a) So what are the 256 registers per SIMD unit (1/4 CU)?

     b) Are registers mapped to wavefronts ("If each wavefront uses 16 registers you may have to 16 wavefronts on the SIMD unit")?

     c) From the sentence above, it seems there will be 64 wavefronts per CU. I got lost with the numbers and the element mapping. Is 64 the number of work-items contained in a single wavefront?


Finally, a suggestion about the AMD APP guide. I read the 2012 version in the past, and its description of the hardware elements was very confusing. I have now downloaded the 2014 version, which I am still reading. It is better organized and a bit clearer! =) Even so, Appendix D, about pre-GCN devices, still contains sentences like this: "Processing elements, in turn, contain numerous processing elements". That makes no sense at all. Could you please write, in the near future, a clear, complete and understandable overview of the hardware elements and how they map to each other?

Moreover, I only work with a laptop graphics card. While desktop GPUs are well documented, laptop GPUs are not. I had a hard time (cross-checking information from different web sources) working out that my current laptop GPU (AMD Radeon R9 M290X) is first of all equivalent to the AMD Radeon HD 8970M, and that it is part of the "Solar System" family (Neptune architecture), which is in turn the mobile version of the "Sea Islands" family for desktop GPUs. I hope my deductions are right, otherwise I will go crazy once again. Could you please write a clear overview of the laptop/desktop GPU families and architectures, at least stating which family each graphics card maps to?


Thank you in advance.


Hope to hear from you soon to clear up my doubts!