For CPUs, AMD supports 1024 workitems per workgroup.
NV has supported 1024 on its GPUs from the very beginning...
AMD has the same or a larger amount of shared memory and the same or a larger register file... so why such a limitation?
Why only 4 waves per workgroup? If one needs to share the whole LDS, one is limited to only 4 wavefronts per CU, no matter how many registers remain. But more waves in flight would result in better CU usage and latency hiding... So this limit of 256 looks quite artificial and bad for performance in some cases.
Hence the question: why? Why was this arbitrary size of 256 chosen? Are the reasons still relevant for new GPUs, or could they have a bigger workgroup size?
The AMD CU supports four simultaneous wave-fronts (of 64 work-items each). Each of these wave-fronts is executed on a processing element (PE) containing 16 ALUs. There are four PEs per CU, so one CU processes four wave-fronts, or 256 work-items, at a time.
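As a rough sketch of that arithmetic (illustration only: the constants just restate the figures above, the function names are made up, and it assumes each ALU lane handles one work-item per cycle):

```python
# Illustrative arithmetic only; the constants restate the figures quoted above.
WAVEFRONT_SIZE = 64   # work-items per wave-front
ALUS_PER_PE = 16      # ALU lanes per processing element (PE)
PES_PER_CU = 4        # processing elements per compute unit (CU)


def cycles_per_wavefront_instruction(wavefront_size=WAVEFRONT_SIZE,
                                     lanes=ALUS_PER_PE):
    """Cycles one PE needs to push a whole wave-front through its lanes.

    Assumes one work-item per lane per cycle (a simplification).
    """
    return -(-wavefront_size // lanes)  # ceiling division


def work_items_per_cu(wavefront_size=WAVEFRONT_SIZE, pes=PES_PER_CU):
    """Work-items in flight when every PE holds exactly one wave-front."""
    return wavefront_size * pes


if __name__ == "__main__":
    print(cycles_per_wavefront_instruction())  # 4
    print(work_items_per_cu())                 # 256
```

So a 64-wide wave-front takes four passes over a 16-lane PE, and four PEs with one wave-front each give the 256 figure.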
The limit of 256 work-items per work-group is imposed to give the scheduler a chance to schedule wave-fronts using SMT (simultaneous multi-threading). In SMT, two independent threads (or, in the AMD GPU context, two independent wave-fronts) share the same computing resource, and the scheduler, knowing that they are independent, can schedule them independently of each other. This is also used in Intel's processors with eight virtual cores, where each physical core uses SMT to process two independent threads simultaneously.
Coming back to AMD GPUs: since work-groups execute independently of each other, a limit of 256 lets the scheduler schedule work-groups in SMT fashion. If a wave-front from one work-group is waiting, the scheduler can easily schedule a wave-front from another work-group, exploiting their independence. With a larger work-group size, it may happen that only one work-group can run on a CU, and since independence between work-items of the same work-group is not guaranteed, SMT cannot be used, which hurts performance.
Thanks. So it can be said that on AMD GPUs the minimal scheduling unit is not the wave but the workgroup. The scheduler can't swap execution between waves inside the same workgroup.
Then the situation I described (a single workgroup uses all of the LDS, so only one workgroup per CU is allowed) can't be solved on AMD GPUs by raising the workgroup limit, because the AMD GPU scheduler is unable to swap waves of the same workgroup on a PE. This part looks a little unbalanced, then: to make the best use of a PE, which needs to swap execution while waiting for data from memory, one has to limit LDS usage per workgroup so that several workgroups fit in the LDS simultaneously.
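To put numbers on that trade-off, here is a minimal sketch. The 64 KB LDS size is an assumption chosen for illustration, the function name is made up, and the model looks at the LDS limit only (it ignores registers and any hardware cap on resident waves):

```python
def resident_wavefronts(lds_per_cu_bytes, lds_per_group_bytes, group_size,
                        wavefront_size=64):
    """Return (workgroups per CU, wavefronts per CU) under the LDS limit only.

    Hypothetical helper: register pressure and hardware caps on resident
    waves are deliberately left out of this model.
    """
    if lds_per_group_bytes <= 0:
        raise ValueError("this sketch only models kernels that use LDS")
    waves_per_group = -(-group_size // wavefront_size)  # ceiling division
    groups_per_cu = lds_per_cu_bytes // lds_per_group_bytes
    return groups_per_cu, groups_per_cu * waves_per_group


if __name__ == "__main__":
    LDS_PER_CU = 64 * 1024  # assumed LDS size per CU, for illustration only

    # One 256-item workgroup that takes the whole LDS:
    print(resident_wavefronts(LDS_PER_CU, LDS_PER_CU, 256))       # (1, 4)
    # The same workgroup size using a quarter of the LDS:
    print(resident_wavefronts(LDS_PER_CU, LDS_PER_CU // 4, 256))  # (4, 16)
```

With the whole LDS taken by one workgroup, only its 4 waves are resident; at a quarter of the LDS per workgroup, 16 waves from 4 independent workgroups can be resident in this simplified model.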
If you know, could you compare AMD's approach in this area not with Intel CPUs but with NV's GPU approach, where much larger workgroups are allowed? One NV CU can hardly process, let's say, 1024 workitems (or 2048 on newer cards) without swapping waves. So one can infer that the NV architecture allows swapping waves on a PE even when the waves belong to the same large workgroup, a feature that AMD GPUs lack? Could you please comment on this?
I was surprised about this as well.
I suppose this is a problem with regard to work-group functions which would mandate inter-CU communication?