For CPU devices, AMD supports workgroups of 1024 workitems.
NVIDIA GPUs have supported 1024 for many generations now...
AMD GPUs have the same or a larger amount of shared memory and the same or a larger register file... so why such a limitation?
Why only 4 wavefronts per workgroup? If a kernel needs to share the whole LDS within a single workgroup, it is limited to only 4 wavefronts per CU, no matter how many registers remain free. More wavefronts in flight would give better CU utilization and latency hiding... So this 256 limit looks quite artificial and bad for performance in some cases.
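To make the arithmetic behind "4 wavefronts" explicit, here is a minimal sketch in C. It assumes a wavefront of 64 workitems (as on AMD GCN hardware); the function name `wavefronts_per_workgroup` is just an illustrative helper, not part of any API.

```c
#include <assert.h>

/* Assumption: 64 workitems per wavefront, as on AMD GCN hardware. */
#define WAVEFRONT_SIZE 64u

/* Hypothetical helper: number of wavefronts needed to hold one
 * workgroup, rounding up when the workgroup size is not a multiple
 * of the wavefront size. */
unsigned wavefronts_per_workgroup(unsigned workgroup_size,
                                  unsigned wavefront_size)
{
    assert(wavefront_size > 0);
    return (workgroup_size + wavefront_size - 1) / wavefront_size;
}
```

Under that assumption, `wavefronts_per_workgroup(256, WAVEFRONT_SIZE)` gives 4, while a 1024-workitem workgroup would give 16. So a kernel whose single workgroup claims the whole LDS can keep at most 4 wavefronts resident on the CU with the current limit.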
Hence the question: why was this seemingly arbitrary size of 256 chosen? Are the reasons still relevant for new GPUs, or could they support a bigger workgroup size?