Just to clarify: is executing 4 workgroups of 1 wavefront each per CU going to be as fast as executing
1 workgroup of 256 work items (ignoring the minor overhead associated with workgroup scheduling)?
I.e., is it OK to run 1 workgroup of size 256 per CU?
And if yes: if 4 wavefronts work on the same 64 data paths while sharing intermediate results via LDS,
will the same work on 64 data paths take 1/4 of the time compared to working on 256 data paths with one thread
doing all the work, instead of 4 threads coordinating per data path via LDS (ignoring the overhead that would
result from wavefronts not reaching barriers at exactly the same time)?
Also, the wavefronts would be executing barrier instructions in synchronization with other threads, but the wavefronts
would diverge, so the barriers would not be at the same address in the instruction stream. Is that legal?
Thanks for helping.
Further to nou's comment, just one workgroup of 256 work items per CU is not likely to occupy the device efficiently. With only four wavefronts resident there is little other work to switch to, so any cache miss or barrier is likely to leave you with gaps in the instruction stream.
There isn't any significant difference between four groups of 64 and one of 256 unless you benefit from reuse between the wavefronts or, in the negative direction, the larger groups prevent you from fully occupying the CU, which reduces the total number of resident wavefronts and the device's ability to cover for stalls.
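To make the occupancy point concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from any vendor tool. The wavefront size of 64 follows from the numbers in this thread, but the cap of 40 wavefronts per CU and the 64 KiB of LDS per CU are assumptions for a GCN-style device, and the function name `resident_wavefronts` is made up for illustration. Register pressure is ignored entirely, so treat the results as an upper bound and check the limits for your actual hardware.

```python
# Toy occupancy estimate. All hardware limits below are assumptions; check your device.
WAVEFRONT_SIZE = 64            # work-items per wavefront (256 work-items = 4 wavefronts)
MAX_WAVES_PER_CU = 40          # assumed cap on resident wavefronts per CU
LDS_BYTES_PER_CU = 64 * 1024   # assumed LDS capacity per CU


def resident_wavefronts(group_size, lds_bytes_per_group):
    """Upper bound on wavefronts resident on one CU (registers ignored)."""
    # A workgroup occupies a whole number of wavefronts (round up).
    waves_per_group = -(-group_size // WAVEFRONT_SIZE)
    # Workgroups are placed whole, so both limits apply per workgroup.
    groups_by_waves = MAX_WAVES_PER_CU // waves_per_group
    if lds_bytes_per_group > 0:
        groups_by_lds = LDS_BYTES_PER_CU // lds_bytes_per_group
    else:
        groups_by_lds = groups_by_waves  # no LDS use, so LDS is not a limit
    return min(groups_by_waves, groups_by_lds) * waves_per_group


# Same LDS per work-item (96 bytes), different workgroup sizes.
for size, lds in [(64, 0), (256, 0), (64, 6 * 1024), (256, 24 * 1024)]:
    print(size, lds, resident_wavefronts(size, lds))
```

Under these assumptions, with no LDS use both group sizes can reach the same number of resident wavefronts, which matches "no significant difference". Once each group reserves LDS, the larger group is placed in coarser chunks and can end up with fewer resident wavefronts for the same per-work-item footprint. And a single group of 256 per CU means only 4 wavefronts resident, far below the assumed cap, however the rest works out.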