just to clarify: is executing 4 workgroups of 1 wavefront each on a CU going to be as fast as executing
1 workgroup of 256 work items (ignoring the minor overhead associated with workgroup scheduling)?
i.e. is it ok to run 1 workgroup of size 256 per CU?
and if yes: if 4 wavefronts work on the same 64 data paths while sharing intermediate results via LDS,
will the work on those 64 data paths take 1/4 of the time compared to working on 256 data paths with one thread
per data path doing all the work, instead of 4 threads per data path coordinating via LDS? (ignoring the overhead that would
result from wavefronts not reaching barriers at exactly the same time).
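
to spell out the arithmetic i have in mind, here is a tiny idealized model in python. it is only a sketch: the function names and the work amount of 1000 are made up, and it assumes that a CU runs 4 wavefronts of 64 work items concurrently, that the work splits perfectly 4 ways, and that barriers and LDS traffic cost nothing.

```python
# Idealized model only. Assumptions (mine, not measured):
#  - one CU runs 4 wavefronts of 64 work items concurrently
#  - the per-path work splits perfectly across 4 cooperating work items
#  - barriers and LDS traffic are free
WAVEFRONT_SIZE = 64
WAVEFRONTS_PER_CU = 4


def one_thread_per_path(work_per_path):
    """256 data paths, each handled by a single work item."""
    paths = WAVEFRONT_SIZE * WAVEFRONTS_PER_CU
    time = work_per_path  # each work item does all the work for its path
    return paths, time


def four_threads_per_path(work_per_path):
    """64 data paths, each shared by 4 work items (one per wavefront)."""
    paths = WAVEFRONT_SIZE
    time = work_per_path / WAVEFRONTS_PER_CU  # each work item does a quarter
    return paths, time


paths_a, time_a = one_thread_per_path(1000)
paths_b, time_b = four_threads_per_path(1000)

# Ratio of time per batch (4-threads-per-path vs 1-thread-per-path).
print("latency ratio:", time_b / time_a)
# Ratio of data paths finished per unit of time.
print("throughput ratio:", (paths_b / time_b) / (paths_a / time_a))
```

if this model holds, a batch of 64 data paths finishes in a quarter of the time, while the number of data paths processed per unit of time stays the same. whether real hardware gets close to that is exactly what i am asking.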
also, each wavefront would execute its barrier instructions in sync with the other wavefronts of the workgroup, but the wavefronts
would follow different code paths, so the barriers would not be at the same address in the instruction stream. is that legal?
thanks for helping.