I have two questions that I can explain using this scenario:
Let's say I have a kernel that can be grouped into workgroups of 64 work-items, i.e. 1 wavefront. I get this number from the clGetKernelWorkGroupInfo API of OpenCL. I assume this API calculates it based on register allocation. From the same API I can also get the local memory usage of the kernel. Dividing the total local memory (x 2 for GCN arch) by the kernel's local memory usage, I get the maximum number of workgroups I can fit per compute unit (CU). Multiplying that by the number of CUs, I get the number of workgroups I can fit on the GPU; let's call this number "workgroup-gpu".
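To make the arithmetic in the scenario concrete, here is a minimal sketch in Python. It assumes the inputs have already been queried through OpenCL (clGetKernelWorkGroupInfo with CL_KERNEL_LOCAL_MEM_SIZE, and clGetDeviceInfo with CL_DEVICE_LOCAL_MEM_SIZE and CL_DEVICE_MAX_COMPUTE_UNITS). The function names and the example numbers are hypothetical, and the factor of 2 is the GCN assumption stated above, not something reported by the API. It only models the local memory limit; registers and other resources can lower the real figure.

```python
def workgroups_per_cu(device_local_mem_bytes, kernel_local_mem_bytes, lds_factor=2):
    """Upper bound on workgroups per CU imposed by local memory alone.

    lds_factor=2 encodes the "x 2 for GCN arch" assumption from the question.
    Returns None when the kernel uses no local memory, because local memory
    is then not the limiting resource.
    """
    if kernel_local_mem_bytes == 0:
        return None
    return (device_local_mem_bytes * lds_factor) // kernel_local_mem_bytes


def workgroups_per_gpu(per_cu, compute_units):
    """The "workgroup-gpu" number: workgroups per CU times the number of CUs."""
    return per_cu * compute_units


if __name__ == "__main__":
    # Hypothetical values: 32 KiB reported local memory, 8 KiB used by the
    # kernel, 32 compute units.
    per_cu = workgroups_per_cu(32 * 1024, 8 * 1024)
    print(per_cu, workgroups_per_gpu(per_cu, 32))
```

With these made-up inputs, local memory alone would allow 8 workgroups per CU and 256 across the GPU.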
Well, it helps hide memory latency: when the compute unit executes a read instruction from the first wavefront, it can take several cycles until the data arrives from memory. In the meantime it can execute the same read instruction from the second wavefront, then the third, and so on. By the time the first wavefront is scheduled again, the result of its read instruction is ready.
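A rough way to see the effect is a toy occupancy model, sketched below in Python. It is an assumption-laden simplification, not how the hardware scheduler actually works: each wavefront alternates between a fixed number of compute cycles and a fixed memory latency, and the compute unit can switch to any ready wavefront at no cost. The function name and the cycle counts are hypothetical.

```python
def alu_utilization(wavefronts, compute_cycles, latency_cycles):
    """Fraction of cycles the CU spends computing in the toy model.

    One wavefront is busy for compute_cycles out of every
    (compute_cycles + latency_cycles) cycles; extra wavefronts fill the
    idle cycles until the CU is fully busy.
    """
    if wavefronts <= 0 or compute_cycles <= 0:
        return 0.0
    period = compute_cycles + latency_cycles
    return min(1.0, wavefronts * compute_cycles / period)


if __name__ == "__main__":
    # Hypothetical numbers: 10 compute cycles per read, 90 cycles of latency.
    for n in (1, 5, 10):
        print(n, alu_utilization(n, 10, 90))
```

Under these made-up numbers a single wavefront keeps the compute unit busy only a small fraction of the time, and the idle cycles shrink as more wavefronts become available to switch to.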
I understand that wavefronts do that. But is it the same with workgroups as well? If only one workgroup is executed in a compute unit at a time, and that workgroup consists of only one wavefront, then how does the GPU hide memory latency? By switching workgroups?
Let me clarify a few points.