I have two questions that I can best explain with the following scenario:
Let's say I have a kernel that runs as a workgroup of 64 work-items, i.e. 1 wavefront. I get this number from the clGetKernelWorkGroupInfo API of OpenCL; I assume the API derives it from register allocation. From the same API I can also get the kernel's local memory usage. Dividing the total local memory per compute unit (x2 for the GCN architecture) by the kernel's local memory usage gives me the maximum number of workgroups I can fit per compute unit (CU). Multiplying by the number of CUs then gives the number of workgroups that fit on the whole GPU; let's call this number "workgroup-gpu".
- I remember reading in the forums that only one workgroup executes at a time on a CU. If that is true, how do extra workgroups per CU help hide memory latency?
- Is there any reason to enqueue more than "workgroup-gpu" workgroups on the GPU, given that the surplus workgroups are executed sequentially?