Archives Discussions

skanur · 09-23-2014

Hi,

I have two questions that I can explain using this scenario.

Scenario:

Let's say I have a kernel that can be launched with a workgroup of 64 work-items, i.e. 1 wavefront. I get this number from the clGetKernelWorkGroupInfo API of OpenCL; I assume the API calculates it from register allocation. From the same API I can also get the kernel's local memory usage. Dividing the total local memory (x 2 for GCN arch) by the kernel's local memory usage gives the maximum number of workgroups I can fit per compute unit (CU). From that I can get the number of workgroups I can fit on the whole GPU; let's call this number "workgroup-gpu".
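The arithmetic in this scenario can be sketched in C roughly as follows. This is only an illustration of the division described above, not a full occupancy calculation: the function names are made up for the example, and the inputs are assumed to come from clGetDeviceInfo (CL_DEVICE_LOCAL_MEM_SIZE, CL_DEVICE_MAX_COMPUTE_UNITS) and clGetKernelWorkGroupInfo (CL_KERNEL_LOCAL_MEM_SIZE), with any "x 2" adjustment applied by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: how many workgroups fit in one CU if local memory
 * were the only limit. lds_bytes_per_cu is the local memory available per
 * CU (after any adjustment the caller applies); kernel_local_bytes is the
 * kernel's local memory usage per workgroup. */
size_t lds_limited_workgroups_per_cu(size_t lds_bytes_per_cu,
                                     size_t kernel_local_bytes)
{
    if (kernel_local_bytes == 0) {
        /* The kernel uses no local memory, so local memory is not a limit. */
        return SIZE_MAX;
    }
    return lds_bytes_per_cu / kernel_local_bytes;
}

/* Hypothetical helper: the "workgroup-gpu" number from the post, i.e. the
 * per-CU count multiplied by the number of compute units. Saturates at
 * SIZE_MAX instead of overflowing. */
size_t lds_limited_workgroups_on_gpu(size_t per_cu, size_t num_compute_units)
{
    assert(num_compute_units > 0);
    if (per_cu > SIZE_MAX / num_compute_units) {
        return SIZE_MAX;
    }
    return per_cu * num_compute_units;
}
```

For example, 64 KB of local memory per CU and a kernel using 8 KB per workgroup gives 8 workgroups per CU, or 256 on a 32-CU device. Registers and wavefront slots can lower this number further.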

Question:

I remember reading in the forums that only one workgroup executes at a time on a CU. So how do extra workgroups per CU help hide memory latency?
Is there any other reason to put more than "workgroup-gpu" workgroups on the GPU, given that the rest are executed sequentially?

nou · 09-23-2014

Well, it helps hide memory latency because when you execute a read instruction from the first wavefront, it can take many cycles until the data arrives from memory. In the meantime the CU can execute the same read instruction from the second wavefront, then the third, and so on. By the time the first wavefront gets its turn again, the result of its read instruction is ready.
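A toy model in C can make this concrete. It is only a sketch under strong assumptions and not the real hardware scheduler: every instruction is a memory read, a read takes one cycle to issue, its result comes back `latency` cycles after issue, and each cycle the SIMD issues from the first wavefront whose previous read has completed. The name toy_busy_cycles and the TOY_MAX_WAVES limit are invented for the example.

```c
#include <assert.h>

#define TOY_MAX_WAVES 64  /* arbitrary limit for this toy model */

/* Returns how many of total_cycles cycles had an instruction issued.
 * ready_at[w] is the first cycle at which wavefront w may issue again,
 * i.e. the cycle at which its previous read has returned. */
unsigned toy_busy_cycles(unsigned num_waves, unsigned latency,
                         unsigned total_cycles)
{
    unsigned ready_at[TOY_MAX_WAVES] = {0};
    unsigned busy = 0;

    assert(num_waves >= 1 && num_waves <= TOY_MAX_WAVES);
    assert(latency >= 1);

    for (unsigned t = 0; t < total_cycles; ++t) {
        /* Pick the first wavefront that is not waiting on memory. */
        for (unsigned w = 0; w < num_waves; ++w) {
            if (ready_at[w] <= t) {
                ready_at[w] = t + latency;  /* its next read returns here */
                ++busy;
                break;
            }
        }
        /* If no wavefront was ready, this cycle is idle. */
    }
    return busy;
}
```

With a latency of 4 cycles, one wavefront keeps the SIMD busy in only a quarter of the cycles in this model, two wavefronts in half of them, and four or more in all of them. Real memory latency is far longer than 4 cycles, which is why many in-flight wavefronts are useful.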

skanur · 09-23-2014

I understand that wavefronts do that. But is it the same with workgroups as well? If only one workgroup is executed on a compute unit at a time, and that workgroup consists of only one wavefront, then how does the GPU hide memory latency? By switching workgroups?

dipak · 09-24-2014

Hi skanur,

Let me clarify a few points.

Work-items are processed in groups called wavefronts (WV). Wavefronts are like hardware threads: each has its own program counter and can run independently of the others.
A CU can hold one or more work-groups for processing. Each work-group is divided into one or more wavefronts, depending on the work-group size.
Each CU consists of one or more SIMD units. Each SIMD executes one instruction (a VLIW instruction in the case of a VLIW architecture) from one wavefront at a time. For example, in GCN each CU has four SIMDs, so it can execute four wavefronts simultaneously.
Each SIMD has a wavefront queue holding one or more in-flight wavefronts (possibly from different work-groups or different kernels). For example, in GCN the queue length is 10, so at most 40 wavefronts can be in flight in a CU.
During wavefront scheduling, one wavefront is chosen from the queue according to the scheduler's rules and assigned to the SIMD for execution. These in-flight wavefronts are what hide latency, so even work-groups of a single wavefront hide latency as long as several of them are resident on the CU.
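The GCN numbers above can be turned into a rough upper bound with a few lines of C. This is a sketch that counts wavefront slots only; register usage, local memory usage, and other hardware caps on resident work-groups can reduce the real number further. The constant and function names are made up for the example.

```c
#include <assert.h>

/* GCN figures quoted in this post. */
enum {
    GCN_WAVEFRONT_SIZE = 64,  /* work-items per wavefront */
    GCN_SIMDS_PER_CU   = 4,   /* SIMD units per compute unit */
    GCN_WAVES_PER_SIMD = 10   /* in-flight wavefronts per SIMD */
};

/* Number of wavefronts a work-group of wg_size work-items is split into
 * (rounded up, since a partially filled wavefront still takes a slot). */
unsigned gcn_wavefronts_per_workgroup(unsigned wg_size)
{
    assert(wg_size >= 1);
    return (wg_size + GCN_WAVEFRONT_SIZE - 1) / GCN_WAVEFRONT_SIZE;
}

/* Upper bound on work-groups resident on one CU, considering only the
 * 4 x 10 = 40 wavefront slots. */
unsigned gcn_max_workgroups_per_cu_by_waves(unsigned wg_size)
{
    unsigned slots = GCN_SIMDS_PER_CU * GCN_WAVES_PER_SIMD;
    return slots / gcn_wavefronts_per_workgroup(wg_size);
}
```

For a work-group of 64 work-items (one wavefront) this bound is 40 work-groups per CU; for 256 work-items (four wavefronts) it is 10.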

Regards,

skanur · 09-24-2014

dipak,

Thank you. That cleared things up.

Need for more workgroups