Well, it helps hide memory latency: when the SIMD issues a read instruction from the first wavefront, it can take several cycles until the data arrives from memory. In the meantime it can issue the same read instruction from the second wavefront, then the third. So by the time the first wavefront is scheduled again, the result of its read is already available.
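To make the timing concrete, here is a toy model of that overlap (the 4-cycle latency and one-load-per-cycle issue rate are made-up numbers, not real hardware figures):

```python
# Toy model: a SIMD issues the same load for several wavefronts on
# consecutive cycles, so the memory latencies overlap instead of adding up.
MEM_LATENCY = 4  # hypothetical memory read latency, in cycles

def schedule(num_wavefronts):
    """Return (wavefront, issue cycle, data-ready cycle) for each wavefront,
    assuming one load is issued per cycle."""
    timeline = []
    for wv in range(num_wavefronts):
        issue_cycle = wv
        ready_cycle = issue_cycle + MEM_LATENCY
        timeline.append((wv, issue_cycle, ready_cycle))
    return timeline

for wv, issued, ready in schedule(4):
    print(f"WV{wv}: load issued at cycle {issued}, data ready at cycle {ready}")
```

Four loads finish by cycle 7 instead of cycle 16, which is the whole point of keeping several wavefronts in flight.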
I understand a wavefront does that. But is it the same with a work-group as well? If only one work-group is executed on a compute unit at a time, and that work-group consists of only one wavefront, then how does the GPU hide memory latency? By switching work-groups?
Let me clarify a few points.
- Work-items are processed in groups called wavefronts (WV). Wavefronts are like hardware threads: each has its own program counter and can run independently of the others.
- A CU can have one or more work-groups resident for processing. Each work-group is divided into one or more wavefronts, depending on the work-group size.
- Each CU consists of one or more SIMD units. Each SIMD executes one instruction (a VLIW instruction in the case of a VLIW architecture) from a wavefront at a time. For example, in GCN each CU has four SIMDs, so it can execute four wavefronts simultaneously.
- Each SIMD has a wavefront queue holding one or more in-flight wavefronts (possibly from different work-groups or even different kernels). For example, in GCN the queue length is 10, so at most 40 wavefronts can be in flight per CU.
- During wavefront scheduling, one wavefront is chosen from the queue according to some arbitration rules and assigned to the SIMD for execution. These in-flight wavefronts are what the hardware switches between to hide latency.
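The scheduling step above can be sketched as a round-robin pick from the queue. This is a simplified assumption, not the actual GCN arbitration logic, and all class and method names here are made up:

```python
from collections import deque

class Wavefront:
    """Toy wavefront: either ready, or stalled on a memory read for a while."""
    def __init__(self, name, stall_cycles=0):
        self.name = name
        self.stall = stall_cycles  # cycles left waiting on a memory read
        self.executed = 0

    def ready(self):
        if self.stall > 0:
            self.stall -= 1  # toy model: latency counts down each time polled
            return False
        return True

    def execute_one(self):
        self.executed += 1

class Simd:
    """Toy SIMD unit: holds in-flight wavefronts and, each cycle, picks the
    first one that is not waiting on memory."""
    def __init__(self, max_in_flight=10):  # 10 matches the GCN queue length
        self.queue = deque(maxlen=max_in_flight)

    def add_wavefront(self, wv):
        self.queue.append(wv)

    def step(self):
        """One scheduling decision: rotate past stalled wavefronts and run
        one instruction from the first ready one."""
        for _ in range(len(self.queue)):
            wv = self.queue[0]
            self.queue.rotate(-1)
            if wv.ready():
                wv.execute_one()
                return wv
        return None  # everything is stalled on memory this cycle

simd = Simd()
simd.add_wavefront(Wavefront("wv0", stall_cycles=2))  # waiting on memory
simd.add_wavefront(Wavefront("wv1"))                  # ready to run
print(simd.step().name)  # scheduler skips the stalled wv0 and runs wv1
```

With only one wavefront in the queue, `step()` would return `None` whenever that wavefront stalls, and the SIMD would sit idle; that is exactly the single-wavefront work-group situation the question asks about.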
Thank you. That cleared things up.