In OpenCL, a wavefront of 64 work-items is scheduled as a unit. Because all work-items in a wavefront execute in lock-step, if even one work-item is delayed (by a cache miss, for example), all the other work-items have to wait for it. What confuses me is this: in the actual scheduling process, only a quarter of the wavefront (i.e. 16 work-items) is issued to the GPU cores in one cycle, so the whole wavefront is executed over 4 consecutive cycles.
1) If one work-item in the first quarter is delayed, will the other three quarters all be delayed as well?
2) If only one work-item in the second quarter is delayed, will the first quarter be unaffected while the 3rd and 4th quarters are delayed?
Is that true on AMD GPUs?
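
To make the mapping I have in mind concrete, here is a minimal Python sketch of it. It only models which quarter (issue cycle) each work-item falls into; it does not model stalls, so it does not answer the questions above. The names `quarter_of` and `issue_schedule` are hypothetical, and the contiguous assignment (lanes 0-15 in the first quarter, 16-31 in the second, and so on) is my assumption, not something I have confirmed for AMD hardware.

```python
WAVEFRONT_SIZE = 64  # work-items per wavefront (as described above)
QUARTER_SIZE = 16    # work-items issued per cycle (as described above)


def quarter_of(lane):
    """Return the 0-based quarter (issue cycle) that a lane belongs to.

    Assumption: lanes are assigned to quarters contiguously by lane index.
    """
    if not 0 <= lane < WAVEFRONT_SIZE:
        raise ValueError("lane must be in [0, 64)")
    return lane // QUARTER_SIZE


def issue_schedule():
    """Return the lanes issued in each of the 4 consecutive cycles."""
    cycles = WAVEFRONT_SIZE // QUARTER_SIZE
    return [
        list(range(q * QUARTER_SIZE, (q + 1) * QUARTER_SIZE))
        for q in range(cycles)
    ]


if __name__ == "__main__":
    for cycle, lanes in enumerate(issue_schedule()):
        print(f"cycle {cycle}: lanes {lanes[0]}..{lanes[-1]}")
```

Under this assumption, question 1 is about a delayed lane with `quarter_of(lane) == 0`, and question 2 is about a delayed lane with `quarter_of(lane) == 1`.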