I was not able to find "16 processing elements/CU" on page 64. Can you please specify the correct page?
Regarding your confusion: each compute unit contains 16 stream cores. A single wavefront runs on a single compute unit, and at any particular time only a quarter-wavefront (16 work-items) is executing.
Each stream core contains 5 VLIW processing elements, which can execute 4 general + 1 transcendental independent instructions from the same work-item.
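For reference, the figures quoted so far can be put together in a small Python sketch. The constants come from the posts above (16 stream cores per compute unit, 5 VLIW processing elements per stream core, and a wavefront made of four 16-work-item quarters); the variable names are only illustrative and not part of any AMD or OpenCL API.

```python
# Figures quoted in this thread; names are illustrative, not an AMD API.
STREAM_CORES_PER_CU = 16   # stream cores in one compute unit
PES_PER_STREAM_CORE = 5    # VLIW lanes: 4 general + 1 transcendental
WAVEFRONT_SIZE = 64        # work-items per wavefront (4 quarters of 16)

# Total VLIW processing elements in one compute unit.
pes_per_cu = STREAM_CORES_PER_CU * PES_PER_STREAM_CORE

# A wavefront is issued one quarter (16 work-items) at a time,
# so one instruction takes this many cycles to cover all 64 work-items.
cycles_per_wavefront = WAVEFRONT_SIZE // STREAM_CORES_PER_CU

print(pes_per_cu)            # 80
print(cycles_per_wavefront)  # 4
```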
First and second row: "The GPU consists of multiple compute units. Each compute unit contains 32 kB local (on-chip) memory, L1 cache, registers, and 16 processing elements."
But after reading more I get how processing elements, stream cores and wavefronts are connected. It's just that the programming guide is a bit contradictory.
Now I have a question regarding wavefronts. If I have work-group size 128, I'll fill 2 whole wavefronts and utilize the device fully. But if I have work-group size 32, will 2 work-groups fill one wavefront? Or will the wavefront be half utilized?
If the second answer is 'yes', then that means that all work-group sizes smaller than 64 will result in bad usage of the device. Some algorithms, like AES, have a highest suitable work-group size of 16. Are these unsuitable by design to run on ATI cards with that big wavefront size?
Many thoughts, hopefully you have some answers to them. Thanks.
Actually, the OpenCL spec calls the entity running a work-item a processing element, but AMD calls it a stream core, so it becomes confusing sometimes.
Actually many workgroups can run on a single compute unit if they can fit in it.
So if a work-group has 128 work-items, a CU can execute two such work-groups if other resources permit. The implementation always tries to allocate 4 work-items to each stream core to hide latencies.
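A rough way to picture that allocation: the sketch below assumes work-items are assigned to wavefronts, quarter-wavefronts and stream cores in simple linear order of their local ID. That ordering is an assumption made purely for illustration (the hardware's actual mapping may differ), and `place_work_item` is a hypothetical helper, not part of any OpenCL or AMD API.

```python
from collections import Counter

STREAM_CORES_PER_CU = 16  # as stated earlier in the thread
WAVEFRONT_SIZE = 64       # work-items per wavefront

def place_work_item(local_id):
    """Return (wavefront, cycle, stream_core) for a work-item.

    Assumes a simple linear mapping by local ID (illustrative only).
    """
    wavefront = local_id // WAVEFRONT_SIZE
    slot = local_id % WAVEFRONT_SIZE
    cycle = slot // STREAM_CORES_PER_CU       # which quarter-wavefront
    stream_core = slot % STREAM_CORES_PER_CU  # which of the 16 cores
    return wavefront, cycle, stream_core

# Work-group of 128 work-items: count work-items per (wavefront, core).
load = Counter()
for lid in range(128):
    wf, _, core = place_work_item(lid)
    load[(wf, core)] += 1

print(place_work_item(70))   # (1, 0, 6)
print(set(load.values()))    # {4}
```

Under this assumed mapping, every stream core handles exactly 4 work-items of each wavefront, one per cycle.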
Thanks for replying. Yes, many work-groups can reside on one CU. But the question is about wavefronts, since each wavefront occupies the CU for 4 cycles.
Can more than 1 work-group fill a wavefront?
Work-group size 64: all 4 cycles of the wavefront are filled.
Work-group size 128: all 8 cycles of 2 wavefronts are filled.
Work-group size 32: 2 cycles of the wavefront are unused? or 2 work-groups in the wavefront?
Work-group size 1: the CU is occupied 4 cycles but only 1 stream core does work for 1 cycle? or 64 work-groups in the wavefront?
If only the same work-group can fill a wavefront, all implementations with work-group size <64 are unable to fully utilize the hardware. Or have I got it wrong?
You are right.
A CU is not fully utilized if the work-group size is less than 64.
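The conclusion can be checked with a little arithmetic. This is a minimal sketch that assumes, as confirmed above, that a wavefront is never shared between work-groups and that the wavefront size is 64; `wavefront_utilization` is a hypothetical helper name.

```python
WAVEFRONT_SIZE = 64  # work-items per wavefront, as discussed above

def wavefront_utilization(work_group_size):
    """Return (wavefronts needed, fraction of wavefront slots doing work)."""
    # Ceiling division: a partly filled wavefront still occupies a full one.
    wavefronts = -(-work_group_size // WAVEFRONT_SIZE)
    used = work_group_size / (wavefronts * WAVEFRONT_SIZE)
    return wavefronts, used

for size in (1, 16, 32, 64, 128):
    n, used = wavefront_utilization(size)
    print(f"work-group size {size:3d}: {n} wavefront(s), {used:.1%} utilized")
```

Under these assumptions a work-group of 16 (the AES case mentioned earlier) keeps only a quarter of its wavefront busy, while sizes that are multiples of 64 fill every slot.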