Hi.
Another post from me. Hopefully someone can give me answers.
In the ATI_Stream_SDK_OpenCL_Programming_Guide.pdf it states on pages 15-16 that each Compute Unit has 16 Stream Cores with 5 Processing Elements each, i.e. 80 processing elements per compute unit. But on page 64 it says there are 16 processing elements per compute unit. Which part of the ATI Stream architecture is considered an OpenCL Processing Element?
Is there, or will there be, a tool like Stream Kernel Analyser for Linux?
Kind regards.
hi eklund.n,
I was not able to find 16 processing elements/CU on page 64. Can you please specify the correct page?
As for your confusion: each compute unit contains 16 stream cores. A single wavefront runs on a single compute unit, so at any particular time only a quarter-wavefront (16 work-items) is executing.
Each stream core contains 5 VLIW processing elements, which can execute 4 general + 1 transcendental independent instructions from the same work-item.
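The counts above can be checked with a few lines of arithmetic. This is only a sketch: it uses the figures quoted in this thread plus an assumed wavefront size of 64 work-items, and the constant names are made up for illustration, not taken from the SDK.

```python
# Figures quoted in this thread; the names are illustrative, not SDK identifiers.
STREAM_CORES_PER_CU = 16
VLIW_PES_PER_STREAM_CORE = 5   # 4 general + 1 transcendental
WAVEFRONT_SIZE = 64            # assumption: work-items per wavefront

# VLIW processing elements per compute unit (the "80" from pages 15-16).
alus_per_cu = STREAM_CORES_PER_CU * VLIW_PES_PER_STREAM_CORE

# Work-items issued per cycle: one per stream core (a quarter-wavefront).
work_items_per_cycle = STREAM_CORES_PER_CU

# Cycles needed to issue one instruction for a whole wavefront.
cycles_per_wavefront = WAVEFRONT_SIZE // work_items_per_cycle

print(alus_per_cu, work_items_per_cycle, cycles_per_wavefront)
```

So both numbers in the guide can be read as consistent: 16 counts the units that each run one work-item, while 80 counts the VLIW lanes inside them.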
First and second row: "The GPU consists of multiple compute units. Each compute unit contains 32 kB local (on-chip) memory, L1 cache, registers, and 16 processing elements."
hi eklund.n
Actually, the OpenCL spec calls the entity that runs a work-item a processing element, but AMD calls it a stream core, so it gets confusing sometimes.
Also, many work-groups can run on a single compute unit if they fit in it.
So if a work-group has 128 work-items, a CU can execute two such work-groups if other resources permit. The implementation always tries to allocate 4 work-items to each stream core to hide latencies.
Thanks for replying. Yes, many work-groups can reside on one CU. But the question is about wavefronts, since each wavefront occupies the CU for 4 cycles.
Can more than 1 work-group fill a wavefront?
Example:
Work-group size 64: all 4 cycles of the wavefront are filled.
Work-group size 128: all 8 cycles of 2 wavefronts are filled.
Work-group size 32: 2 cycles of the wavefront are unused? Or are there 2 work-groups in the wavefront?
Work-group size 1: the CU is occupied for 4 cycles but only 1 stream core does work for 1 cycle? Or are there 64 work-groups in the wavefront?
If only one work-group can fill a wavefront, all implementations with work-group size < 64 are unable to fully utilize the hardware. Or have I got it wrong?
You are right.
CUs are not fully utilized if the work-group size is < 64.
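For anyone who wants to plug in their own work-group sizes, here is a rough sketch of the reasoning in this thread. It assumes a 64-work-item wavefront issued over 16 stream cores, that a wavefront never mixes work-groups, and that a partly filled wavefront still occupies all 4 cycles. The function name is hypothetical, and real occupancy also depends on registers and local memory, which this ignores.

```python
import math

WAVEFRONT_SIZE = 64        # work-items per wavefront, as discussed in this thread
STREAM_CORES_PER_CU = 16   # work-items issued per cycle

def wavefront_utilization(work_group_size):
    """Return (wavefronts, issue cycles, fraction of lanes doing useful work)."""
    # A wavefront holds work-items from only one work-group, so round up.
    wavefronts = math.ceil(work_group_size / WAVEFRONT_SIZE)
    # Assumption: every wavefront takes the full 4 cycles, even if partly empty.
    cycles = wavefronts * (WAVEFRONT_SIZE // STREAM_CORES_PER_CU)
    utilization = work_group_size / (wavefronts * WAVEFRONT_SIZE)
    return wavefronts, cycles, utilization

# The four cases from the question above.
for size in (64, 128, 32, 1):
    print(size, wavefront_utilization(size))
```

Under these assumptions, sizes 64 and 128 fill every lane, size 32 fills half of them, and size 1 fills 1 lane out of 64.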