Archives Discussions

eklund_n · ‎09-28-2010

Hi.

Another post from me. Hopefully someone can give me answers.

In the ATI_Stream_SDK_OpenCL_Programming_Guide.pdf it states on pages 15-16 that each Compute Unit has 16 Stream Cores with 5 Processing Elements each, i.e. 80 processing elements per compute unit. But on page 64 it states that there are 16 processing elements per compute unit. Which part of the ATI Stream architecture is considered an OpenCL Processing Element?

Is there, or will there be, a tool like Stream Kernel Analyser for Linux?

Kind regards.

himanshu_gautam · ‎09-28-2010

hi eklund.n,

I was not able to find 16 processing elements/CU on pg. 64. Can you please specify the correct page?

As for your confusion: each compute unit contains 16 stream cores. A single wavefront runs on a single compute unit, so at any particular time only a quarter-wavefront (16 work-items) is running.

Each stream core contains 5 VLIW processing elements, which can execute 4 general + 1 transcendental independent instructions from the same work-item.
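To make the two counts from the guide concrete, here is a rough Python sketch of the arithmetic. The 16 stream cores per compute unit and 5 processing elements per stream core are the figures quoted in this thread; the wavefront size of 64 is an assumption for the hardware being discussed, and the constant names are purely illustrative.

```python
# Figures quoted in this thread; the names are illustrative only.
STREAM_CORES_PER_CU = 16   # 16 work-items execute per cycle (a quarter-wavefront)
PES_PER_STREAM_CORE = 5    # VLIW lanes: 4 general + 1 transcendental
WAVEFRONT_SIZE = 64        # assumed wavefront size for this hardware

# "80 processing elements" counts the VLIW lanes of every stream core.
vliw_pes_per_cu = STREAM_CORES_PER_CU * PES_PER_STREAM_CORE

# "16 processing elements" counts the units that each run one work-item.
work_items_per_cycle = STREAM_CORES_PER_CU

# A full wavefront is issued as quarter-wavefronts over several cycles.
cycles_per_wavefront = WAVEFRONT_SIZE // work_items_per_cycle

print(vliw_pes_per_cu)       # 80
print(work_items_per_cycle)  # 16
print(cycles_per_wavefront)  # 4
```

Seen this way, both numbers in the guide can be right: 80 counts VLIW lanes, while 16 counts the units that each execute one work-item.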

eklund_n · ‎09-29-2010

First and second row: "The GPU consists of multiple compute units. Each compute unit contains 32 kB local (on-chip) memory, L1 cache, registers, and 16 processing elements. "

But after reading more I understand how processing elements, stream cores and wavefronts are connected. It's just that the programming guide is a bit contradictory.

Now I have a question regarding wavefronts. If I have work-group size 128 I'll fill 2 whole wavefronts and utilize the device fully. But if I have work-group size 32, will 2 work-groups fill one wavefront? Or will the wavefront be half utilized?

If the second answer is 'yes', then all work-group sizes smaller than 64 will result in poor usage of the device. Some algorithms, like AES, have a highest suitable work-group size of 16. Are these unsuitable by design to run on ATI cards with such a big wavefront size?

Many thoughts, hopefully you got some answers to them. Thanks.

himanshu_gautam · ‎09-30-2010

hi eklund.n

Actually, the OpenCL spec calls the entity running a work-item a processing element, but AMD calls it a stream core, so it gets confusing sometimes.

Actually many workgroups can run on a single compute unit if they can fit in it.

So if a work-group has 128 work-items, a CU can execute two such work-groups if other resources permit. The implementation always tries to allocate 4 work-items to each stream core to hide latencies.

eklund_n · ‎09-30-2010

Thanks for replying. Yes, many work-groups can reside on one CU. But the question is about wavefronts, since each wavefront occupies the CU for 4 cycles.

Can more than 1 work-group fill a wavefront?

Example:

Work-group size 64: all 4 cycles of the wavefront are filled.

Work-group size 128: all 8 cycles of 2 wavefronts are filled.

Work-group size 32: 2 cycles of the wavefront are unused? or 2 work-groups in the wavefront?

Work-group size 1: the CU is occupied 4 cycles but only 1 stream core does work for 1 cycle? or 64 work-groups in the wavefront?

If only work-items from the same work-group can fill a wavefront, all implementations with work-group size <64 are unable to fully utilize the hardware. Or have I got it wrong?

himanshu_gautam · ‎10-01-2010

You are right.

CUs are not utilized completely if the work-group size is <64.
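As a rough illustration of the examples discussed above, the following Python sketch estimates how full the wavefronts of one work-group are. It assumes a wavefront size of 64 and that a wavefront only ever holds work-items from a single work-group, as confirmed in this thread; the function names are hypothetical and not part of any AMD or OpenCL API.

```python
import math

WAVEFRONT_SIZE = 64  # assumed wavefront size for the hardware in this thread


def wavefronts_needed(work_group_size, wavefront_size=WAVEFRONT_SIZE):
    """Number of wavefronts one work-group occupies (never shared between groups)."""
    return math.ceil(work_group_size / wavefront_size)


def wavefront_utilization(work_group_size, wavefront_size=WAVEFRONT_SIZE):
    """Fraction of wavefront slots that carry a real work-item."""
    slots = wavefronts_needed(work_group_size, wavefront_size) * wavefront_size
    return work_group_size / slots


if __name__ == "__main__":
    for size in (128, 64, 32, 16, 1):
        print(size, wavefronts_needed(size), wavefront_utilization(size))
```

Under these assumptions, work-group sizes of 64 and 128 fill their wavefronts completely, a size of 32 fills half of one wavefront, and a size of 16 fills a quarter. This only models slot occupancy; it says nothing about latency hiding or other resource limits.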


Programming guide and kernel analyzer