5 Replies Latest reply on Oct 1, 2010 6:24 AM by himanshu.gautam

    Programming guide and kernel analyzer

    eklund.n

      Hi.

      Another post from me. Hopefully someone can give me answers.

      In ATI_Stream_SDK_OpenCL_Programming_Guide.pdf, pages 15-16 state that each Compute Unit has 16 Stream Cores with 5 Processing Elements each, i.e. 80 processing elements per compute unit. But page 64 states that there are 16 processing elements per compute unit. Which part of the ATI Stream architecture is considered an OpenCL Processing Element?

      Is there, or will there be, a tool like Stream Kernel Analyser for Linux?

      Kind regards.

        • Programming guide and kernel analyzer
          himanshu.gautam

          hi eklund.n,

          I was not able to find 16 processing elements/CU on page 64. Can you please specify the correct page?

          As for your confusion: each compute unit contains 16 stream cores. A single wavefront runs on a single compute unit, so at any given time only a quarter-wavefront (16 work-items) is executing.

          Each stream core contains 5 VLIW processing elements, which can execute 4 general + 1 transcendental independent instructions from the same work-item.
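
The counts in this reply can be checked with a little arithmetic. A minimal sketch in Python, using only numbers quoted in this thread (16 stream cores per compute unit, 5 processing elements per stream core, and a wavefront of 64 work-items); the constant names are just illustrative and not taken from any SDK:

```python
# Figures quoted in the thread; the names are illustrative, not an SDK API.
STREAM_CORES_PER_CU = 16   # stream cores per compute unit
PES_PER_STREAM_CORE = 5    # VLIW processing elements per stream core
WAVEFRONT_SIZE = 64        # work-items per wavefront

# Total VLIW processing elements in one compute unit.
pes_per_cu = STREAM_CORES_PER_CU * PES_PER_STREAM_CORE

# One work-item per stream core per cycle, so a wavefront takes
# WAVEFRONT_SIZE / STREAM_CORES_PER_CU cycles to issue one instruction.
work_items_per_cycle = STREAM_CORES_PER_CU
cycles_per_wavefront = WAVEFRONT_SIZE // work_items_per_cycle

print("processing elements per CU:", pes_per_cu)
print("work-items executing per cycle:", work_items_per_cycle)
print("cycles per wavefront:", cycles_per_wavefront)
```

This would explain how both "80" and "16" can appear in the guide: 80 counts the VLIW lanes, 16 counts the units that each run one work-item at a time.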

            • Programming guide and kernel analyzer
              eklund.n

              First and second row: "The GPU consists of multiple compute units. Each compute unit contains 32 kB local (on-chip) memory, L1 cache, registers, and 16 processing elements. "

              But after reading more I get how processing elements, stream cores and wavefronts are connected. It's just that the programming guide is a bit contradictory.
              Now I have a question regarding wavefronts. If I have work-group size 128 I'll fill 2 whole wavefronts and utilize the device fully. But if I have work-group size 32, will 2 work-groups fill one wavefront? Or will the wavefront be half utilized?
              If the second answer is 'yes', then all work-group sizes smaller than 64 will result in poor usage of the device. Some algorithms, like AES, have a highest suitable work-group size of 16. Are these unsuitable by design to run on ATI cards with such a big wavefront size?
              Many thoughts, hopefully you got some answers to them. Thanks.
                • Programming guide and kernel analyzer
                  himanshu.gautam

                  hi eklund.n

                  Actually, in the OpenCL spec the entity running a work-item is called a processing element, but AMD calls it a stream core. So it becomes confusing sometimes.

                  Actually many workgroups can run on a single compute unit if they can fit in it.

                  So if a work-group has 128 work-items, a CU can execute two such work-groups if other resources permit. The implementation always tries to allocate 4 work-items to each stream core to hide latencies.
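
One of the "other resources" is presumably local memory. A rough sketch, assuming the 32 kB of local memory per compute unit quoted earlier from the guide is the only limit; registers and any other limits are ignored here, and the function name and the example sizes are made up for illustration:

```python
# Local memory per compute unit, as quoted from the programming guide.
LOCAL_MEM_PER_CU_BYTES = 32 * 1024


def max_work_groups_by_local_mem(local_mem_per_group_bytes):
    """Upper bound on resident work-groups per CU from local memory alone.

    Hypothetical helper: real occupancy also depends on registers and
    other limits that are not modelled here.
    """
    if local_mem_per_group_bytes <= 0:
        raise ValueError("local memory per work-group must be positive")
    return LOCAL_MEM_PER_CU_BYTES // local_mem_per_group_bytes


# Illustrative sizes only: 8 kB and 20 kB of local memory per work-group.
print(max_work_groups_by_local_mem(8 * 1024))
print(max_work_groups_by_local_mem(20 * 1024))
```

Under this assumption, a kernel using 20 kB of local memory per work-group could not have two work-groups resident on one CU, whatever the work-group size.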

                    • Programming guide and kernel analyzer
                      eklund.n

                      Thanks for replying. Yes, many work-groups can reside on one CU. But the question is about wavefronts, since each wavefront occupies the CU for 4 cycles.

                       

                      Can more than 1 work-group fill a wavefront?

                       

                      Example:

                      Work-group size 64: all 4 cycles of the wavefront are filled.

                      Work-group size 128: all 8 cycles of 2 wavefronts are filled.

                      Work-group size 32: 2 cycles of the wavefront are unused? or 2 work-groups in the wavefront?

                      Work-group size 1: the CU is occupied 4 cycles but only 1 stream core does work for 1 cycle? or 64 work-groups in the wavefront?

                       

                      If only work-items from the same work-group can fill a wavefront, all implementations with work-group size <64 are unable to fully utilize the hardware. Or have I got it wrong?
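
To put numbers on the examples above, here is a small Python sketch. It assumes a wavefront of 64 work-items and, as the unconfirmed premise of the question, that a wavefront never mixes work-items from different work-groups. The function name is hypothetical:

```python
import math

WAVEFRONT_SIZE = 64  # work-items per wavefront, as discussed in the thread


def wavefront_utilization(work_group_size, wavefront_size=WAVEFRONT_SIZE):
    """Return (wavefronts needed, fraction of wavefront slots doing work).

    Assumes each wavefront holds work-items from one work-group only.
    """
    if work_group_size < 1:
        raise ValueError("work-group size must be at least 1")
    wavefronts = math.ceil(work_group_size / wavefront_size)
    utilization = work_group_size / (wavefronts * wavefront_size)
    return wavefronts, utilization


for size in (128, 64, 32, 16, 1):
    count, used = wavefront_utilization(size)
    print(f"work-group size {size:3d}: {count} wavefront(s), {used:.1%} of slots used")
```

If the assumption holds, work-group sizes 32 and 16 would leave half and three quarters of each wavefront idle, and size 1 would use a single slot out of 64.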