2 Replies Latest reply on Apr 1, 2013 2:01 AM by himanshu.gautam

    help with work-group/item and wavefront

    via0517

      Hi

      I am a beginner with OpenCL and GPU, and have some questions:

      1. What is the relationship between different work-groups? Are they executed one after another, or sometimes concurrently?

      2. I know a wavefront has 32 or 64 work-items, but what essentially is a wavefront? Does a wavefront always execute these work-items synchronously?

      3. Here is an example about time complexity: I have 10 work-groups, each with 32 work-items, and every work-item runs in O(n). What is the total time complexity?

      Is it O(10*n) or still O(n)? And what if every group has 256 work-items?

       

      Thanks

        • help with work-group/item and wavefront
          nou

          1. Both. Each compute unit gets one or more work-groups and executes them in parallel. If there are more work-groups than the compute units can hold, the extra ones are executed sequentially.

          2. A wavefront is the hardware equivalent of a CPU thread. That means work-items can diverge only at wavefront granularity.

          3. In big-O notation, constant factors are usually omitted. If you want to estimate running time on real hardware, the best approach is to work out the sequential time complexity and then divide by the achievable parallelism factor.

          • Re: help with work-group/item and wavefront
            himanshu.gautam

            1. Different workgroups are independent of each other. Software cannot synchronize multiple work-groups.

                Workgroups are usually executed in parallel by multiple compute units.

                A workgroup is always tied to a CU from its birth to death.

                A single CU can execute multiple workgroups, provided the resource constraints allow (registers, local memory, hardware-enforced limits, etc.)

                The total number of workgroups that simultaneously execute on a CU determines what is called "Occupancy".

                Lower occupancy -- worse performance (because you cannot hide memory access latencies)

                Higher occupancy -- no performance guarantee (because the bottleneck could be elsewhere, say wavefront divergence)

             

            2. A wavefront is a group of 64 workitems on AMD platforms. Say a workgroup has 256 workitems; they are arranged as 4 wavefronts:

               0--63, 64--127, 128--191, 192--255

            Workitems inside a wavefront are always executed together; they all execute the same instruction at the same time.

              Conditional statements based on "get_local_id()" can cause workitems in a wavefront to diverge. This is a costly operation on the hardware, and such statements inside FOR loops are detrimental to performance.

               Hardware makes sure that divergent workitems re-converge at the earliest available opportunity in the code.

             

            3. Big-O analysis of the parallel algorithm is what you should look for. Usually, as far as I know, plain big-O notation is used only for sequential algorithms.