1 Reply Latest reply on Mar 27, 2012 4:53 PM by notzed

    Simple Reductions algorithm clarification


      Hello, guys.


      I have been studying some parallel algorithms from the database here in the past couple of days.

      In one specific - The two-stage parallel reduction - there is something that is just slipping away from me.



      A preferred number of work-groups is given, but nothing is said about the number of work-items in a work-group. How many is the optimal or what is the logic?


      So just to be clear with my understanding of the execution model concerning this I have a couple of blitz questions:

      1) A Processing Element actually is the unit that executes instructions?

      2) A Compute unit is an artificial abstraction of the hardware?

      3) There cannot be more work-items in a work-group than are the number of Processing Elements on the device?


      A greatly appreciate any help in advance.


      Best regards,

      Dilyan Dokov.

        • Re: Simple Reductions algorithm clarification

          Compute unit and processing element are opencl-standard names, they are defined in the second chapter of the opencl specification.


          A PE is effectively an ALU: e.g. a thread, or a 'vector lane'.  The definition gives a little lee-way to provide for instruction-level parallelism e.g. using VLIW, but the distinction isn't terribly important for understanding parallel algorithms and you can treat it more or less as a single functional unit (e.g. in the amd doco they call this a stream-core).  (the PE number is about peak device throughput but has little bearing on parallel algorithms)


          A CU is the physical device which runs one (or more) work groups (concurrently).


          As for 3, no, there can be more work-items than processing elements: in-fact that this highly desirable and a big feature of all GPU designs.  By having multiple in-flight threads on the same CU, PE utilisation can be increased by hiding memory latencies.  It's like 'hyper-threading' on a much bigger scale.  Typically possible are 20-40 concurrent hardware threads.


          e.g. looking at the AMD APP programmng guide appendix D, and the HD 6950 which I have:

          it has 22 compute units, 22x16 'stream cores' (AMD specific terminology, each stream core executes a single thread, it takes 4 clock cycles to dispatch one instruction: hence 64 work items are required to fully populate the time-slice), and then 1408 PE's - because it has 4-way VLIW and each VLIW ALU is counted separately.


          Here i'm a little less certain: If you look at the global limits it allows for 23 concurrent wave-fronts per CU, and that number depends on the work-group size (and resource requirements).  A wavefront is 64 work items, so a 128-wide work-group would need 2 (I think).


          The reason the article might not mention the optimal work-group topology is because it depends on the hardware.  For AMD hardware, it should be a multiple of 64, and often 64 is a pretty good sweet spot as the work-item synchronisation (barrier) is much more efficient, and the lower resource requirements (local, registers etc) allows more work-groups to run concurrently on the same CU.


          Note that for algorithm choices, the numbers work out roughly the same on nvidia hardware or GCN hardware as well so the precise hardware details aren't (usually!) too important: everything's been optimised for the same types of work-load and the programming models are so similar.


          See also chapter 1.2 of the amd app programming guide.