I have been studying some of the parallel algorithms from the database here over the past couple of days.
In one of them, the two-stage parallel reduction, there is something I can't quite grasp.
A preferred number of work-groups is given, but nothing is said about the number of work-items in a work-group. What is the optimal number, or what is the logic for choosing it?
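To make the question concrete, here is how I currently picture the two stages, written as a plain sequential Python simulation rather than a real kernel. This is only a sketch of my understanding: the names `two_stage_reduce`, `num_groups` and `local_size` are my own and not taken from the database entry, and I am assuming a sum reduction with a power-of-two work-group size.

```python
def two_stage_reduce(data, num_groups, local_size):
    """Sequentially simulate a two-stage sum reduction (hypothetical sketch)."""
    # Assumption: local_size is a power of two so the halving tree below works.
    assert local_size > 0 and local_size & (local_size - 1) == 0
    global_size = num_groups * local_size

    partials = []
    for group_id in range(num_groups):
        # Stage 1a: every work-item sums a strided slice of the input
        # into its own slot of the group's scratch ("local") memory.
        scratch = []
        for local_id in range(local_size):
            global_id = group_id * local_size + local_id
            scratch.append(sum(data[global_id::global_size]))

        # Stage 1b: tree reduction inside the work-group, halving the
        # number of active work-items on each step.
        stride = local_size // 2
        while stride > 0:
            for local_id in range(stride):
                scratch[local_id] += scratch[local_id + stride]
            stride //= 2

        # Each work-group writes out exactly one partial result.
        partials.append(scratch[0])

    # Stage 2: reduce the num_groups partial results to the final value.
    return sum(partials)


if __name__ == "__main__":
    values = list(range(1000))
    print(two_stage_reduce(values, num_groups=4, local_size=64))
```

In this simulation the result is the same for any valid `local_size`, which is exactly why I cannot tell from the algorithm alone which value to pick on real hardware.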
So, just to check my understanding of the execution model, I have a few quick questions:
1) Is a Processing Element the unit that actually executes instructions?
2) Is a Compute Unit an artificial abstraction of the hardware?
3) Is it true that a work-group cannot contain more work-items than there are Processing Elements on the device?
I greatly appreciate any help. Thanks in advance.