1. Both. Each compute unit gets one or more work-groups, and those work-groups execute in parallel. If there are more work-groups than the compute units can hold, the remainder are executed sequentially as resources free up.
2. A wavefront is the hardware equivalent of a CPU thread. That means work-items can diverge only at wavefront granularity.
3. In big-O notation, constant factors are usually omitted. So if you want to estimate the running time on real hardware, the best approach is to work out the time complexity of the sequential computation and then divide it by the achievable parallelism factor.
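As a rough sketch of that rule of thumb (all numbers here are made up for illustration, not measured from any real device):

```python
# Rough running-time estimate: sequential work divided by the
# achievable parallelism factor.  Every number is hypothetical.

def estimated_parallel_time(n_ops, ops_per_second, parallelism):
    """Sequential work (n_ops) executed at ops_per_second,
    spread across `parallelism` lanes that are perfectly utilized."""
    sequential_time = n_ops / ops_per_second
    return sequential_time / parallelism

# e.g. 1e9 operations at a 1 GHz scalar rate,
# on 16 CUs each running a 64-wide wavefront:
t = estimated_parallel_time(1e9, 1e9, 64 * 16)
```

Real hardware never reaches the ideal factor (memory latency, divergence, occupancy limits all eat into it), so treat this as an upper bound on the speedup.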
1. Different workgroups are independent of each other. Software cannot synchronize multiple work-groups.
Workgroups are usually executed in parallel by multiple compute units.
A workgroup is always tied to a CU from its birth to death.
A single CU can execute multiple workgroups concurrently, provided the resource constraints allow it (registers, local memory, hardware-enforced limits, etc.).
The total number of workgroups that simultaneously execute on a CU determines what is called "Occupancy".
Lower occupancy -- worse performance (you cannot hide memory access latencies).
Higher occupancy -- no performance guarantee (the bottleneck could be elsewhere, say wavefront divergence).
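To make the resource-constraint point concrete, here is a small sketch of how the tightest limit caps concurrent workgroups on a CU. The per-CU limits below (register file size, LDS size, hardware workgroup cap) are invented for illustration; real values are device-specific:

```python
# Occupancy sketch: how many workgroups fit on one CU at once.
# All per-CU limits here are hypothetical, not real AMD numbers.

def workgroups_per_cu(regs_per_wg, lds_per_wg,
                      cu_regs=65536, cu_lds=65536, hw_max_wgs=16):
    """Concurrent workgroups per CU = the tightest of the
    register, local-memory (LDS) and hardware limits."""
    by_regs = cu_regs // regs_per_wg
    by_lds = cu_lds // lds_per_wg
    return min(by_regs, by_lds, hw_max_wgs)

# A workgroup needing 8192 registers and 16 KiB of LDS:
# registers would allow 8, but LDS caps it at 4 on this imaginary CU.
n = workgroups_per_cu(regs_per_wg=8192, lds_per_wg=16384)
```

The practical takeaway: shrinking whichever resource is the binding constraint (often LDS or registers) is what raises occupancy; shrinking the others does nothing.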
2. A wavefront is a group of 64 workitems on AMD platforms. Say a workgroup has 256 workitems; they are arranged as 4 wavefronts:
0--63, 64--127, 128--191, 192--255
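The mapping from local id to wavefront can be sketched like this (assuming the 64-wide AMD wavefront; other vendors use different widths, e.g. 32 on NVIDIA, where the unit is called a warp):

```python
WAVEFRONT_SIZE = 64  # AMD wavefront width (hypothetical fixed value)

def wavefront_of(local_id):
    """Which wavefront a work-item with this local id belongs to."""
    return local_id // WAVEFRONT_SIZE

# For a 256-workitem workgroup this yields wavefronts 0..3,
# covering local ids 0-63, 64-127, 128-191, 192-255.
```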
Workitems inside a wavefront are always executed together: they all execute the same instruction at the same time.
Conditional statements based on "get_local_id()" can cause workitems in a wavefront to diverge. This is a costly operation on the hardware, and such statements inside FOR loops are particularly detrimental to performance.
Hardware makes sure that divergent workitems re-combine at the earliest available opportunity in the code.
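A toy model of why divergence is costly: when the lanes of a wavefront disagree on a branch, the hardware executes both sides, masking off the inactive lanes, so the branch costs the sum of both paths. This Python sketch only simulates that lockstep behaviour; it is not how you would write the kernel itself:

```python
# Toy SIMT divergence model: count how many branch bodies the whole
# wavefront must step through for a given per-lane condition.

def run_wavefront(local_ids, then_fn, else_fn, cond):
    """Execute an if/else over a wavefront in lockstep.
    Returns (results, passes): `passes` is how many branch bodies
    the wavefront executed (1 if uniform, 2 if divergent)."""
    take_then = [cond(i) for i in local_ids]
    passes = (1 if any(take_then) else 0) + (1 if not all(take_then) else 0)
    results = [then_fn(i) if t else else_fn(i)
               for i, t in zip(local_ids, take_then)]
    return results, passes

# Uniform condition: one pass.  Per-lane condition: two passes.
_, uniform = run_wavefront(range(64), lambda i: i, lambda i: -i,
                           lambda i: True)
_, divergent = run_wavefront(range(64), lambda i: i, lambda i: -i,
                             lambda i: i % 2 == 0)
```

Note that a condition which is uniform across the wavefront (e.g. branching on the workgroup id, or on `get_local_id() < 64` with 64-wide wavefronts) costs only one pass, which is why restructuring code so that whole wavefronts take the same path is a standard optimization.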
3. Big-O for parallel algorithms is what you should look for. As far as I know, big-O notation is usually applied only to sequential algorithms.
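One common way to reason about parallel complexity is to separate total work from parallel depth: for example, a tree reduction of n items does O(n) work but needs only O(log n) parallel rounds. A small sketch counting those rounds (plain Python, pairwise summation standing in for what each round of wavefronts would do in parallel):

```python
def tree_reduce(xs):
    """Pairwise (tree) reduction.
    Returns (total, rounds): `rounds` is the number of parallel
    steps, since each round's pair-sums are independent."""
    xs = list(xs)
    rounds = 0
    while len(xs) > 1:
        # One round: all adjacent pairs combined "in parallel".
        pairs = [xs[i] + xs[i + 1] for i in range(0, len(xs) - 1, 2)]
        if len(xs) % 2:          # odd leftover carries to next round
            pairs.append(xs[-1])
        xs = pairs
        rounds += 1
    return xs[0], rounds

total, rounds = tree_reduce(range(256))  # 256 items -> 8 rounds
```

So a sequential sum is O(n) steps, while the tree version finishes in O(log n) rounds given enough parallel lanes, which is the kind of distinction plain sequential big-O hides.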