1. Different workgroups are independent of each other. Software cannot synchronize multiple work-groups.
Workgroups are usually executed in parallel by multiple compute units.
A workgroup is always tied to a CU from its birth to death.
A single CU can executed more workgroups provided the resource constraints allow (register, local memory, + hardware enforced limts etc..)
The total number of workgroups tht simultaneously execute on a CU determine what is called "Occupancy".
Lesser the occupancy -- Worse is your performance (because you cannot hide memory access latencies)
Higher the occupancy -- No performance guarantee (because bottlenecks could be elsewhere - say wavefront divergence)
2. Wavefront is a group of 64 workitems on AMD platform. Say , workgroup has 256 workitems. They are arranged as 4 wavefronts
0--63, 64--127, 128--191, 192--255
Workitems inside a wavefront are always executed together. They all execute the same instruction at sametime.
Conditional Statements based on "get_local_id()" can cause workitems in a wavefront to diverge. This is a costly operation on hardware. Such statements inside FOR loops are detrimental to performance.
Hardware makes sure that divergent workitems re-combine at the earliest available opportunity in the code.
3. O(n) notation for parallel algorithms is what you should look for. Usually, as far as I know, the big-O notation is used only for sequential algorithm.