"Compute unit" and "processing element" are standard OpenCL terms; they are defined in chapter 2 of the OpenCL specification.
A PE is effectively an ALU: e.g. a thread, or a 'vector lane'. The definition gives a little leeway to allow for instruction-level parallelism (e.g. VLIW), but the distinction isn't terribly important for understanding parallel algorithms, and you can treat it more or less as a single functional unit (in the AMD documentation this is called a stream core). The PE count matters for peak device throughput but has little bearing on parallel algorithm design.
A CU is the physical device which runs one (or more) work groups (concurrently).
As for 3, no, there can be more work-items than processing elements: in fact this is highly desirable and a big feature of all GPU designs. By having multiple in-flight threads on the same CU, PE utilisation can be increased by hiding memory latencies. It's like 'hyper-threading' on a much bigger scale. Typically 20-40 concurrent hardware threads per CU are possible.
e.g. looking at the AMD APP programming guide, appendix D, and the HD 6950 which I have:
it has 22 compute units and 22x16 'stream cores' (AMD-specific terminology: each stream core executes a single thread, and it takes 4 clock cycles to dispatch one instruction, hence 64 work-items are required to fully populate the time-slice), and then 1408 PEs, because it has 4-way VLIW and each VLIW ALU lane is counted separately.
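To make the arithmetic above concrete, here's a small sketch plugging in the HD 6950 numbers (the constants are taken from the paragraph above; the function names are just for illustration):

```c
#include <assert.h>

/* HD 6950 figures quoted above. */
enum {
    COMPUTE_UNITS   = 22,
    STREAM_CORES    = 16,  /* per compute unit */
    VLIW_WIDTH      = 4,   /* ALU lanes per stream core */
    DISPATCH_CYCLES = 4    /* clock cycles to dispatch one instruction */
};

/* Total processing elements: each VLIW lane is counted separately. */
int total_pes(void)
{
    return COMPUTE_UNITS * STREAM_CORES * VLIW_WIDTH;  /* 1408 */
}

/* Work-items needed to fill one CU's time-slice (one wavefront). */
int wavefront_size(void)
{
    return STREAM_CORES * DISPATCH_CYCLES;  /* 64 */
}
```

This is why 64 keeps coming up below: 16 stream cores times a 4-cycle dispatch slot is one 64-wide wavefront.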
Here I'm a little less certain: if you look at the global limits it allows for 23 concurrent wavefronts per CU, and that number depends on the work-group size (and resource requirements). A wavefront is 64 work-items, so a 128-wide work-group would need 2 (I think).
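The wavefront count per work-group is just a round-up division; a minimal sketch, assuming AMD's 64-work-item wavefront (this is a hardware detail, not an OpenCL guarantee):

```c
#include <assert.h>

#define WAVEFRONT_SIZE 64  /* AMD-specific; query your device in real code */

/* Wavefronts occupied by one work-group: a partial wavefront
   still takes a full dispatch slot, so round up. */
int wavefronts_for_group(int work_group_size)
{
    return (work_group_size + WAVEFRONT_SIZE - 1) / WAVEFRONT_SIZE;
}
```

So a 128-wide work-group occupies 2 wavefronts, and a 65-wide one also occupies 2 (wasting most of the second).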
The reason the article might not mention the optimal work-group topology is that it depends on the hardware. For AMD hardware it should be a multiple of 64, and often 64 itself is a pretty good sweet spot: work-item synchronisation (barriers) is much more efficient, and the lower resource requirements (local memory, registers, etc.) allow more work-groups to run concurrently on the same CU.
Note that for algorithm choices the numbers work out roughly the same on NVIDIA hardware or GCN hardware as well, so the precise hardware details aren't (usually!) too important: everything's been optimised for the same types of workload and the programming models are very similar.
See also chapter 1.2 of the AMD APP programming guide.