"4 Compute Units/1 Group" is not accurate.
There are 10 compute units, each with 16 thread processors (or whatever they are calling them these days; it's hard to keep track), and each thread processor (TP) is a 5-wide VLIW processor.
The "groups" (I will call them wavefronts) are 64 threads each. The threads are organized into 16 quads of 2x2 threads, or at least they were last time I checked.
Then each compute unit (I will call them SIMD engines from here on out) has two slots for wavefronts, odd and even.
So, you have two wavefronts (one per slot) running eight instructions over eight cycles on one SIMD engine. Each wavefront is split into 16 quads (one quad per thread processor), and each quad is 2x2 threads.
So, 16*2*2 = 64 threads = one wavefront, and you have two wavefronts running 8 instructions over 8 cycles (so they say).
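If it helps to see the arithmetic spelled out, here is a quick Python sketch using only the numbers from this post. The variable names are mine, not official terminology, and the "threads in slots" figure is just the per-engine numbers multiplied out across the chip.

```python
# Numbers taken from the post; the names are made up for illustration.
SIMD_ENGINES = 10          # the "compute units"
THREAD_PROCESSORS = 16     # per SIMD engine
VLIW_WIDTH = 5             # ALU lanes per thread processor
THREADS_PER_QUAD = 2 * 2   # a quad is 2x2 threads
WAVEFRONT_SLOTS = 2        # odd and even, per SIMD engine

# One quad per thread processor, so 16 quads make one wavefront.
threads_per_wavefront = THREAD_PROCESSORS * THREADS_PER_QUAD

# Total VLIW lanes across the whole chip.
alu_lanes = SIMD_ENGINES * THREAD_PROCESSORS * VLIW_WIDTH

# Threads sitting in the odd/even slots across all SIMD engines.
threads_in_slots = SIMD_ENGINES * WAVEFRONT_SLOTS * threads_per_wavefront

print(threads_per_wavefront)  # 64
print(alu_lanes)              # 800
print(threads_in_slots)       # 1280
```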
Hope this helps.
Also, "run at the same time" is somewhat tricky terminology.
Technically, only two wavefronts run on a SIMD engine at any ONE time; however, more wavefronts are queued and scheduled based on resource usage (essentially GPRs used). This allows a running wavefront to be switched out for a waiting one at the end of an ISA clause to better hide latency (for example, if the WF running is doing some fetching and the WF waiting will use the ALU units).
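As a rough way to picture the "scheduled based on GPRs used" part, here is a small Python sketch. It assumes the only limit is a fixed register budget per SIMD engine, which is a simplification. The function name and the 16384 budget are made up for illustration and are not a hardware spec, so plug in real numbers from the docs for your chip.

```python
WAVEFRONT_SIZE = 64  # threads per wavefront, from the post


def resident_wavefronts(gprs_per_simd, gprs_per_thread,
                        wavefront_size=WAVEFRONT_SIZE):
    """Wavefronts that fit in one SIMD engine's register budget.

    Hypothetical model: registers are the only limiting resource.
    """
    if gprs_per_thread <= 0:
        raise ValueError("gprs_per_thread must be positive")
    return gprs_per_simd // (gprs_per_thread * wavefront_size)


# Illustrative budget only; NOT a quoted hardware figure.
ASSUMED_GPRS_PER_SIMD = 16384

for gprs in (8, 16, 32, 64):
    print(gprs, resident_wavefronts(ASSUMED_GPRS_PER_SIMD, gprs))
```

The point is just the trend: the more GPRs each thread uses, the fewer wavefronts can be queued up to swap in, so there is less to hide latency with.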
You can find A LOT of VERY useful information on both the ATI Stream forum and the AMD OpenCL forum (the developer forums, that is) simply by searching them. Try keywords like "wavefront".