The parallel performance of a GPU is commonly an order of magnitude higher than a CPU's: a high-end card delivers above 1 TFLOPS, while a CPU manages on the order of tens of GFLOPS.
The reason is that a GPU compute unit consists of tens or hundreds of stream cores (and each stream core itself contains 4 or 5 ALUs), while one CPU core can process at most 4 values at the same time (SSE vector instructions).
That's why we offload the work to the GPU - because it is faster.
Max workgroup size isn't relevant here - it roughly says how many work-items (threads) can be scheduled as a unit, but it has no direct relation to performance (on a CPU core, the work-items of a group are serialized).
Well, it isn't always true. There are lots of examples where the CPU is faster than the GPU. Anything that can get good cache utilisation, anything where there's a good serial optimisation that you can't reproduce in parallel code, anything that maps badly to SIMD execution.
I would avoid using a workgroup size bigger than 1 for the CPU (and to get full utilisation you have to pack data into vectors manually; hopefully at some point the compiler will do that for you, and then you could use a workgroup size of 4), so the larger max workgroup size is simply an irrelevance there.
The number of workgroups doesn't matter so much. Each one is a single CPU thread or up to four GPU threads, which maps fairly directly to the number of threads active on the machine. The groups are then rotated through as earlier ones finish, whichever device you're using. The CPU will do that with a little more overhead, but probably not too much.