Well, it isn't always true. There are lots of examples where the CPU is faster than the GPU: anything that can get good cache utilisation, anything where there's a good serial optimisation that you can't reproduce in parallel code, and anything that maps badly to SIMD execution.
I would avoid using a workgroup size bigger than 1 for the CPU, so the max workgroup size being larger is just an irrelevance. To get full utilisation at that size you have to pack into vectors manually; hopefully we'll have the compiler doing that at some point, and then you can use a workgroup size of 4.
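To make "pack into vectors manually" a bit more concrete, here is a rough sketch in plain C rather than real kernel code. The `float4_t` struct and the `load4`, `store4`, `add4` and `add_packed` helpers are made-up names standing in for what you would write with OpenCL's `float4` in a kernel, and the loop index plays the role of the work-item id. The point is just that each work item handles four elements instead of one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for OpenCL's float4; plain C has no built-in vector type. */
typedef struct { float x, y, z, w; } float4_t;

static float4_t load4(const float *p)
{
    float4_t v = { p[0], p[1], p[2], p[3] };
    return v;
}

static void store4(float *p, float4_t v)
{
    p[0] = v.x; p[1] = v.y; p[2] = v.z; p[3] = v.w;
}

static float4_t add4(float4_t a, float4_t b)
{
    float4_t r = { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w };
    return r;
}

/* Element-wise add where each "work item" processes one packed group of
   four floats. In a kernel, `item` would come from the global id. */
void add_packed(const float *a, const float *b, float *out, size_t n)
{
    assert(n % 4 == 0); /* sketch assumes the length is a multiple of 4 */
    for (size_t item = 0; item < n / 4; ++item) {
        size_t base = item * 4;
        store4(out + base, add4(load4(a + base), load4(b + base)));
    }
}
```

With this layout you launch a quarter as many work items, each doing four lanes of work, which is what lets a workgroup size of 1 still fill the CPU's vector units.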
The number of workgroups doesn't matter so much. Each one's a single CPU thread or up to four GPU threads. That maps fairly directly to the number of threads active on the machine. The groups are then rotated through as earlier ones finish, whichever device you're using. The CPU will do that with a little more overhead, but probably not too much.
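As a toy model of the "rotated through as earlier ones finish" behaviour, the sketch below hands workgroups out in order to whichever execution slot frees up first and reports when the last one finishes. It is only an illustration, not how any real runtime is implemented: `drain_time`, `MAX_SLOTS` and the per-group `cost` values are all invented for the example, and it ignores scheduling overhead entirely.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_SLOTS = 64 }; /* arbitrary cap for this sketch */

/* Returns the time at which the last workgroup finishes when groups are
   issued in order to the slot that becomes free earliest.
   cost[g] is a hypothetical runtime for group g, in arbitrary units. */
unsigned long drain_time(const unsigned *cost, size_t num_groups, size_t slots)
{
    assert(slots > 0 && slots <= MAX_SLOTS);

    unsigned long busy_until[MAX_SLOTS] = {0};
    unsigned long finish = 0;

    for (size_t g = 0; g < num_groups; ++g) {
        /* Find the slot that frees up first. */
        size_t s = 0;
        for (size_t i = 1; i < slots; ++i)
            if (busy_until[i] < busy_until[s])
                s = i;

        /* Run this group on that slot. */
        busy_until[s] += cost[g];
        if (busy_until[s] > finish)
            finish = busy_until[s];
    }
    return finish;
}
```

In this model, once there are at least as many groups as slots, adding more groups just adds more rounds, which is why the exact count matters less than keeping every thread busy.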