Why is the CPU always slower than the GPU even though it has a larger maximum work-group size and work-item size? In my case the global work size is the same for the GPU and the CPU, but the local work size is larger for the CPU to maximize CPU utilization. Is there an explanation for this?
For example, if the CPU and the GPU have the same number of compute units but the CPU has a larger maximum work-group size, is the GPU still faster? Unfortunately I do not have an 8-core or 16-core CPU, so I cannot try it.
Well, it isn't always true. There are lots of examples where the CPU is faster than the GPU: anything that gets good cache utilisation, anything where there's a good serial optimisation that you can't reproduce in parallel code, and anything that maps badly to SIMD execution.
I would avoid using a workgroup size bigger than 1 for the CPU, so the max workgroup size being larger is irrelevant. To get full utilisation you have to pack into vectors manually; hopefully we'll have the compiler doing that at some point, and then you can use a workgroup size of 4.
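To make "pack into vectors manually" concrete, here is a rough sketch in plain Python rather than real OpenCL. It assumes a simple element-wise add; `scalar_kernel`, `packed_kernel`, and `run` are made-up names for illustration, and the serial loop in `run` only stands in for a kernel launch. In actual kernel code the packed version would operate on a vector type such as `float4`.

```python
# Plain-Python stand-in for two kernel shapes; no OpenCL runtime is used.

def scalar_kernel(gid, a, b, out):
    # One element per work item.
    out[gid] = a[gid] + b[gid]

def packed_kernel(gid, a, b, out, width=4):
    # One work item handles `width` contiguous elements,
    # mimicking what a 4-wide vector add would do in a single operation.
    base = gid * width
    for lane in range(width):
        out[base + lane] = a[base + lane] + b[base + lane]

def run(kernel, global_size, *args):
    # Hypothetical launcher: calls the kernel once per global ID, serially.
    for gid in range(global_size):
        kernel(gid, *args)

n = 16  # assumed to be a multiple of 4
a = [float(i) for i in range(n)]
b = [2.0 * i for i in range(n)]
out_scalar = [0.0] * n
out_packed = [0.0] * n

run(scalar_kernel, n, a, b, out_scalar)       # 16 work items, 1 element each
run(packed_kernel, n // 4, a, b, out_packed)  # 4 work items, 4 elements each
```

Both launches produce the same result; the packed one just uses a quarter of the global work size, with each work item doing a vector's worth of work.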
The number of workgroups doesn't matter so much. Each one's a single CPU thread or up to four GPU threads. That maps fairly directly to the number of threads active on the machine. The groups are then rotated through as earlier ones finish, whichever device you're using. The CPU will do that with a little more overhead, but probably not too much.
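The rotation described above can be modelled very roughly as follows. This is only a toy model, not how any particular runtime schedules work: it assumes every workgroup takes the same time, ignores scheduling overhead, and the names `count_groups` and `makespan` are invented for the example.

```python
import heapq

def count_groups(global_size, local_size):
    # Assumes global_size is an exact multiple of local_size.
    return global_size // local_size

def makespan(num_groups, num_threads, group_time=1.0):
    # Greedy rotation: each new workgroup goes to whichever thread frees up first.
    free_at = [0.0] * num_threads
    heapq.heapify(free_at)
    for _ in range(num_groups):
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + group_time)
    # Total time is when the last thread finishes.
    return max(free_at)

groups = count_groups(1024, 1)       # workgroup size 1, as suggested for the CPU
total = makespan(groups, num_threads=4)
```

In this model the total time depends on how many groups there are per active thread and how long each group takes, not on the workgroup count by itself.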