Each hardware thread on the GPU is a 64-wide vector (a wavefront). So the GPU executes 64 work items on a given "cycle" (actually 8 clock cycles, but it's effectively indivisible). In addition, the GPU can switch between wavefronts very rapidly and relies on having a large number of wavefronts active, maybe 4 or 8 on each SIMD core, to hide latency. On the GPU I tend to advise using a work group size of 64 unless a larger one has really visible benefits in terms of increasing the number of work items that share data in LDS. A single 64-wide work group (with compiler hints) fits in one wavefront, which allows the compiler to remove barriers from the code; that reduces synchronization overhead when no synchronization is really needed.
The CPU executes a single work item at a time. If you make a work group of more than one work item, then at the end of each work item the runtime has to jump to the next. If you put a barrier in, it is worse, because when the CPU thread hits the barrier it has to store the work item's state and jump to the next work item. There is little scope for cross-work-item optimisation and barriers carry fairly high overhead, so optimally you want only one work item per work group and to manage the looping manually.
This is why I don't like people using the word "thread" to describe a work item because it confuses the situation between the actual threads that execute and the work items executing on them.
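As a rough illustration of the CPU behaviour described above, here is a minimal Python sketch. It is not how any actual OpenCL runtime is implemented. The kernel, the function names and the use of a generator `yield` to stand in for a barrier are all assumptions made for this example. It contrasts a multi-item work group, where the single CPU thread must suspend and resume every work item at the barrier, with a single work item that loops manually.

```python
def kernel(local_id, lds, data):
    """Hypothetical kernel body: write to LDS, barrier, read a neighbour's value."""
    lds[local_id] = data[local_id] * 2
    yield  # stands in for barrier(CLK_LOCAL_MEM_FENCE): state is saved here
    data[local_id] = lds[(local_id + 1) % len(lds)]


def run_work_group(data):
    """Run one work group on one 'CPU thread', a single work item at a time.

    Returns how many times the thread had to jump from one work item to another.
    """
    lds = [0] * len(data)
    live = [kernel(i, lds, data) for i in range(len(data))]
    switches = 0
    while live:
        still_running = []
        for item in live:
            try:
                next(item)                  # run until the next barrier or the end
                still_running.append(item)  # suspended at a barrier
            except StopIteration:
                pass                        # this work item has finished
            switches += 1
        live = still_running
    return switches


def run_single_work_item(data):
    """Same result with one work item that loops manually: no barrier, no jumps."""
    doubled = [x * 2 for x in data]
    for i in range(len(data)):
        data[i] = doubled[(i + 1) % len(data)]


a = [1, 2, 3, 4]
print(run_work_group(a), a)
b = [1, 2, 3, 4]
run_single_work_item(b)
print(b)
```

In this toy model the four-item group costs two jumps per work item (one at the barrier, one at the end), while the manual loop produces the same data with none.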
How are work groups mapped on the CPU? One work group -> one CPU thread?
One work group is more like a task in a thread pool API like TBB or ConcRT. There is a CPU thread pool of one thread per core. Those threads will pull work groups from the pool appropriately (the mapping may be fixed, I forget how the work distribution is implemented). So actually each thread created by the operating system processes a large number of work groups. Each CPU core processes one group at a time, though, yes.
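The task-pool analogy can be sketched in Python with the standard `concurrent.futures.ThreadPoolExecutor`. This is only an illustration of the mapping, not the runtime's actual scheduler. `run_ndrange`, `run_group` and `vec_add` are hypothetical names, and the sketch assumes the global size is a multiple of the local size.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor


def run_ndrange(global_size, local_size, kernel, num_threads=None):
    """Treat each work group as one task pulled by a pool of one thread per core."""
    if global_size % local_size != 0:
        raise ValueError("this sketch assumes global_size is a multiple of local_size")
    num_threads = num_threads or os.cpu_count() or 1
    num_groups = global_size // local_size

    def run_group(group_id):
        # One pool thread runs the whole group, one work item after another.
        for local_id in range(local_size):
            kernel(group_id * local_size + local_id)
        return threading.get_ident()  # which OS thread processed this group

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(run_group, range(num_groups)))


# Usage: a simple vector addition over 1024 work items in groups of 64.
a = list(range(1024))
b = [2 * x for x in a]
c = [0] * 1024


def vec_add(gid):
    c[gid] = a[gid] + b[gid]


thread_ids = run_ndrange(1024, 64, vec_add)
print(len(thread_ids), "work groups ran on", len(set(thread_ids)), "pool thread(s)")
```

The number of distinct thread IDs never exceeds the pool size, so each OS thread ends up processing many work groups, one at a time.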
After executing a simple kernel (actually, a simple vector addition) for a while using different global and local work sizes, I have some more questions.
I tried setting globalWorkSize = 1 and localWorkSize = 1, and the kernel executed faster on the CPU than on the GPU. I know it is not a good idea to run with a global work size of 1 and a local work size of 1 on the GPU; I was just curious.
1. What happens on the GPU and on the CPU when we execute with global work size = 1 and local work size = 1?
2. Is there any relationship between the global and local work sizes and performance? I mean, if I have one CPU with max work group size = 1024 and another CPU with max work group size = 256, will the first CPU be better?
The size of the work group can greatly affect performance. And it is certainly expected that the CPU would be faster if only one work item is run.
GPUs are not good at serial execution. To beat the CPU you must have a parallel algorithm and a large number of threads to properly utilize all the stream cores of the GPU. Some reasons that might clear things up for you:
The CPU runs at a much higher clock frequency (~3 GHz) and also has large caches, which reduce global access latency. The GPU, by contrast, has a large number of cores that can execute an instruction together (1600 PEs for Cypress), and wavefront scheduling is used to cover global access latencies, as caches are not very effective in many cases.
Sorry about the delay, I've been wandering along the Cornish coast.
If you set the local size to 1, the GPU will execute one work item on a full wavefront. That means the hardware thread context that is capable of executing the instruction over 64 data items will in fact execute it over only one, with the other 63 lanes masked out but still occupying their execution slots. It would be horrendously inefficient.
If you also set the global size to 1, you'll have only one wavefront active in the system, when the GPU needs 4 or more per core; given 24 cores (6970), that means 96 waves to fill the system. That's a lot of wasted resources.
Of course, a global size of 1 is bad on the CPU too: now that you can have up to 12 CPU cores, utilising only one is poor for efficiency.
There is no relationship between max group size and performance that I can picture, beyond the fact that a large group would let you use more of the cache if you insist on making each work item trivial in complexity. The max size for the CPU is purely a software limit. The max size for the GPU is based on the capabilities of the hardware schedulers, but in most cases multiple groups cover latency as well as, if not better than, a single large group.
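The wavefront arithmetic in this reply can be written out as a small Python sketch. The function names are hypothetical. The figures of 64 lanes per wavefront, 4 wavefronts per core and 24 cores are the ones quoted in the thread. The derived minimum global size assumes every wavefront is completely full and takes 4 wavefronts per core as the lower bound.

```python
import math

WAVEFRONT_SIZE = 64  # lanes per wavefront, as stated in the thread


def lane_utilisation(local_size):
    """Fraction of wavefront lanes doing useful work for one work group."""
    waves = math.ceil(local_size / WAVEFRONT_SIZE)  # wavefronts the group occupies
    return local_size / (waves * WAVEFRONT_SIZE)


def waves_to_fill(simd_cores, waves_per_core=4):
    """Wavefronts needed to give every SIMD core its minimum share."""
    return simd_cores * waves_per_core


def min_global_work_items(simd_cores, waves_per_core=4):
    """Work items needed if every one of those wavefronts is completely full."""
    return waves_to_fill(simd_cores, waves_per_core) * WAVEFRONT_SIZE


print(lane_utilisation(1))        # local size 1: 1 of 64 lanes is active
print(lane_utilisation(64))       # local size 64: every lane is active
print(waves_to_fill(24))          # 24 cores at 4 waves each
print(min_global_work_items(24))  # work items behind those waves
```

With a local size of 1 only 1/64 of the lanes do useful work, and filling 24 cores at 4 wavefronts each takes 96 wavefronts, or 6144 work items if each wavefront is full.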