
Journeyman III

CPU is always slower?


Another question:

Why is the CPU always slower than the GPU, even though it has a larger work-group size and work-item count? In my case the global work size is the same for the GPU and the CPU, but the local work size is larger for the CPU to maximize CPU utilization. Any explanation for this?

For example, if the CPU and the GPU have the same number of compute units but the CPU has a larger max work-group size, is the GPU still faster? Unfortunately I do not have an 8-core or 16-core CPU, so I cannot try it.


2 Replies
Journeyman III

The parallel performance of a GPU is generally an order of magnitude higher than a CPU's - the performance of a high-end card is above 1 TFLOPS, while a CPU's is on the order of tens of GFLOPS.
The reason is that a GPU compute unit consists of tens or hundreds of stream cores (and each stream core itself contains 4 or 5 ALUs), while one CPU core can process at most 4 values at the same time (SSE vector instructions).
That's why we offload the work to the GPU - because it is faster.

Max work-group size isn't relevant - it roughly says how many work-items (threads) can be scheduled as a unit, but it has no bearing on performance (on a CPU core, the work-items are serialized).

Well, it isn't always true. There are lots of examples where the CPU is faster than the GPU. Anything that can get good cache utilisation, anything where there's a good serial optimisation that you can't reproduce in parallel code, anything that maps badly to SIMD execution.

I would avoid using a workgroup size bigger than 1 for the CPU (to get full utilisation you have to pack into vectors manually; hopefully we'll have the compiler doing that at some point, and then you can use a workgroup size of 4), so the max workgroup size being larger is just an irrelevance.

Number of workgroups doesn't matter so much. Each one is a single CPU thread, or up to four GPU threads. That maps fairly directly to the number of threads active on the machine. The groups are then rotated through as earlier ones finish, on whichever device you're using. The CPU will do that with a little more overhead, but probably not too much.