Hi there! I have a very simple task: scan over a char array multiple times (16*1024 passes). I implemented it with pthreads, with one thread pinned to one CPU core; it takes 23 s. Then I used device fission to create an OpenCL device containing only one CPU compute unit (i.e., one CPU core), and it takes only 17 s. In my opinion, the OpenCL implementation should be slower than the pthread one (because C is closer to the hardware). How come I get these results?
Your assumption is incorrect.
C is only close to your hardware if you use intrinsics correctly; otherwise, the compiler can only exploit the highest compile target you allow it, which may be well below what your CPU actually supports.
Even if you allow it to match your specific hardware, there's no guarantee it will be able to emit optimal code, for historical reasons (possibly unaligned memory, sharing semantics, and so on).
The OpenCL C kernel runs in a somewhat "protected" model in which stronger assumptions can be made; in this case we're likely talking about aligned memory access. Many sources of inefficiency are removed, and even though the compiler must work within a much shorter time budget, it still has a better chance of optimizing than a fully generic offline compiler.
FYI, what you have observed happens even with more involved workloads and on a variety of processors. That's a clear demonstration of the programming model's superiority.
Not necessarily true. OpenCL can end up more cache-friendly, or some other factor can make it faster than a naive implementation.