Hi there! I have a very simple task: scan over a char array multiple times (16*1024). I implemented it with pthreads, using one thread on one CPU core, and it takes 23 s. Then I used OpenCL device fission to create a sub-device containing only one CPU compute unit (i.e., one CPU core), and there the time is only 17 s. I expected the OpenCL implementation to be slower than the pthread one, because plain C is closer to the hardware. How come I get these results?
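
To make the comparison concrete, here is a minimal sketch of the kind of pthread baseline described above. It is not the original code: the names `scan_job`, `scan_worker` and `run_scan` are hypothetical, the scan is assumed to be a plain byte sum, and 16*1024 is taken to be the number of passes over the array.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical job description: the post only says "scan a char array
 * multiple times", so a byte sum stands in for the real work. */
typedef struct {
    const unsigned char *buf;  /* array to scan */
    size_t len;                /* number of bytes in buf */
    int passes;                /* e.g. 16 * 1024, assumed to be the pass count */
    unsigned long checksum;    /* result, written by the worker */
} scan_job;

/* Worker: walks the whole array `passes` times and accumulates the bytes. */
static void *scan_worker(void *arg) {
    scan_job *job = arg;
    unsigned long sum = 0;
    for (int p = 0; p < job->passes; ++p) {
        for (size_t i = 0; i < job->len; ++i) {
            sum += job->buf[i];
        }
    }
    job->checksum = sum;
    return NULL;
}

/* Runs the scan on a single worker thread and returns the elapsed
 * wall-clock time in seconds, or -1.0 if the thread could not be created. */
static double run_scan(scan_job *job) {
    struct timespec t0, t1;
    pthread_t tid;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (pthread_create(&tid, NULL, scan_worker, job) != 0) {
        return -1.0;
    }
    pthread_join(tid, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling `run_scan` on a filled-in `scan_job` gives the time to compare against the OpenCL run. The sketch does not pin the thread to a core, and the result depends heavily on the compiler flags used (for example `-O0` versus `-O2 -pthread`), so both should be stated when reporting timings.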