I am running OpenCL on a Core i7 860 with an ATI HD 5770. I can get a kernel to execute in 16 us on the GPU, but the same kernel on the CPU device takes a minimum of 65 us. Since the GPU path has to go over the PCIe bus, I would expect it to have the higher launch overhead, so this makes no sense. Has anybody seen anything similar? On Windows, a thread that yields with Sleep(0) typically wakes up again within 2 us, so thread wake-up alone cannot account for 65 us, and the OpenCL CPU code path looks unusually slow. This makes the GPU look better than the CPU, even though threaded and vectorized CPU code written with OpenMP instead of OpenCL would perform much better. I am running the generic a = b + c kernel example at various vector sizes and loop counts.
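
For comparison, here is a minimal sketch of the kind of OpenMP CPU baseline meant above. It is not the code that was actually measured. The function names (vec_add, now_us, min_vec_add_us) are made up for illustration. timespec_get is used only because it is standard C11; on Windows, QueryPerformanceCounter would be the usual timer.

```c
#include <time.h>

/* Hypothetical CPU baseline for the a = b + c kernel.
   The pragma is ignored if OpenMP is not enabled at compile time. */
void vec_add(float *a, const float *b, const float *c, int n)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}

/* Current time in microseconds (assumption: C11 timespec_get is available). */
double now_us(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec * 1e6 + (double)ts.tv_nsec * 1e-3;
}

/* Calls vec_add `loops` times and returns the fastest single call in us.
   Returns 0.0 if loops <= 0. */
double min_vec_add_us(float *a, const float *b, const float *c,
                      int n, int loops)
{
    double best = 0.0;
    int k;
    for (k = 0; k < loops; ++k) {
        double t0 = now_us();
        vec_add(a, b, c, n);
        double dt = now_us() - t0;
        if (k == 0 || dt < best)
            best = dt;
    }
    return best;
}
```

Build with OpenMP enabled (for example -fopenmp with GCC or /openmp with MSVC), then call min_vec_add_us at the same vector sizes and loop counts as the OpenCL runs. For very small vectors the cost of starting the parallel region will probably dominate, so it may be worth timing the loop without the pragma as well.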