I use a multi-core CPU to run OpenCL programs.
I tried to "hide latency" by setting the third parameter of clEnqueueReadBuffer(), "blocking_read", to CL_FALSE, but the read never seems to actually execute until clFinish() is called.
So I wonder: do only the kernels run in parallel?
I believe this was answered in an earlier thread somewhere. If I remember correctly, Micah replied that the standard doesn't offer any guarantees of asynchronous behavior. What it does guarantee is synchronous behavior when blocking_read is true. (Which I took to mean that ATI doesn't do asynchronous copies at the moment.)
You should be able to force an asynchronous copy if you do the read on another thread/command queue, though.
From what I understand, AMD's OpenCL implementation is basically lazy. If nothing forces the work to be done (a flush or finish command, or a blocking command), the implementation will sit down and watch sports centre with a cool beer in hand, hoping that you'll forget about ever issuing the work in the first place.
At least that's been my experience.