Is non-blocking enqueue possible in the current version of OpenCL? I want to run a kernel simultaneously on two devices (CPU and GPU, or two GPUs). How do I do that?
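It's more than possible, it's the default. Kernel enqueues (clEnqueueNDRangeKernel) are always non-blocking; only the buffer read/write/map calls can block, and only if you pass CL_TRUE for their blocking flag.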
To use multiple devices you'd have a queue for each. The important thing is to *flush* both queues (clFlush) before you wait on anything, because that ensures the work is actually pushed out to the device. Implementations are allowed to be very lazy about pushing work out (so they can build the biggest batches possible).
I have launched two kernels, one on the GPU and another on the CPU. How do I measure the performance of both kernels? I couldn't use the APP Profiler.
No, quite the opposite. Out-of-order queues are not yet supported, so each set of commands that you want to execute without dependencies on each other needs to go in a separate queue.
You need one queue per device, and you need to flush the queues for all devices before you wait on any of them. Waiting on one queue blocks the calling thread, so you will not get a chance to request that the second queue start executing. If you flush both, then both can start in the background, and only then do you wait on one or both.