Create multiple contexts on the same device; those overlap well (at least on HD 5..,6..,7970).
Some months ago I tried the out-of-order flag (CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE), but it did nothing: everything in a single context still ran sequentially.
I tried the solution you suggested, but it's still not working.
Now I have compute_context and data_context. My clCreateBuffer() calls are made in data_context.
With two contexts, I am seeing a strange problem: if I use clSetEventCallback() to register a callback on an event returned by clEnqueueReadBuffer()/clEnqueueWriteBuffer() (in data_context), the callback is never invoked.
The callbacks I register on events from clEnqueueNDRangeKernel() (in compute_context) are invoked as expected.
Any idea why the event callbacks are not being invoked in data_context?
I guess you are trying to share data between the kernels, so OpenCL synchronizes them.
My suggestion worked in the following scenario:
There was a kernel that ran for 0.5 seconds, with a small upload at the start and a small download at the end (a few megabytes only).
So I made 2 kernels on 2 contexts, and the data was also allocated separately on the 2 contexts (no interference). Unfortunately, this needs twice the parallelism.
I had a 20 ms timer function that polled the two kernels and controlled them: when one of the kernels was about to finish, it launched the other one (on the other context).
This way, with 2 totally independent contexts, I got a 2-3% speedup and 99% ALU utilization.
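To make the polling scheme concrete, here is a rough sketch of the control logic as a plain C simulation, with no OpenCL calls at all. Everything in it is an assumption for illustration: the `Slot` struct and `simulate_ping_pong()` are hypothetical names, the kernel duration is taken as known in advance, and the device is modelled as running the early-launched kernel back to back with the previous one. In real host code each slot would wrap its own context, queue and completion event.

```c
#include <assert.h>

/* Hypothetical stand-in for one context: real code would hold a cl_context,
   a cl_command_queue and the completion event of the in-flight kernel. */
typedef struct {
    int busy;       /* 1 while a kernel is queued or running on this slot */
    int finish_ms;  /* simulated time at which that kernel ends */
    int launches;   /* number of kernels started on this slot */
} Slot;

/* Simulates the poll timer for total_ms milliseconds.
   Every poll_ms the timer retires finished kernels and, if the running
   kernel will finish before the next tick, launches the next kernel on the
   other slot. Returns the time (ms) the device spent with nothing to run. */
int simulate_ping_pong(int total_ms, int kernel_ms, int poll_ms, Slot slots[2])
{
    assert(total_ms >= 0 && kernel_ms > 0 && poll_ms > 0);

    int device_free_at = 0; /* time at which the last queued kernel ends */
    int idle_ms = 0;
    int last = 1;           /* chosen so that the first launch uses slot 0 */

    for (int i = 0; i < 2; ++i) {
        slots[i].busy = 0;
        slots[i].finish_ms = 0;
        slots[i].launches = 0;
    }

    for (int t = 0; t < total_ms; t += poll_ms) {
        /* Retire kernels that have finished by this tick. */
        for (int i = 0; i < 2; ++i)
            if (slots[i].busy && slots[i].finish_ms <= t)
                slots[i].busy = 0;

        int next = 1 - last;
        int nothing_running = !slots[0].busy && !slots[1].busy;
        int about_to_finish = slots[last].busy && !slots[next].busy &&
                              slots[last].finish_ms - t <= poll_ms;

        if (nothing_running || about_to_finish) {
            /* The new kernel starts once the device is free; launching it
               early is what hides the gap between two kernels. */
            int start = t > device_free_at ? t : device_free_at;
            idle_ms += start - device_free_at;
            slots[next].busy = 1;
            slots[next].finish_ms = start + kernel_ms;
            slots[next].launches++;
            device_free_at = slots[next].finish_ms;
            last = next;
        }
    }
    return idle_ms;
}
```

Calling `simulate_ping_pong(2000, 500, 20, slots)` models the 0.5 second kernel and the 20 ms timer from above. In this toy model the launches alternate between the two slots and the device never sits idle, which is the effect the two-context setup is after. It says nothing about the real transfer behaviour of any driver.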
If your kernel accesses RAM more frequently, then the situation is a lot harder.