I have a series of nd-ranges that run correctly with the 12.x drivers. Each nd-range has an event dependency on the previous nd-range. With the 13.x drivers (13.1, 13.4, 13.6beta), the output is incorrect. The problem appears to occur after an nd-range composed of a single work-group has run.
* If there is a clFinish after this nd-range, the output becomes correct (though this is not an option as I can't have a blocking call in the middle of the execution).
* Adding a clFlush after the nd-range is enqueued, which introduces a small gap in execution before the next nd-range, also makes the output correct.
* Adding a clEnqueueBarrierWithWaitList to the command queue does not help; the results are still incorrect.
Since the nd-range is only composed of a single work-group, is it possible that the driver is scheduling the following nd-range on available compute units even though a dependency exists? I'm running with an in-order command queue and my GPU is a 7850.
I'd appreciate any thoughts or suggestions!