I'm new to OpenCL development and I want to understand some of its mechanics.
I have a simple matrix multiplication kernel, and I want to see the impact of the blocking option on the clEnqueue* calls.
So I run the computation twice: once with blocking writes and reads, and once with non-blocking ones.
When I look at the profiling timestamps, the blocking version shows them in sequential order (enqueue, submit, kernel start, kernel end), but in the non-blocking version the kernel appears to execute before it has been queued and submitted.
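To make the ordering check concrete, here is a minimal sketch in plain C. It is not my actual code: the struct and function names are hypothetical, and it assumes the four values were already read with clGetEventProfilingInfo from the kernel's own event, on a queue created with CL_QUEUE_PROFILING_ENABLE, after that event completed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* The four profiling timestamps of ONE command, in nanoseconds.
 * Assumption: in real code each field is filled by a call such as
 *   clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_QUEUED,
 *                           sizeof(cl_ulong), &value, NULL);
 * using CL_PROFILING_COMMAND_QUEUED, _SUBMIT, _START and _END,
 * all on the same event. */
typedef struct {
    uint64_t queued;
    uint64_t submit;
    uint64_t start;
    uint64_t end;
} profile_times; /* hypothetical name */

/* Returns 1 when queued <= submit <= start <= end, 0 otherwise. */
int times_in_order(const profile_times *t) {
    assert(t != NULL);
    return t->queued <= t->submit &&
           t->submit <= t->start &&
           t->start  <= t->end;
}

/* Prints the three intervals as signed values, so that an
 * out-of-order pair shows up as a negative duration. */
void print_intervals(const profile_times *t) {
    assert(t != NULL);
    printf("queued -> submit: %lld ns\n",
           (long long)(int64_t)(t->submit - t->queued));
    printf("submit -> start : %lld ns\n",
           (long long)(int64_t)(t->start - t->submit));
    printf("start  -> end   : %lld ns\n",
           (long long)(int64_t)(t->end - t->start));
}
```

With the blocking version I would expect times_in_order to return 1. What I observe in the non-blocking version corresponds to it returning 0.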
Can someone explain this behaviour to me? Thank you very much.