I understand that within one context, kernels complete in FIFO order, i.e. in the same order as the requests were filed in the command queue.
Considering 2 contexts on the same device:
(a) Assume that for context 1, a batch of kernel computations has been launched by filling queue 1 with calls to kernels k1, k2 and k3 and then issuing a flush.
Assume this task T1 (= k1+k2+k3) takes, say, 1 s to complete.
(b) Concurrently, I have another task (T2) comprising kernel k4 that would typically take, say, 1 ms to execute, and I would like this task to complete quickly (i.e. without waiting for T1 to complete).
(1) Is there a way, while T1 is executing, to preempt the GPU by submitting T2 in context 2?
(2) If so, can I assume that T2 could complete before T1?
I tend to think some GPU preemption mechanism should exist, but don't know if this is the way to do it!
Unfortunately, there is no preemption mechanism.
The closest you can get is to break down your task T1 into many smaller sub-tasks (say, by splitting up your domain of execution) and launch those sub-tasks at appropriate intervals. You can then insert T2 in between as needed.
However, the GPU relies on a large domain of execution (a very large number of threads) to achieve its speedups, so at finer levels of granularity there is a trade-off between timing precision and speedup.
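To make the chunking idea concrete, here is a rough host-side sketch in plain Python. It does not talk to a GPU at all: `launch_chunk` is a hypothetical stand-in for whatever call enqueues one sub-range of T1 and waits for it, and `urgent_tasks` is an assumed queue into which the application drops T2 requests. The only point is the scheduling pattern: T2 can run at the next chunk boundary instead of waiting for all of T1.

```python
from collections import deque


def split_domain(total_items, chunk_size):
    """Yield (offset, count) pairs that together cover range(total_items)."""
    for offset in range(0, total_items, chunk_size):
        yield offset, min(chunk_size, total_items - offset)


def run_chunked(total_items, chunk_size, launch_chunk, urgent_tasks):
    """Run T1 chunk by chunk, draining urgent tasks before each chunk.

    launch_chunk(offset, count) is a hypothetical callback that runs one
    sub-range of T1 to completion. urgent_tasks is a deque of zero-argument
    callables (the short T2 jobs).
    """
    for offset, count in split_domain(total_items, chunk_size):
        # Chunk boundary: the only place where T2 can get in.
        while urgent_tasks:
            urgent_tasks.popleft()()
        launch_chunk(offset, count)
    # Run anything that arrived during the last chunk.
    while urgent_tasks:
        urgent_tasks.popleft()()
```

The worst-case wait for T2 is roughly the duration of one chunk, so a smaller `chunk_size` gives lower latency at the cost of less work per launch, which is the trade-off mentioned above.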
Why do you need preemption, btw?
Well, the reason I'm interested in GPU preemption is CPU-GPU co-processing optimization. Consider this loop:
(1) I have a main GPU task T1 that works on a data sample N.
(2) In the meantime, the CPU works on the previous result of the GPU computation, i.e. sample N-1.
This way I can take full advantage of the GPU and CPU working concurrently with only one synchronization point, i.e. at the beginning of each loop iteration.
Now the point is that while the CPU is doing its own work, it could benefit from running short tasks T2 on the GPU, the processing involved in T2 being more suitable for the GPU than for the CPU.
The issue is that T1 is a batch and cannot really be split into pieces. If it were, I'd lose most of the benefits of concurrent GPU-CPU execution (since synchronization schemes would have to be inserted).
So this is why some form of GPU preemption (on a different context, for example) would be highly desirable.
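For reference, the loop in steps (1) and (2) can be sketched as follows. This is only an illustration in plain Python: a single-worker thread pool stands in for the GPU's in-order command queue, and `gpu_task` and `cpu_task` are hypothetical callables representing T1 and the CPU-side processing.

```python
from concurrent.futures import ThreadPoolExecutor


def pipeline(samples, gpu_task, cpu_task):
    """Overlap gpu_task on sample N with cpu_task on the result for N-1."""
    results = []
    # One worker = one in-order (FIFO) queue, standing in for the GPU.
    with ThreadPoolExecutor(max_workers=1) as gpu:
        pending = None
        for sample in samples:
            # Single synchronization point at the top of each iteration:
            # wait for the GPU result of the previous sample.
            previous = pending.result() if pending is not None else None
            # Launch T1 on sample N (returns immediately).
            pending = gpu.submit(gpu_task, sample)
            # Meanwhile the CPU works on the result for sample N-1.
            if previous is not None:
                results.append(cpu_task(previous))
        # Drain the last GPU result once the loop is over.
        if pending is not None:
            results.append(cpu_task(pending.result()))
    return results
```

In this stand-in, a short T2 submitted from inside `cpu_task` to the same single-worker pool would queue behind the T1 already running, which is the situation described above.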