I'm adding a producer-consumer design to an OpenCL program for multiple GPUs, and I have it working at the host-device synchronization level. Now I need finer-grained handling of the command queues. Without adding any synchronization between host and device, the only way I can think of is to query how many commands remain in each device's command queue, find the device with the fewest remaining, and assign the next command to that device.
How can I find out how many commands are waiting to be processed in a command queue?
This is for OpenCL 1.2.
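To make the idea concrete: in OpenCL 1.2, `clGetCommandQueueInfo` only exposes the queue's context, device, reference count and properties, so the count would probably have to be kept on the host. Below is a minimal sketch of that bookkeeping, not a tested implementation. The names `QueueLoad`, `onEnqueued`, `onCompleted` and `leastLoaded` are hypothetical, and the OpenCL calls are only mentioned in comments so the snippet compiles without the OpenCL headers.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side bookkeeping for one device's command queue.
struct QueueLoad {
    // Commands enqueued but not yet reported complete.
    std::atomic<int> pending{0};
};

// Call right after enqueueing a command on the matching queue.
inline void onEnqueued(QueueLoad& q) {
    q.pending.fetch_add(1, std::memory_order_relaxed);
}

// Call from the completion callback, e.g. one registered with
// clSetEventCallback(event, CL_COMPLETE, ...). It only touches an atomic,
// so the callback itself stays short.
inline void onCompleted(QueueLoad& q) {
    q.pending.fetch_sub(1, std::memory_order_relaxed);
}

// Returns the index of the queue with the fewest pending commands.
// Ties go to the lowest index. The counts are a snapshot and may already
// be stale when the caller uses the result.
inline std::size_t leastLoaded(const std::vector<QueueLoad>& queues) {
    assert(!queues.empty());
    std::size_t best = 0;
    int bestCount = queues[0].pending.load(std::memory_order_relaxed);
    for (std::size_t i = 1; i < queues.size(); ++i) {
        const int count = queues[i].pending.load(std::memory_order_relaxed);
        if (count < bestCount) {
            best = i;
            bestCount = count;
        }
    }
    return best;
}
```

The producer would call `leastLoaded`, enqueue on that device, then call `onEnqueued` for it. The counts are only an estimate of what the device still has to do, since a callback can arrive some time after the command actually finished.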
For now, I'm learning how to use callbacks with markers to count things. Does a firing callback halt its command queue? I hope not, because I need finer-grained control with fewer bubbles.
Trying something like this:
but I'm not sure if it's the way to do it.
I have now implemented control logic based on event callbacks, but it leaves gaps of more than 150 microseconds between kernels. Before this, there were only 2-3 microseconds between consecutive kernels (with no sync on the host and no control logic). Purely synchronized control logic (wait, finish) shares the load between GPUs better, but it has the largest gaps between kernels, around 300-400 microseconds. Is there a way to decrease the overhead of callbacks on the device side? Issuing fewer callbacks per kernel (0.1 instead of 1, for example) might make it faster, but then the control logic's awareness of each device's performance would decrease.
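The "0.1 callbacks per kernel" idea could be sketched as below: a marker with a callback is enqueued only after every Nth kernel, and that callback retires N commands at once. This is a rough sketch under assumptions, not measured code. It assumes an in-order queue, so that a marker enqueued with `clEnqueueMarkerWithWaitList` and an empty wait list completes only after the commands before it. The class name `BatchedLoad` and its methods are hypothetical.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical per-queue counter that learns about completions in batches.
class BatchedLoad {
public:
    explicit BatchedLoad(int batch) : batch_(batch) { assert(batch > 0); }

    // Record one enqueued kernel. Returns true when the caller should also
    // enqueue a marker whose callback calls onMarkerCompleted().
    bool onEnqueued() {
        pending_.fetch_add(1, std::memory_order_relaxed);
        if (++sinceMarker_ == batch_) {
            sinceMarker_ = 0;
            return true;
        }
        return false;
    }

    // Call from the marker's completion callback: on an in-order queue,
    // the `batch_` kernels enqueued before the marker have finished.
    void onMarkerCompleted() {
        pending_.fetch_sub(batch_, std::memory_order_relaxed);
    }

    // Estimated commands still in flight; can overestimate by up to
    // batch_ - 1 finished kernels that have not been reported yet.
    int pending() const { return pending_.load(std::memory_order_relaxed); }

private:
    const int batch_;
    int sinceMarker_ = 0;           // touched only by the enqueueing thread
    std::atomic<int> pending_{0};   // shared with the callback thread
};
```

With a batch of 10 the host sees one callback per ten kernels, at the cost of a load estimate that is up to nine kernels out of date. Whether that actually shrinks the gaps between kernels would have to be measured.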
I will test multiple command queues per device when I have time. Maybe this could hide the gap latency, bringing it down to 10-20 microseconds with 10+ queues.
I tested multiple queues: the gap drops to only about 100 microseconds, and only a few times. Could all the callbacks be serialized on their way to the host side, even though they come from different command queues?
The latest shape of the command queue class is this: