I have now implemented control logic based on event callbacks, but it leaves more than 150 microseconds between kernels. Before this, there were only 2-3 microseconds between consecutive kernels (no sync on the host and no control logic). Purely synchronized (wait, finish) control logic shares the load between GPUs better, but it has the largest gap between kernels, around 300-400 microseconds. Is there a way to decrease the overhead of the callback on the device side? Reducing the number of callbacks per kernel (from 1 to 0.1, for example) might make it faster, but then the scheduling would be less performance-aware.
- no sync, no control logic = 2-3 microseconds (only works with identical GPUs)
- (I am looking for something to fill this slot)
- no sync, callback logic = 150-200 microseconds (asymmetric multi-GPU setups are balanced to within 2.5% of the total kernel count)
- sync = 300-400 microseconds (best balance, worst performance)
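To make the "0.1 callbacks per kernel" idea concrete, here is a minimal host-side sketch of the bookkeeping it would need. It is not my actual implementation and it contains no OpenCL calls, so it only simulates the logic: `DeviceTracker`, `needs_callback` and `pick_device` are hypothetical names, and the assumption is that in a real program `on_batch_complete` would be invoked from the function registered with `clSetEventCallback` on the event of the last kernel in each batch.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical per-device bookkeeping kept on the host.
// Assumption: on_batch_complete() is called from the event callback
// attached to the last kernel of a batch, not to every kernel.
struct DeviceTracker {
    std::atomic<long> in_flight{0};  // kernels enqueued but not yet reported complete

    // Called on the host thread every time a kernel is enqueued.
    void on_enqueue() { in_flight.fetch_add(1, std::memory_order_relaxed); }

    // Called once per batch; 'batch' kernels are reported complete together.
    void on_batch_complete(long batch) {
        in_flight.fetch_sub(batch, std::memory_order_relaxed);
    }
};

// True when the kernel with this 0-based index should carry a callback.
// With batch == 10 this gives 0.1 callbacks per kernel.
inline bool needs_callback(long kernel_index, long batch) {
    return (kernel_index + 1) % batch == 0;
}

// Pick the device with the fewest kernels still in flight.
// Ties go to the lowest index. Returns 0 for an empty list.
inline std::size_t pick_device(const std::vector<DeviceTracker>& devices) {
    std::size_t best = 0;
    long best_load = 0;
    for (std::size_t i = 0; i < devices.size(); ++i) {
        long load = devices[i].in_flight.load(std::memory_order_relaxed);
        if (i == 0 || load < best_load) {
            best = i;
            best_load = load;
        }
    }
    return best;
}
```

The trade-off is visible in the counter: between two callbacks, `in_flight` can overestimate the real load by up to `batch - 1` kernels, so a larger batch means fewer callbacks but a coarser view of each GPU's speed.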
I will test multiple command queues per device when I have time. Maybe with 10+ queues this could hide the latency and bring the gap down to 10-20 microseconds.
I tested multiple queues: the gap drops only to about 100 microseconds, and only occasionally. Could all the callbacks be serialized on their way to the host side even though they are on different command queues?
The latest version of the command queue class is this: