My current implementation creates two command queues for a single GPU device:
one queue is used only for memory-transfer APIs such as
clEnqueueReadBuffer, clEnqueueWriteBuffer, and clEnqueueCopyBuffer;
the other queue is used only for compute APIs such as clEnqueueNDRangeKernel.
I synchronize the two queues with shared cl_event objects where necessary.
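For reference, here is a minimal sketch of the setup (it assumes a context `ctx`, device `dev`, kernel `kernel`, and buffer `src` already exist; names are placeholders, not my exact code):

```c
cl_int err;
/* One queue for transfers, one for compute, both on the same device */
cl_command_queue xfer_q = clCreateCommandQueue(ctx, dev, 0, &err);
cl_command_queue comp_q = clCreateCommandQueue(ctx, dev, 0, &err);

/* Stage input on the transfer queue; capture an event for the copy */
cl_event copy_done;
err = clEnqueueWriteBuffer(xfer_q, src, CL_FALSE /* non-blocking */,
                           0, nbytes, host_ptr,
                           0, NULL, &copy_done);

/* The kernel on the compute queue waits on the transfer's event */
size_t gws = N;
err = clEnqueueNDRangeKernel(comp_q, kernel, 1, NULL, &gws, NULL,
                             1, &copy_done, NULL);
```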
In my tests, however, the transfers and kernel execution never overlap. Does this mean that clEnqueueCopyBuffer and clEnqueueNDRangeKernel will execute serially even when they are enqueued on different queues?