calMemCopy + calCtxRunProgramGridArray not running in parallel

The docs say DMA and kernel execution can run asynchronously. I can queue both and then do other work on the CPU, but the total time I measure is the sum of the two times rather than the max. So in my case they are not running in parallel. This should be possible, no?

I understand from the docs that the DMA may need a second context so that it has its own command queue. I have tried with and without one, and it did not help. The memory resource being copied is bound as a memory object in both contexts, although it is not bound to the kernel at the time of the copy (I'm using a double buffering system). Might this prevent parallel execution? I set it up this way because the memory objects must be preallocated: allocating them inside the loop is too expensive.
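To make the "sum versus max" expectation concrete, here is a small timing model of a double-buffered loop. It is only a sketch and does not call CAL: the function names and the example figures (4 ms per copy, 6 ms per kernel, 100 iterations) are made up for illustration, and it assumes each step waits for both the copy and the kernel before moving on.

```python
def serial_time(copy_ms, kernel_ms, iterations):
    """Total time when every copy and kernel run back to back (the sum)."""
    return iterations * (copy_ms + kernel_ms)


def overlapped_time(copy_ms, kernel_ms, iterations):
    """Total time for an ideal double-buffered pipeline.

    The first copy and the last kernel cannot overlap with anything.
    In between, the kernel on buffer i runs alongside the copy into
    buffer i+1, so each of those steps costs max(copy, kernel).
    """
    if iterations == 0:
        return 0.0
    steady_steps = iterations - 1
    return copy_ms + steady_steps * max(copy_ms, kernel_ms) + kernel_ms


if __name__ == "__main__":
    # Hypothetical figures, not measurements.
    print("serial:    ", serial_time(4.0, 6.0, 100), "ms")
    print("overlapped:", overlapped_time(4.0, 6.0, 100), "ms")
```

If the measured loop time tracks the first function rather than the second, the copy and the kernel are being serialized somewhere, which matches what I am seeing.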

Does anyone have DMA and kernel execution running in parallel?