The docs say DMA and kernel execution can run asynchronously. While I can queue them and then do other work on the CPU, I'm observing that the total execution time is the sum of the two times rather than the max. So in my case they are not running in parallel. But this should be possible, no?
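A minimal Python sketch of that sum-versus-max check, in case it helps anyone compare numbers. The function name and the timings are hypothetical and for illustration only; nothing here calls CAL, it only does the arithmetic on three wall-clock measurements (copy alone, kernel alone, both queued together).

```python
def overlap_fraction(t_copy, t_kernel, t_total):
    """Return how much of the shorter operation was hidden behind the longer one.

    0.0 means fully serialized (t_total == t_copy + t_kernel),
    1.0 means fully overlapped (t_total == max(t_copy, t_kernel)).
    """
    shorter = min(t_copy, t_kernel)
    if shorter <= 0:
        raise ValueError("both timings must be positive")
    # Time saved compared with running the two operations back to back.
    hidden = t_copy + t_kernel - t_total
    # Clamp to [0, 1] so timer noise cannot produce out-of-range values.
    return max(0.0, min(1.0, hidden / shorter))


# Illustrative numbers only (milliseconds), not real measurements.
serialized = overlap_fraction(4.0, 10.0, 14.0)  # total equals the sum
overlapped = overlap_fraction(4.0, 10.0, 10.0)  # total equals the max
print(serialized, overlapped)
```

A value near 0 corresponds to the serialized behaviour described above, and a value near 1 would indicate that the copy is being hidden behind the kernel.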
I understand from the docs that the DMA might need a second context with its own command queue (I have tried with and without), but that did not help. The memory resource to be MemCopied is bound as a memory object in both contexts, though it is not bound to the kernel at the time of the MemCopy (I'm using a double-buffering system) ... might this prevent parallel execution? I do this because memory objects must be preallocated, since allocating them inside the loop is too expensive.
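To make the double-buffering pattern concrete, here is a rough Python sketch of the schedule only, assuming two preallocated buffers indexed 0 and 1. The function name is hypothetical and no CAL calls are involved; it just shows which buffer would receive the copy and which one the kernel would read at each step.

```python
def double_buffer_schedule(num_chunks):
    """Yield (step, copy_into, compute_on) for a two-buffer ping-pong loop.

    copy_into is the buffer index receiving the next chunk, or None when no
    chunk is left to upload; compute_on is the buffer index the kernel reads,
    or None during the initial fill step.
    """
    for step in range(num_chunks + 1):
        copy_into = step % 2 if step < num_chunks else None
        compute_on = (step - 1) % 2 if step > 0 else None
        yield step, copy_into, compute_on


schedule = list(double_buffer_schedule(3))
for step, copy_into, compute_on in schedule:
    # The two indices never match, so the copy target is never the buffer
    # that the kernel is reading during the same step.
    print(step, copy_into, compute_on)
```

In every step where both a copy and a kernel are issued, they touch different buffers, which is the situation in which overlap would be expected.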
Does anyone have DMA and kernel execution running in parallel?
I guess that after Catalyst 9.9 it was simply disabled at the driver level. I had a lot of problems with calMemCopy before 9.9 -- it locked up frequently if the CCC panel was open while my program was running. After 9.9 there are no lock-ups, but it looks like that was achieved simply by forcing calMemCopy to wait for kernel completion.
Tbh, there are many problems with asynchronous function calls in CAL. I've just run into the (undocumented) fact that calCtxWaitForEvent blocks every currently active CAL context, not just the one it takes as an argument. Same situation with calResMap() -- it waits for any kernel to complete, even if you're trying to map memory on one device while a kernel is running on another (with 2+ GPUs in the system).
I assume you are using threads. Have you tried MPI processes? I'm using python multiprocessing ... perhaps if the calMemCopy call gets its own context and its own process?
AMD/ATI, could you let us know what your test suite says? 😉
Originally posted by: emuller I assume you are using threads. Have you tried MPI processes? I'm using python multiprocessing ... perhaps if the calMemCopy call gets its own context and its own process?
Yeah, I'm using a multi-threaded win32 application, creating a separate thread for each GPU (and a separate context too, ofc). Running several processes is OK; they have no problem working asynchronously with different GPUs... which makes me wonder why ATI failed to implement this within a single process.
For me the workaround was to use pinned memory, as my algorithms don't require massive memory transfers -- it solves everything.
Anyway, for calMemCopy we need to be in the same process as the kernel invocation routine itself, otherwise we cannot access memory transferred in another GPU context at all, and what's the point of transferring it "nowhere"? ...Or am I missing something?