/* Ping-pong pattern on a single in-order queue: each write waits (via its
 * event) on the previous kernel that consumed its buffer, and each kernel
 * waits on the write that filled its buffer. */
clEnqueueWriteBuffer(q, buf2, CL_FALSE, 0, LEN * sizeof(cl_float), ptr2, 0, NULL, &e2write);
clEnqueueWriteBuffer(q, buf1, CL_FALSE, 0, LEN * sizeof(cl_float), ptr1, 0, NULL, &e1write);
clEnqueueNDRangeKernel(q, kern2, sizeof(global_dims) / sizeof(size_t), NULL, global_dims, local_dims, 1, &e2write, &e2kern);
for (i = 0; i < 32; i++) {
    clEnqueueWriteBuffer(q, buf2, CL_FALSE, 0, LEN * sizeof(cl_float), ptr2, 1, &e2kern, &e2write);
    clEnqueueNDRangeKernel(q, kern1, sizeof(global_dims) / sizeof(size_t), NULL, global_dims, local_dims, 1, &e1write, &e1kern);
    clEnqueueWriteBuffer(q, buf1, CL_FALSE, 0, LEN * sizeof(cl_float), ptr1, 1, &e1kern, &e1write);
    clEnqueueNDRangeKernel(q, kern2, sizeof(global_dims) / sizeof(size_t), NULL, global_dims, local_dims, 1, &e2write, &e2kern);
}
clWaitForEvents(1, &e1kern);
clWaitForEvents(1, &e2kern);
To overlap those commands you would need to enable an out-of-order queue, which is currently unsupported. You can try using two command queues instead, but even then execution can be serialized.
Do I take it right that DMA, in its current scope, is only useful for keeping the OS out of data movement, but cannot be used to parallelize computation and data copying?
By unsupported, do you mean that it works but is not official, or that it doesn't work at all? I know that devices do not support out-of-order execution, but I thought that was only about putting multiple kernels and buffer transfers into the queue and letting the runtime pick work depending on hardware availability and the event dependencies inside the queue.
No. When a queue is not out-of-order it must be in-order, which means all commands enqueued into it execute in order: if you enqueue A, B, C, they run as A, B, C with no overlap; A finishes before B starts and B finishes before C starts.
So for parallel execution within one queue you need an out-of-order queue. It is not stated explicitly in the specification, but it follows from it.
That is why I wrote to try two command queues. But even then the AMD implementation can serialize command execution.
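The two-queue workaround mentioned above can be sketched roughly as below. This is a minimal illustration, not tested against any particular driver; all names (ctx, dev, kern, buf, ...) are placeholders, error checking is omitted, and whether the transfer actually overlaps other work is up to the implementation, which may still serialize it.

```c
#include <CL/cl.h>

/* Sketch: one queue for copies, one for kernels, on the same context
 * and device. Correctness is enforced by the event, not queue order. */
void two_queue_pipeline(cl_context ctx, cl_device_id dev,
                        cl_kernel kern, cl_mem buf,
                        const float *host, size_t len, size_t gws)
{
    cl_command_queue copy_q = clCreateCommandQueue(ctx, dev, 0, NULL);
    cl_command_queue exec_q = clCreateCommandQueue(ctx, dev, 0, NULL);
    cl_event written, ran;

    /* Non-blocking write on the copy queue... */
    clEnqueueWriteBuffer(copy_q, buf, CL_FALSE, 0, len * sizeof(float),
                         host, 0, NULL, &written);
    /* ...and a kernel on the exec queue that waits on the write via its
     * event. The hope is that the driver overlaps this transfer with
     * work already running on exec_q; that is implementation-defined. */
    clEnqueueNDRangeKernel(exec_q, kern, 1, NULL, &gws, NULL,
                           1, &written, &ran);
    clWaitForEvents(1, &ran);

    clReleaseEvent(written);
    clReleaseEvent(ran);
    clReleaseCommandQueue(copy_q);
    clReleaseCommandQueue(exec_q);
}
```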
Currently the only way to hide buffer transfer time is to insert useful CPU computation:
EnqueueWrite(buf1)
CPU Work
EnqueueKernel(kern2)
EnqueueWrite(buf2)
CPU Work
EnqueueKernel(kern1)
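In host code, the interleaving above looks roughly like this (a sketch with placeholder names; cpu_work() stands for whatever useful host-side work you have). Because the writes are non-blocking, the CPU work runs while the runtime moves the data:

```c
#include <CL/cl.h>

extern void cpu_work(void);  /* hypothetical useful host-side work */

void hide_transfer(cl_command_queue q, cl_kernel kern1, cl_kernel kern2,
                   cl_mem buf1, cl_mem buf2,
                   const float *p1, const float *p2,
                   size_t len, size_t gws)
{
    /* EnqueueWrite(buf1): non-blocking, returns immediately */
    clEnqueueWriteBuffer(q, buf1, CL_FALSE, 0, len * sizeof(float),
                         p1, 0, NULL, NULL);
    cpu_work();                                 /* overlaps the transfer */

    /* EnqueueKernel(kern2) */
    clEnqueueNDRangeKernel(q, kern2, 1, NULL, &gws, NULL, 0, NULL, NULL);

    /* EnqueueWrite(buf2): again non-blocking */
    clEnqueueWriteBuffer(q, buf2, CL_FALSE, 0, len * sizeof(float),
                         p2, 0, NULL, NULL);
    cpu_work();                                 /* more overlapped work */

    /* EnqueueKernel(kern1) */
    clEnqueueNDRangeKernel(q, kern1, 1, NULL, &gws, NULL, 0, NULL, NULL);

    clFinish(q);                                /* drain the queue */
}
```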
I would really appreciate another way to hide transfer time. Device fission, perhaps?
You could try setting GPU_USE_SYNC_OBJECTS=1, or another undocumented hint. Maybe it helps you.
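For what it's worth, such runtime hints are just environment variables set before launching the application. A trivial sketch (the echo stands in for launching your OpenCL program; GPU_USE_SYNC_OBJECTS is the undocumented AMD hint named above):

```shell
export GPU_USE_SYNC_OBJECTS=1
# then run your OpenCL application in this shell, e.g. ./myapp
echo "GPU_USE_SYNC_OBJECTS=$GPU_USE_SYNC_OBJECTS"
```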
Could you copy-paste the other non-documented environment variables you found useful?
Originally posted by: omion
I just found the CAL documentation, which says specifically that DMA transfers can execute asynchronously from kernel execution. From section 4.2.3: "This DMA unit can run asynchronously from the rest of the stream processor, allowing parallel data transfers when the SIMD engine is busy running a previous stream kernel." So I know the GPU and CAL support it, but for some reason OpenCL doesn't.

So now a direct question to AMD: will this be supported in OpenCL?
It's being worked on. There are restrictions on what sorts of transfers can be handled with DMA, so not all transfers can be done asynchronously.
Jeff
So,
what is the state of exposing Cayman's dual DMA engines? Will they work at the same time, the way single DMA transfers do?
Also, some programmer info/guidance on when we can expect the DMA engines to be usable would be appreciated.
thanks.