
omion
Journeyman III

Trying to get asynchronous transfer from SDK 2.3

I heard that SDK 2.3 supports DMA transfers, which I gather will allow data to be transferred to/from the GPU while it is working on a kernel. (right?)

I am now trying to verify that it works, but nothing I do shows any overlap.

[edit: I have an HD 5850, running with Catalyst 10.12 and of course Stream SDK 2.3]

My basic idea is to have 2 buffers (buf1,buf2) with a kernel for each (kern1,kern2) and do a loop with something like this:
...
EnqueueWrite(buf1)
EnqueueKernel(kern2)
EnqueueWrite(buf2)
EnqueueKernel(kern1)
...

This way buf1 can be transferred while kern2 is running, and buf2 can be transferred while kern1 is running. However, when I compare against a fully synchronous version (blocking writes, and waiting for each kernel to finish right after it starts), it takes almost exactly the same amount of time as the version with async transfers and no waiting.

The stream timeline in the Stream Profiler shows the GPU switching between transfer and execution, with no overlap.

The exact program is a bit hairy, but I attached the actual execution part of it. (In the real code I check the return value of everything; LEN is the length of each buffer.) Am I doing anything wrong?

clEnqueueWriteBuffer(q, buf2, CL_FALSE, 0, LEN * sizeof(cl_float), ptr2, 0, NULL, &e2write);
clEnqueueWriteBuffer(q, buf1, CL_FALSE, 0, LEN * sizeof(cl_float), ptr1, 0, NULL, &e1write);
clEnqueueNDRangeKernel(q, kern2, sizeof(global_dims) / sizeof(size_t), NULL, global_dims, local_dims, 1, &e2write, &e2kern);
for (i = 0; i < 32; i++) {
    clEnqueueWriteBuffer(q, buf2, CL_FALSE, 0, LEN * sizeof(cl_float), ptr2, 1, &e2kern, &e2write);
    clEnqueueNDRangeKernel(q, kern1, sizeof(global_dims) / sizeof(size_t), NULL, global_dims, local_dims, 1, &e1write, &e1kern);
    clEnqueueWriteBuffer(q, buf1, CL_FALSE, 0, LEN * sizeof(cl_float), ptr1, 1, &e1kern, &e1write);
    clEnqueueNDRangeKernel(q, kern2, sizeof(global_dims) / sizeof(size_t), NULL, global_dims, local_dims, 1, &e2write, &e2kern);
}
clWaitForEvents(1, &e1kern);
clWaitForEvents(1, &e2kern);

nou
Exemplar

You must enable an out-of-order queue, which is currently unsupported. You can try using two command queues instead, but even then execution can be serialized.


Do I take it right that DMA, in its current scope, is only useful for avoiding OS involvement in data movement, but it cannot be used to parallelize computation and data copying?

By unsupported, do you mean that it works but not officially, or that it doesn't work at all? I know that devices do not support out-of-order execution, but I thought that only meant putting multiple kernels and buffer transfers into the queue and letting the runtime pick work based on hardware availability and the event dependencies inside the queue.


No. When a queue is not out-of-order, then it must be in-order. That means all commands enqueued into the queue are executed in order: if you enqueue A, B, C, then it will execute A, B, C with no overlap, so A ends before B starts and B ends before C starts.

So for parallel execution in one queue you need an out-of-order queue. It is not stated explicitly in the specification, but it follows from it.

That is why I wrote: try using two command queues. But even then the AMD implementation can serialize execution of commands.
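A minimal sketch of that two-queue idea, reusing the context, buffers, kernels, and host pointers from the original post (the `run_two_queues` wrapper and the 1-D NDRange are my own simplifications, and, as noted above, the driver may still serialize the two queues):

```c
#include <CL/cl.h>

/* Hypothetical helper: drive buf1/kern1 and buf2/kern2 from two in-order
   queues so the runtime is at least free to overlap a DMA transfer on one
   queue with kernel execution on the other. All arguments are assumed to
   be set up exactly as in the original post. */
static void run_two_queues(cl_context ctx, cl_device_id dev,
                           cl_mem buf1, cl_mem buf2,
                           cl_kernel kern1, cl_kernel kern2,
                           cl_float *ptr1, cl_float *ptr2, size_t len,
                           size_t *global_dims, size_t *local_dims)
{
    cl_command_queue q1 = clCreateCommandQueue(ctx, dev, 0, NULL);
    cl_command_queue q2 = clCreateCommandQueue(ctx, dev, 0, NULL);
    cl_event w1, w2, k1, k2;
    int i;

    clEnqueueWriteBuffer(q1, buf1, CL_FALSE, 0, len * sizeof(cl_float),
                         ptr1, 0, NULL, &w1);
    clEnqueueWriteBuffer(q2, buf2, CL_FALSE, 0, len * sizeof(cl_float),
                         ptr2, 0, NULL, &w2);

    for (i = 0; i < 32; i++) {
        /* Each queue's kernel depends only on that queue's own write, so
           the other queue's transfer may run while this kernel executes.
           (1-D NDRange for simplicity.) */
        clEnqueueNDRangeKernel(q1, kern1, 1, NULL, global_dims, local_dims,
                               1, &w1, &k1);
        clReleaseEvent(w1);
        clEnqueueNDRangeKernel(q2, kern2, 1, NULL, global_dims, local_dims,
                               1, &w2, &k2);
        clReleaseEvent(w2);
        clEnqueueWriteBuffer(q1, buf1, CL_FALSE, 0, len * sizeof(cl_float),
                             ptr1, 1, &k1, &w1);
        clReleaseEvent(k1);
        clEnqueueWriteBuffer(q2, buf2, CL_FALSE, 0, len * sizeof(cl_float),
                             ptr2, 1, &k2, &w2);
        clReleaseEvent(k2);
        clFlush(q1);   /* submit so both queues can make progress */
        clFlush(q2);
    }
    clFinish(q1);
    clFinish(q2);
    clReleaseEvent(w1);
    clReleaseEvent(w2);
    clReleaseCommandQueue(q1);
    clReleaseCommandQueue(q2);
}
```

The clFlush after each iteration matters: without it, commands may sit in the host-side queue and only get submitted in a batch, which hides any chance of overlap.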

omion
Journeyman III

nou, thanks for the quick reply. You are exactly right about out-of-order queues. I read the spec again and it says that anything enqueued after a transfer can expect the transfer to have completed, except in an out-of-order queue. (I originally thought commands had to be started in order but could overlap; that's wrong.)

I tried using 2 queues, but the transfers still don't happen asynchronously (it seems to run one queue completely, then switch over to the other). I even used 2 contexts with one queue in each, but that did the same thing.

I haven't tried using 2 completely separate host threads to handle the queues, but it looks like simultaneous transfer/execute is just not supported yet...

Oh well. Guess I'll hope for support in 2.4. (It'd be nice to hear from AMD whether this functionality is even on the roadmap)

Currently the only way to hide buffer transfer time is to insert useful CPU computation:

EnqueueWrite(buf1)
CPU Work
EnqueueKernel(kern2)
EnqueueWrite(buf2)
CPU Work
EnqueueKernel(kern1)

I would really appreciate another way to hide transfer time. Device fission?
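The pattern above, sketched against the buffers and kernels from the original post (do_cpu_work() is a hypothetical host-side function; the clFlush calls are there so the asynchronous write is actually submitted before the CPU work starts):

```c
/* Hide the transfer behind host computation: start the async write,
   flush so the DMA is actually submitted, do CPU work while the
   transfer runs, then enqueue the kernel that depends on the write. */
cl_event w1, w2;

clEnqueueWriteBuffer(q, buf1, CL_FALSE, 0, LEN * sizeof(cl_float),
                     ptr1, 0, NULL, &w1);
clFlush(q);                 /* make sure the write is in flight */
do_cpu_work();              /* hypothetical useful host-side work */
clEnqueueNDRangeKernel(q, kern2, 1, NULL, global_dims, local_dims,
                       1, &w1, NULL);

clEnqueueWriteBuffer(q, buf2, CL_FALSE, 0, LEN * sizeof(cl_float),
                     ptr2, 0, NULL, &w2);
clFlush(q);
do_cpu_work();
clEnqueueNDRangeKernel(q, kern1, 1, NULL, global_dims, local_dims,
                       1, &w2, NULL);

clFinish(q);
clReleaseEvent(w1);
clReleaseEvent(w2);
```

Note that with a single in-order queue the transfer still cannot overlap the GPU kernel; it only overlaps the CPU work, which is exactly the limitation being discussed.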


You could try setting GPU_USE_SYNC_OBJECTS=1 or another undocumented hint. Maybe it will help you.
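For what it's worth, hints like this are read from the environment when the runtime starts up, so they have to be exported before the OpenCL program is launched (for example from the shell or a wrapper script):

```shell
# The AMD OpenCL runtime reads these undocumented hints from the
# environment at startup, so export them before running the program.
export GPU_USE_SYNC_OBJECTS=1
env | grep GPU_USE_SYNC_OBJECTS    # confirm child processes will see it
```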


@zeland:
I just tried GPU_USE_SYNC_OBJECTS=1, with no luck. None of the other undocumented environment settings looked applicable, either.

@Rom1:
I have been able to get the CPU to do things at the same time, but the GPU side is not so cooperative. My plan was to have my program do this:
step 1: GPU runs kernel on buf1, CPU updates host copy of buf1, buf2 is sent to GPU
step 2: GPU runs kernel on buf2, CPU updates host copy of buf2, buf1 is sent to GPU
repeat...

All 3 things listed in each step should be able to run simultaneously, but currently the GPU waits for the next buffer to finish transferring before it starts crunching numbers again.

Could you copy-paste the other undocumented env variables you found unhelpful?


I looked through the list that nou posted on this thread. The only other one I tried from that list was GPU_ZERO_COPY_ENABLE=1 (no idea if the "=1" part is even right). All the other options didn't look like they'd help.

I just found the CAL documentation, which says specifically that DMA transfers can execute asynchronously from the kernel execution. From section 4.2.3:

This DMA unit can run asynchronously from the rest of the stream processor, allowing parallel data transfers when the SIMD engine is busy running a previous stream kernel.
So I know the GPU and CAL support it, but for some reason OpenCL doesn't.

So now a direct question to AMD: will this be supported in OpenCL?

Originally posted by: omion I just found the CAL documentation, which says specifically that DMA transfers can execute asynchronously from the kernel execution. From section 4.2.3:
This DMA unit can run asynchronously from the rest of the stream processor, allowing parallel data transfers when the SIMD engine is busy running a previous stream kernel.
So I know the GPU and CAL support it, but for some reason OpenCL doesn't. So now a direct question to AMD: will this be supported in OpenCL?


It's being worked on. There are restrictions on what sorts of transfers can be handled with DMA, so not all transfers can be done asynchronously.

Jeff


So,

what is the state of exposing the dual DMA engines in Cayman? Will they work at the same time, the way single DMA transfers do?

Also, some programmer info/guidance on when we can expect to put the DMA engines to work would be welcome.

Thanks.


