12 Replies Latest reply on Jan 6, 2011 8:03 PM by oscarbarenys1

    Trying to get asynchronous transfer from SDK 2.3

    omion
      I heard that SDK 2.3 supports DMA transfers, which I gather will allow data to be transferred to/from the GPU while it is working on a kernel. (right?)

      I am now trying to see whether it actually works, but nothing I do shows any overlap.

      [edit: I have an HD 5850, running with Catalyst 10.12 and of course Stream SDK 2.3]

      My basic idea is to have 2 buffers (buf1,buf2) with a kernel for each (kern1,kern2) and do a loop with something like this:
      ...
      EnqueueWrite(buf1)
      EnqueueKernel(kern2)
      EnqueueWrite(buf2)
      EnqueueKernel(kern1)
      ...

      This way buf1 can be transferred while kern2 is running, and buf2 can be transferred while kern1 is running. However, if I instead do the writes synchronously and wait for each kernel to finish right after launching it, it takes almost exactly the same amount of time as the asynchronous version with no waiting.

      The stream timeline in the Stream Profiler shows the GPU switching between transfer and execution, with no overlap.

      The exact program is a bit hairy, but I attached the actual execution part of it. (In the real code, I check the return for everything. LEN is the length of each buffer) Am I doing anything wrong?

      clEnqueueWriteBuffer(q, buf2, CL_FALSE, 0, LEN * sizeof(cl_float), ptr2, 0, NULL, &e2write);
      clEnqueueWriteBuffer(q, buf1, CL_FALSE, 0, LEN * sizeof(cl_float), ptr1, 0, NULL, &e1write);
      clEnqueueNDRangeKernel(q, kern2, sizeof(global_dims) / sizeof(size_t), NULL, global_dims, local_dims, 1, &e2write, &e2kern);
      for (i = 0; i < 32; i++) {
          clEnqueueWriteBuffer(q, buf2, CL_FALSE, 0, LEN * sizeof(cl_float), ptr2, 1, &e2kern, &e2write);
          clEnqueueNDRangeKernel(q, kern1, sizeof(global_dims) / sizeof(size_t), NULL, global_dims, local_dims, 1, &e1write, &e1kern);
          clEnqueueWriteBuffer(q, buf1, CL_FALSE, 0, LEN * sizeof(cl_float), ptr1, 1, &e1kern, &e1write);
          clEnqueueNDRangeKernel(q, kern2, sizeof(global_dims) / sizeof(size_t), NULL, global_dims, local_dims, 1, &e2write, &e2kern);
      }
      clWaitForEvents(1, &e1kern);
      clWaitForEvents(1, &e2kern);

        • Trying to get asynchronous transfer from SDK 2.3
          nou

          You must enable an out-of-order queue, which is currently unsupported. You can try using two command queues instead, but even then execution can be serialized.

            • Trying to get asynchronous transfer from SDK 2.3
              Meteorhead

              Do I take it right that DMA, in its current scope, is only useful for avoiding OS involvement in data movement, and cannot be used to parallelize computation and data copying?

              By unsupported, do you mean it works but not officially, or that it doesn't work at all? I know that devices do not support out-of-order execution, but I thought that was only about putting multiple kernels and buffer movements into the queue, with the runtime picking workloads depending on hardware availability and event dependencies inside the queue.

                • Trying to get asynchronous transfer from SDK 2.3
                  nou

                  No. When a queue is not out-of-order, it must be in-order, which means all commands enqueued into the queue are executed in order. So when you enqueue A, B, C, it will execute A, B, C without overlap: A ends before B starts, and B ends before C starts.

                  So for parallel execution within one queue you need an out-of-order queue. It is not stated explicitly in the specification, but it is there.
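
                  Something like this is how you would request one (a sketch; ctx and dev are your existing context and device):

                  cl_int err;
                  cl_command_queue q = clCreateCommandQueue(ctx, dev,
                          CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
                  if (err == CL_INVALID_QUEUE_PROPERTIES) {
                      /* runtime refused out-of-order; fall back to in-order */
                      q = clCreateCommandQueue(ctx, dev, 0, &err);
                  }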

                  That is why I wrote to try using two command queues. But even then the AMD implementation can serialize execution of the commands.
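
                  For example, a rough sketch reusing omion's buffers, kernels and events, with transfers in one in-order queue and kernels in another (the runtime may still serialize the two queues, and event releases are omitted as in the original):

                  cl_command_queue q_xfer = clCreateCommandQueue(ctx, dev, 0, &err); /* transfers */
                  cl_command_queue q_exec = clCreateCommandQueue(ctx, dev, 0, &err); /* kernels   */

                  clEnqueueWriteBuffer(q_xfer, buf2, CL_FALSE, 0, LEN * sizeof(cl_float), ptr2, 0, NULL, &e2write);
                  clEnqueueWriteBuffer(q_xfer, buf1, CL_FALSE, 0, LEN * sizeof(cl_float), ptr1, 0, NULL, &e1write);
                  clEnqueueNDRangeKernel(q_exec, kern2, sizeof(global_dims) / sizeof(size_t), NULL, global_dims, local_dims, 1, &e2write, &e2kern);
                  for (i = 0; i < 32; i++) {
                      /* each write waits on the kernel that last read its buffer,
                         but sits in q_xfer so it can overlap the other kernel */
                      clEnqueueWriteBuffer(q_xfer, buf2, CL_FALSE, 0, LEN * sizeof(cl_float), ptr2, 1, &e2kern, &e2write);
                      clEnqueueNDRangeKernel(q_exec, kern1, sizeof(global_dims) / sizeof(size_t), NULL, global_dims, local_dims, 1, &e1write, &e1kern);
                      clEnqueueWriteBuffer(q_xfer, buf1, CL_FALSE, 0, LEN * sizeof(cl_float), ptr1, 1, &e1kern, &e1write);
                      clEnqueueNDRangeKernel(q_exec, kern2, sizeof(global_dims) / sizeof(size_t), NULL, global_dims, local_dims, 1, &e2write, &e2kern);
                      clFlush(q_xfer); clFlush(q_exec);
                  }
                  clWaitForEvents(1, &e1kern);
                  clWaitForEvents(1, &e2kern);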

                    • Trying to get asynchronous transfer from SDK 2.3
                      omion
                      nou, thanks for the quick reply. You are exactly right about out-of-order queues. I read the spec again, and it says that anything enqueued after a transfer can expect the transfer to have completed, except in an out-of-order queue. (I originally thought commands had to be started in order but could overlap; that's wrong.)

                      I tried using 2 queues, but that still doesn't happen asynchronously (it seems to do one queue completely, then switch over to the other). I even used 2 contexts with one queue in each, but that did the same thing.

                      I haven't tried using 2 completely separate host threads to handle the queues, but it looks like simultaneous transfer/execute is just not supported yet...

                      Oh well. Guess I'll hope for support in 2.4. (It'd be nice to hear from AMD whether this functionality is even on the roadmap)
                        • Trying to get asynchronous transfer from SDK 2.3
                          Rom1

                          Currently the only way to hide buffer transfer time is to insert useful CPU computation (see the code sketch below):

                          EnqueueWrite(buf1)
                          CPU Work
                          EnqueueKernel(kern2)

                          EnqueueWrite(buf2)
                          CPU Work
                          EnqueueKernel(kern1)

                          I would really appreciate another way to hide transfer time. Device fission?
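
                          In OpenCL terms, reusing omion's names from above, the pattern is roughly this (a sketch; clFlush() pushes the enqueued commands to the device so the host is free to compute, and do_cpu_work() is a hypothetical placeholder):

                          clEnqueueWriteBuffer(q, buf1, CL_FALSE, 0, LEN * sizeof(cl_float), ptr1, 0, NULL, NULL);
                          clFlush(q);                /* start the transfer on the device */
                          do_cpu_work();             /* useful CPU work overlaps the transfer */
                          clEnqueueNDRangeKernel(q, kern2, sizeof(global_dims) / sizeof(size_t), NULL, global_dims, local_dims, 0, NULL, NULL);

                          clEnqueueWriteBuffer(q, buf2, CL_FALSE, 0, LEN * sizeof(cl_float), ptr2, 0, NULL, NULL);
                          clFlush(q);
                          do_cpu_work();
                          clEnqueueNDRangeKernel(q, kern1, sizeof(global_dims) / sizeof(size_t), NULL, global_dims, local_dims, 0, NULL, NULL);
                          clFinish(q);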

                            • Trying to get asynchronous transfer from SDK 2.3
                              zeland

                              You could try setting GPU_USE_SYNC_OBJECTS=1, or another undocumented hint. Maybe it will help you.
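
                              For example, export it in the shell before launching, or set it in C before the first OpenCL call (a sketch; setenv is POSIX, use _putenv on Windows):

                              #include <stdlib.h>

                              /* set before any OpenCL call so the runtime sees it
                                 when it initializes */
                              setenv("GPU_USE_SYNC_OBJECTS", "1", 1);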

                                • Trying to get asynchronous transfer from SDK 2.3
                                  omion
                                  @zeland:
                                  I just tried GPU_USE_SYNC_OBJECTS=1, with no luck. None of the other undocumented environment settings looked applicable, either.

                                  @Rom1:
                                  I have been able to get the CPU to do things at the same time, but the GPU side is not so cooperative. My plan was to have my program do this:
                                  step 1: GPU runs kernel on buf1, CPU updates host copy of buf1, buf2 is sent to GPU
                                  step 2: GPU runs kernel on buf2, CPU updates host copy of buf2, buf1 is sent to GPU
                                  repeat...

                                  All three things in each step should be able to happen simultaneously, but currently the GPU waits for the next buffer to be transferred before it starts crunching numbers again.
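
                                  With a transfer queue and an execute queue as suggested above, one step would look roughly like this (a sketch; q_xfer/q_exec are the two queues from nou's suggestion, and update_host_copy() is a hypothetical stand-in for the CPU-side update):

                                  /* step 1: kern1 crunches buf1 in q_exec while buf2 uploads
                                     in q_xfer and the CPU updates its host copy of buf1 */
                                  clEnqueueNDRangeKernel(q_exec, kern1, sizeof(global_dims) / sizeof(size_t),
                                                         NULL, global_dims, local_dims, 1, &e1write, &e1kern);
                                  clEnqueueWriteBuffer(q_xfer, buf2, CL_FALSE, 0, LEN * sizeof(cl_float),
                                                       ptr2, 0, NULL, &e2write);
                                  clFlush(q_exec); clFlush(q_xfer);   /* push both to the device */
                                  update_host_copy(ptr1);             /* CPU work overlaps the GPU */
                                  clWaitForEvents(1, &e1kern);        /* end of step 1; swap roles and repeat */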