
    Is it possible for the host to read/write a buffer written/read by the GPU kernel while the kernel is running?

    maelteir

      Hi all,

I am trying to write code that requires collaboration between the host and the device. Specifically:

      1- The device sets a variable X to a non-zero value and waits for Y to be non-zero.

2- The host reads X; when it becomes non-zero, the host sets Y to a non-zero value.

      Below is the code running on the GPU and CPU.

This code hangs, so I assume that the host cannot read or write variables manipulated by the GPU kernel while the kernel is running. Is that true?

Thank you.

-------------------- GPU kernel code --------------------

int tid = get_global_id(0);
Y[tid] = 0;
X[tid] = tid;
while (1) {
    if (Y[tid] > 0)
        break;
}

----------- CPU code -----------

for (int i = 0; i < numThreads; i++)
    Y[i] = 0;
count = 0;
while (1) {
    clEnqueueReadBuffer(..., X, ...);
    for (int i = 0; i < numThreads; i++) {
        if (Y[i] == 0) {
            if (X[i] > 0) {
                Y[i] = 1;
                count++;
            }
        }
    }
    if (count > 0)
        clEnqueueWriteBuffer(..., Y, ...);
    if (count == numThreads)
        break; // All threads are done
}

          notzed

Not the way you've done it, and I doubt it's possible (at least, not portably). I'd also question whether it's even desirable.

The memory model allows Y[tid] to be cached in a register for the whole run of the kernel. Even with a global barrier() and, presumably, making it volatile, it could still be cached in the local work-group's multi-processor (MP) (my reading of section 3.3.1 of the spec).
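For illustration, a minimal sketch of the volatile variant (a hypothetical kernel: volatile only stops the compiler from keeping Y[tid] in a register; per the caveat above, a host write may still never become visible to the spinning work-item):

    __kernel void spin_wait(__global int *X,
                            volatile __global int *Y)
    {
        int tid = get_global_id(0);
        Y[tid] = 0;           // volatile: every access is a real memory access...
        X[tid] = tid;
        while (Y[tid] == 0)
            ;                 // ...but "memory" may be a per-MP cache, so a
                              // concurrent host write is still not guaranteed visible
    }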

Atomic functions are coherent across MPs, but I don't know whether that coherence extends to device-memory writes made by the host.
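As a hedged sketch of that idea (kernel name made up; OpenCL 1.x global int atomics such as atomic_xchg and atomic_or do exist, but the spec only promises coherence between work-items on the device, not with host writes issued while the kernel runs):

    __kernel void spin_wait_atomic(__global int *X, __global int *Y)
    {
        int tid = get_global_id(0);
        atomic_xchg(&Y[tid], 0);              // atomic store of 0
        atomic_xchg(&X[tid], tid);            // atomic store of tid
        while (atomic_or(&Y[tid], 0) == 0)    // or-with-0: an atomic read, returns the old value
            ;                                 // coherent across MPs; host writes mid-kernel
                                              // remain outside the guarantee
    }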

Not sure about async_work_group_copy() either.

Also, queues are in-order by default (and in all implementations?), so you can't invoke a kernel and then a memory copy: the memory copy will (appear to) wait until the kernel is done before executing. You could try multiple queues, though (see the sketch below).
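For what it's worth, a hedged host-side sketch of the two-queue idea (ctx, dev, kernel, bufX, hostX, gws and numThreads are placeholders; error checking omitted; whether the read actually observes stores from the still-running kernel is implementation-defined):

    cl_int err;
    cl_command_queue kq = clCreateCommandQueue(ctx, dev, 0, &err); /* kernel queue */
    cl_command_queue mq = clCreateCommandQueue(ctx, dev, 0, &err); /* memory queue */

    /* Launch the spinning kernel on one queue... */
    clEnqueueNDRangeKernel(kq, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clFlush(kq); /* make sure it is actually submitted, not just queued */

    /* ...and poll X from the other, so the read is not ordered behind the kernel. */
    clEnqueueReadBuffer(mq, bufX, CL_TRUE, 0,
                        numThreads * sizeof(cl_int), hostX, 0, NULL, NULL);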

The whole design looks very suspect to start with: why have one thread on the CPU make per-thread dispatch decisions when there are potentially thousands of threads on the other end that can make the decision themselves, without an extremely expensive synchronisation on every iteration?
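To make that concrete, a sketch of the alternative (the condition simply mirrors the X[i] > 0 test from the original host loop): each work-item makes its own decision from data already on the device, so one kernel launch replaces the entire polling protocol.

    __kernel void decide(__global const int *X, __global int *Y)
    {
        int tid = get_global_id(0);
        Y[tid] = (X[tid] > 0) ? 1 : 0;   /* per-work-item decision, no host round trip */
    }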