Hi all,
I am trying to write code that needs collaboration between the host and device. Specifically:
1- The device sets a variable X to a non-zero value and waits for Y to be non-zero.
2- The host reads X; when it becomes non-zero, it sets Y to a non-zero value.
Below is the code running on the GPU and CPU.
This code hangs, so I assume that reading and writing by the host to variables manipulated by the GPU kernel is not possible while the kernel is running. Is that true?
Thank you.
-------------------- GPU kernel code --------------------
int tid = get_global_id(0);
Y[tid] = 0;
X[tid] = tid;
while (1) {
    if (Y[tid] > 0)
        break;
}
----------- CPU code -----------
for (int i = 0; i < numThreads; i++)
    Y[i] = 0;
count = 0;
while (1) {
    clEnqueueReadBuffer(..., X, ....)
    for (int i = 0; i < numThreads; i++) {
        if (Y[i] == 0) {
            if (X[i] > 0) {
                Y[i] = 1;
                count++;
            }
        }
    }
    if (count > 0)
        clEnqueueWriteBuffer(..., Y, ....)
    if (count == numThreads)
        break; // All threads are done
}
Not the way you've done it, and I doubt it's possible (at least, not portably). I'd also question whether it's even desirable.
The memory model allows Y[tid] to be cached in a register for the whole run of the kernel. Even with a global barrier() and presumably making it volatile, it could still be cached in the local work-group's multi-processor (MP) (my reading of section 3.3.1 of the spec).
Atomic functions are multi-MP coherent, but I don't know if that extends to device memory writes from the host.
Not sure about async_workgroup_copy() either.
Also, queues are in-order by default (and in all implementations?), so you can't invoke a kernel and then a memory copy: the memory copy will (appear to) wait until the kernel is done before executing. You could try multiple queues, though.
The whole design looks very suspect to start with: why have one thread on the CPU make per-thread dispatch decisions when there are potentially thousands of threads on the other end that could make the decision themselves, without an extremely expensive synchronisation every iteration?
Sadly, the AMD SDK currently supports neither out-of-order processing nor parallel execution of multiple command queues on the GPU, although I think parallel execution is supported on the CPU.
maeltier,
In my opinion, it is not possible to check the value of a variable from the host while the kernel is running.
Can you describe a scenario where this would be useful?
antzehere,
I don't think out-of-order command execution can solve this issue, as it is an OpenCL spec requirement.
Thank you all for your replies.
The attached code is a simplification of what I want to do, so it may look strange.
One case where this collaboration between CPU and GPU is required is when you want to support communication across GPUs. You can take a look at this paper, which was implemented using CUDA and NVIDIA GPUs: http://academic.research.microsoft.com/Publication/4856677
By the way, I also tried using async_workgroup_copy(), and two command queues (one for the kernel, one for memory transfers), but it didn't work.
Thank you
himanshu.gautam:
Call me crazy, but from the marketing material I was led to believe that this sort of thing would be possible in future APUs. Am I wrong? The benefits are huge!
In my perfect world, IL would be able to function basically as an extension to x86, much like SSE is currently: shared RAM, shared cache, shared virtual address space, etc., but considered a separate "coprocessor", much like x87 is also a "coprocessor" (I know, I had a 486SX... it used to actually be a physically separate coprocessor).
This would mean I could have one core of an APU (with its own set of GPU-type processors and x86 processor) producing results, and another APU core consuming them, perhaps with one of the CPU cores in the middle doing something like byte-swapping, which, with SSSE3 (PSHUFB), it's incredibly good at.
That's just off the top of my head. I don't think heterogeneous computing needs to mean high overhead from switching from serial, or slightly parallel, to massively parallel and back.
I know, I've only just signed up and I'm already making demands. Just take it as a sign that I'm very excited about the future possibilities of the APU design!
__kernel void volatile_test(volatile __global uint* input)
{
    input[128] = 1;
    // GPU received signal from CPU too early
    if (input[0] == 2)
    {
        input[0] = 16;
        return;
    }
    for (uint i = 0; i < 8065975/40; i++) // roughly 0.25 seconds delay on Cayman
    {
        uint temp = input[0];
        if (temp == 2)
        {
            // GPU received signal from CPU at the correct point
            input[0] = 8;
            return;
        }
    }
    // Detect that the GPU never received the signal.
    input[0] = 4;
}