Hi all,
I am trying to write code that needs collaboration between the host and device. Specifically:
1- The device sets a variable X to a non-zero value and waits for Y to be non-zero.
2- The host reads X; when it becomes non-zero, it sets Y to a non-zero value.
Below is the code running on the GPU and CPU.
This code hangs, so I assume that reading and writing by the host to variables manipulated by the GPU kernel is not possible while the kernel is running. Is that true?
Thank you.
-------------------- GPU kernel code --------------------
int tid = get_global_id(0);
Y[tid] = 0;
X[tid] = tid;
while (1) {
    if (Y[tid] > 0)
        break;
}
----------- CPU code -----------
for (int i = 0; i < numThreads; i++)
    Y[i] = 0;
count = 0;
while (1) {
    clEnqueueReadBuffer(..., X, ....)
    for (int i = 0; i < numThreads; i++) {
        if (Y[i] == 0) {
            if (X[i] > 0) {
                Y[i] = 1;
                count++;
            }
        }
    }
    if (count > 0)
        clEnqueueWriteBuffer(..., Y, ....)
    if (count == numThreads)
        break; // All threads are done
}
Not the way you've done it, and I doubt it's possible (at least, not portably). I'd also question whether it's even desirable.
The memory model allows Y[tid] to be cached in a register for the whole run of the kernel. Even with a global barrier() and presumably making it volatile, it could still be cached in the local work-group's multi-processor (MP) (my reading of section 3.3.1 of the spec).
Atomic functions are multi-MP coherent, but I don't know if that extends to device memory writes from the host.
Not sure about async_workgroup_copy() either.
Also, queues are in-order by default (and in all implementations?), so you can't invoke a kernel and then a memory copy: the memory copy will (appear to) wait until the kernel is done before executing. You could try multiple queues, though.
The whole design looks very suspect to start with: why have one thread on the CPU make per-thread dispatch decisions when there are potentially thousands of threads on the other end that could make the decision themselves, without an extremely expensive synchronisation every iteration?
Sadly, the AMD SDK currently supports neither out-of-order processing nor parallel execution of multiple command queues on the GPU, although I think parallel execution is supported on the CPU.
maeltier,
In my opinion, it is not possible to check the value of a variable from the host while the kernel is running.
Can you describe a scenario where this would be useful?
antzehere,
I don't think out-of-order command execution can solve this issue, as it is an OpenCL spec requirement.
Thank you all for your replies.
The attached code is a simplification of what I want to do, so it may look strange.
One case where this collaboration between CPU and GPU is required is when you want to support communication across GPUs. You can take a look at this paper, which was implemented using CUDA and NVIDIA GPUs: http://academic.research.microsoft.com/Publication/4856677
By the way, I also tried using async_workgroup_copy(), and two command queues (one for the kernel, one for memory transfers), but it didn't work.
Thank you
himanshu.gautam:
Call me crazy, but from the marketing material I was led to believe that this sort of thing would be possible in future APUs. Am I wrong? The benefits are huge!
In my perfect world, IL would be able to function basically as an extension to x86, much like SSE is currently: shared RAM, shared cache, shared virtual address space, etc., but considered a separate "coprocessor", much like x87 is also a "coprocessor" (I know, I had a 486SX... it used to actually be a physically separate coprocessor).
This would mean I could have one core of an APU (with its own set of GPU-type processors and x86 processor) producing results, and another APU core consuming them, perhaps with one of the CPU cores in the middle doing something like byte-swapping, which, with SSSE3 (PSHUFB), it's incredibly good at.
That's just off the top of my head. I don't think heterogeneous computing needs to mean high overhead from switching from serial, or slightly parallel, to massively parallel and back.
I know, I've only just signed up and I'm already making demands. Just take it as a sign that I'm very excited about the future possibilities of the APU design!
__kernel void volatile_test(volatile __global uint* input)
{
    input[128] = 1;
    // GPU received signal from CPU too early
    if (input[0] == 2)
    {
        input[0] = 16;
        return;
    }
    for (uint i = 0; i < 8065975/40; i++) // roughly 0.25 seconds delay on Cayman
    {
        uint temp = input[0];
        if (temp == 2)
        {
            // GPU received signal from CPU at the correct point
            input[0] = 8;
            return;
        }
    }
    // Detect that the GPU never received the signal.
    input[0] = 4;
}