

Journeyman III

Availability of some features in OpenCL 1.2

I am a CUDA developer moving to OpenCL, and I am having difficulty working out which of the features below, familiar from CUDA, are also available in OpenCL.


  1. Overlapping kernel execution with host code
  2. Overlapping multiple kernel executions
  3. Overlapping kernel execution with a host-to-device or device-to-host memcpy
  4. Overlapping a host-to-device memcpy with a device-to-host memcpy
  5. Copying data from host memory to device memory (or the opposite) without involving the CPU/GPU, i.e. DMA
  6. Copying data from one GPU to another directly, as with GPUDirect
  7. Disabling a certain number of cores in the GPU
  8. Recursion, and if available, the maximum depth

If available, are these overlapping operations executed concurrently or truly in parallel (given that the GPU has multiple cores)?

I would greatly appreciate a response. Even small pointers would let me start exploring the details myself.


8 Replies


Overall, OpenCL is slightly higher level than CUDA when it comes to concurrency and asynchronous programming. In your program you state which events a memory transfer or a kernel launch depends on. Once all of those events have occurred (or if there are no dependencies), the runtime is free to issue that memory transaction or kernel launch. The point is that scheduling is left to the runtime: whether certain things can be overlapped (e.g. a memory transfer and kernel execution) is up to the runtime to decide, and it depends on hardware support (e.g. the availability of DMA engines).
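To make the event model concrete, here is a minimal illustrative fragment (not a complete program: `queue`, `bufA`, `kernel`, and the sizes are assumed to be set up already). Only the declared dependency is guaranteed; everything else may be overlapped at the runtime's discretion.

```c
cl_event copy_done, kernel_done;

/* Non-blocking host->device copy; completion is signalled via copy_done. */
clEnqueueWriteBuffer(queue, bufA, CL_FALSE, 0, nbytes, host_ptr,
                     0, NULL, &copy_done);

/* Kernel launch that declares it depends on the copy having finished. */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                       1, &copy_done, &kernel_done);

/* The host is free to do other work here; it only blocks if and when
 * it chooses to wait. */
clWaitForEvents(1, &kernel_done);
```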

That being said, on recent AMD hardware and OpenCL runtimes, points 1 through 6 are available. The NVIDIA OpenCL runtime has them as well. I don't think 7 is available on AMD hardware/runtimes.

I'm not completely sure what you mean by recursion. If you are talking about kernels launching new kernels ("dynamic parallelism" in CUDA 5), that is not supported in OpenCL.

Hope that helps.




>Overlapping Multiple Kernel Executions / host stuff

All I know is that you can overlap two independent kernels (with memory transfers of a few MB/s) perfectly with OpenCL.

Some guidelines I've discovered:

- Alternate two long kernels (execution time around 250-700 ms) with roughly 10% overlap on every GPU device (core); add more overlap if there are memory transfers.

- Use different contexts for the alternated kernels. One context is not enough even when you use the out-of-order flag: the runtime leaves a gap between two kernels and tries to run things sequentially.

- This scales perfectly across multiple GPUs (you need two contexts on every GPU device).

- This way (two contexts per GPU) the CPU will basically sleep while the queues of the GPU compute units stay filled with tasks.
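The shape of that setup might look like the following illustrative fragment (not a complete program: the contexts, device, kernels, and sizes are assumed). Whether two separate contexts are really required is runtime-specific, per the guidelines above; the API shape is the same either way.

```c
/* Two command queues so that two independent kernels can be in flight
 * at once. Per the guidelines above, on this era's AMD runtime each
 * queue lives in its own context (ctx0, ctx1). */
cl_command_queue q0 = clCreateCommandQueue(ctx0, dev, 0, &err);
cl_command_queue q1 = clCreateCommandQueue(ctx1, dev, 0, &err);

/* Enqueue the two long-running kernels on separate queues; the runtime
 * is then free to overlap them on the device. */
clEnqueueNDRangeKernel(q0, kernelA, 1, NULL, &gsizeA, NULL, 0, NULL, NULL);
clEnqueueNDRangeKernel(q1, kernelB, 1, NULL, &gsizeB, NULL, 0, NULL, NULL);
clFlush(q0);
clFlush(q1);
```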

>DMA transfer

Yes, and there is pinned memory too. There's a long discussion of this in the OpenCL programming guide.
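For reference, pinned host memory is typically requested with `CL_MEM_ALLOC_HOST_PTR`, as in this illustrative fragment (not a complete program: `ctx`, `queue`, and `nbytes` are assumed):

```c
/* Ask the runtime for a buffer backed by pinned host memory, which
 * allows transfers to go through the DMA engine. */
cl_mem pinned = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, nbytes,
                               NULL, &err);

/* Map it to get a host pointer into the pinned allocation. */
void *p = clEnqueueMapBuffer(queue, pinned, CL_TRUE, CL_MAP_WRITE,
                             0, nbytes, 0, NULL, NULL, &err);
/* ... fill p with data ... then unmap; subsequent copies from this
 * buffer can be DMA transfers. */
clEnqueueUnmapMemObject(queue, pinned, p, 0, NULL, NULL);
```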

>Disabling certain no. of cores in GPU

You choose which core you use. Also, with an environment variable you can restrict which cores an OpenCL application can see.

>Recursion, and if yes, then what is the maximum depth

If you search this on Google, the answer will be no, at least for the OpenCL standard.

But on the HD 7970 it's not impossible (I mean, the hardware can do it: it finally has instructions to get/set the program counter).


>Disabling certain no. of cores in GPU

>You choose which core you use. Also with an environment variable you can restrict which cores an OpenCL application can see.

If I'm not mistaken, this requires the device fission extension, which is currently only available for CPUs (Intel and AMD OpenCL SDKs).
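For what it's worth, in OpenCL 1.2 device fission was promoted into the core API as device partitioning. An illustrative fragment (not a complete program: `cpu_dev` is an already-queried CPU `cl_device_id`), partitioning by compute units:

```c
/* Split a device into sub-devices of 4 compute units each; at the
 * time of this thread, this only worked on CPU devices. */
cl_device_partition_property props[] = {
    CL_DEVICE_PARTITION_EQUALLY, 4, 0
};
cl_device_id subdevs[8];
cl_uint n = 0;
clCreateSubDevices(cpu_dev, props, 8, subdevs, &n);
/* Each entry of subdevs[0..n-1] can now get its own context/queue. */
```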


I mean GPU cores. Each one is a separate CL device.


In this code the author uses recursion: you can clearly see that it calls the raytrace() function from inside raytrace(). OpenCL doesn't allow recursion, yet this works because the compiler statically unrolls it, since the depth of the recursion can be determined at compile time.


This is a very interesting example. I'm not convinced that this use of recursion is guaranteed to work by the OpenCL spec.


Indeed, it's just a smart compiler.


I've tried it:

When maxTraceDepth=2 (a ray from the screen plus a reflection ray and a refraction ray from the first hit), it generates 25 KB of completely unrolled code (on Tahiti).

maxTraceDepth=3 -> 52 KB; it goes beyond the instruction cache.

maxTraceDepth=4 -> the compiler freezes (I waited 5 minutes and still nothing). This would be 1+2+4+8 = 15 rays total.

maxTraceDepth eliminated (forcing it not to unroll) -> compile time = infinite... It seems there is no way to get a dynamic function call/return from OpenCL.

(But anyway, for this kind of thing a better approach would be a queue and some worker threads processing it, inserting new rays into the queue with atomics.)