

Journeyman III

Availability of some features in OpenCL 1.2

I am a CUDA developer moving to OpenCL, and I am having difficulty working out which of the features below, familiar from CUDA, are also available in OpenCL.


  1. Overlapping kernel execution with host code
  2. Overlapping multiple kernel executions
  3. Overlapping kernel execution with a host-to-device or device-to-host memcpy
  4. Overlapping a host-to-device memcpy with a device-to-host memcpy
  5. Copying data from host memory to device memory (or the opposite) without involving the CPU/GPU, i.e. DMA
  6. Copying data from one GPU to another directly, as with GPUDirect
  7. Disabling a certain number of cores in the GPU
  8. Recursion, and if available, the maximum depth

If available, are these overlapping operations executed concurrently or truly in parallel (given that the GPU has multiple cores)?

I would greatly appreciate a response. Even small pointers would let me start exploring the details myself.


8 Replies


Overall, OpenCL is slightly higher level than CUDA when it comes to concurrency and asynchronous programming. In your program you state which events a memory transfer or a kernel launch depends on. Once all of those events have occurred (or if there are no dependencies), the runtime is free to issue that memory transaction or kernel launch. The point is that scheduling is left to the runtime: whether certain things can be overlapped (e.g. a memory transfer and kernel execution) is up to the runtime to decide, and it depends on hardware support (e.g. the availability of DMA engines).
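To make the event model concrete, here is a minimal illustrative fragment (not a complete program: `queue`, `bufA`, `kernel`, and the sizes are assumed to be set up already). Only the declared dependency is guaranteed; everything else may be overlapped at the runtime's discretion.

```c
cl_event copy_done, kernel_done;

/* Non-blocking host->device copy; completion is signalled via copy_done. */
clEnqueueWriteBuffer(queue, bufA, CL_FALSE, 0, nbytes, host_ptr,
                     0, NULL, &copy_done);

/* Kernel launch that declares it depends on the copy having finished. */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                       1, &copy_done, &kernel_done);

/* The host is free to do other work here; it only blocks if and when
 * it chooses to wait. */
clWaitForEvents(1, &kernel_done);
```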

That being said, on recent AMD hardware and OpenCL runtimes, points 1 through 6 are available. The NVIDIA OpenCL runtime has them as well. I don't think 7 is available on AMD hardware/runtimes.

I'm not completely sure what you mean by recursion. If you are talking about kernels launching new kernels ("dynamic parallelism" in CUDA 5), that is not supported in OpenCL.

Hope that helps.




>Overlapping Multiple Kernel Executions / host stuff

All I know is that you can overlap two independent kernels (with memory transfers of a few MB/s) perfectly with OpenCL.

Some guidelines I've discovered:

- Alternate two long kernels (execution time around 250-700 ms) with roughly 10% overlap on every GPU device (core); add more overlap if there are memory transfers.

- Use different contexts for the alternated kernels. One context is not enough even when you use the out-of-order flag: the runtime leaves a gap between two kernels and tries to run things sequentially.

- This scales perfectly across multiple GPUs (you need two contexts on every GPU device).

- This way (two contexts per GPU) the CPU will basically sleep while the queues of the GPU compute units stay filled with tasks.
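The shape of that setup might look like the following illustrative fragment (not a complete program: the contexts, device, kernels, and sizes are assumed). Whether two separate contexts are really required is runtime-specific, per the guidelines above; the API shape is the same either way.

```c
/* Two command queues so that two independent kernels can be in flight
 * at once. Per the guidelines above, on this era's AMD runtime each
 * queue lives in its own context (ctx0, ctx1). */
cl_command_queue q0 = clCreateCommandQueue(ctx0, dev, 0, &err);
cl_command_queue q1 = clCreateCommandQueue(ctx1, dev, 0, &err);

/* Enqueue the two long-running kernels on separate queues; the runtime
 * is then free to overlap them on the device. */
clEnqueueNDRangeKernel(q0, kernelA, 1, NULL, &gsizeA, NULL, 0, NULL, NULL);
clEnqueueNDRangeKernel(q1, kernelB, 1, NULL, &gsizeB, NULL, 0, NULL, NULL);
clFlush(q0);
clFlush(q1);
```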

>DMA transfer

Yes, and there is pinned memory too. There's a long discussion of this in the OpenCL programming guide.
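For reference, pinned host memory is typically requested with `CL_MEM_ALLOC_HOST_PTR`, as in this illustrative fragment (not a complete program: `ctx`, `queue`, and `nbytes` are assumed):

```c
/* Ask the runtime for a buffer backed by pinned host memory, which
 * allows transfers to go through the DMA engine. */
cl_mem pinned = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, nbytes,
                               NULL, &err);

/* Map it to get a host pointer into the pinned allocation. */
void *p = clEnqueueMapBuffer(queue, pinned, CL_TRUE, CL_MAP_WRITE,
                             0, nbytes, 0, NULL, NULL, &err);
/* ... fill p with data ... then unmap; subsequent copies from this
 * buffer can be DMA transfers. */
clEnqueueUnmapMemObject(queue, pinned, p, 0, NULL, NULL);
```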

>Disabling certain no. of cores in GPU

You choose which core you use. Also, with an environment variable you can restrict which cores an OpenCL application can see.

>Recursion, and if yes, then what is the maximum depth

If you search this on Google, the answer will be no, at least for the OpenCL standard.

But on the HD 7970 it's not impossible (I mean, the hardware can do it: it finally has instructions to get/set the program counter).


>Disabling certain no. of cores in GPU

>You choose which core you use. Also with an environment variable you can restrict which cores an OpenCL application can see.

If I'm not mistaken, this requires the device fission extension, which is currently only available for CPUs (Intel and AMD OpenCL SDKs).
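For what it's worth, in OpenCL 1.2 device fission was promoted into the core API as device partitioning. An illustrative fragment (not a complete program: `cpu_dev` is an already-queried CPU `cl_device_id`), partitioning by compute units:

```c
/* Split a device into sub-devices of 4 compute units each; at the
 * time of this thread, this only worked on CPU devices. */
cl_device_partition_property props[] = {
    CL_DEVICE_PARTITION_EQUALLY, 4, 0
};
cl_device_id subdevs[8];
cl_uint n = 0;
clCreateSubDevices(cpu_dev, props, 8, subdevs, &n);
/* Each entry of subdevs[0..n-1] can now get its own context/queue. */
```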


I mean GPU cores. Each one is a separate CL device.


In this code the author uses recursion: you can clearly see that it calls the raytrace() function from inside raytrace(). OpenCL doesn't allow recursion, yet this works because the compiler statically unrolls it, since the depth of the recursion can be determined at compile time.


This is a very interesting example. I'm not convinced that this use of recursion is guaranteed to work by the OpenCL spec.


Indeed, it's just a smart compiler.


I've tried it:

When maxTraceDepth=2 (a ray from the screen plus a reflection ray and a refraction ray from the first hit), it generates 25 KB of completely unrolled code (on Tahiti).

maxTraceDepth=3 -> 52 KB; it goes beyond the instruction cache.

maxTraceDepth=4 -> the compiler freezes (I waited 5 minutes and still nothing). This would be 1+2+4+8 = 15 rays total.

maxTraceDepth eliminated (forcing it not to unroll) -> compile time = infinite... It seems there is no way to get a dynamic function call/return from OpenCL.

(But anyway, for this kind of thing a better approach would be a queue and some worker threads processing it, inserting new rays into the queue with atomics.)