8 Replies Latest reply on Jul 22, 2012 4:29 AM by realhet

    Availability of some features in OpenCL 1.2


      I am a CUDA developer moving to OpenCL. I am having difficulty working out which of the CUDA features listed below are also available in OpenCL.


      1. Overlapping kernel execution with a host function
      2. Overlapping multiple kernel executions
      3. Overlapping kernel execution with a CPU-GPU or GPU-CPU memcpy
      4. Overlapping a CPU-GPU memcpy with a GPU-CPU memcpy
      5. Copying data from host memory to device memory, or the opposite, without involving the CPU/GPU, i.e. DMA
      6. Copying data from one GPU to another GPU directly, just like GPUDirect
      7. Disabling a certain number of cores in the GPU
      8. Recursion, and if yes, what the maximum depth is


      If available, are these overlapping operations executed concurrently, or executed in parallel (as the GPU has multiple cores)?


      I would greatly appreciate a response on this. If you can just give a few small pointers, I will start exploring the details.



        • Re: Availability of some features in OpenCL 1.2



          Overall, OpenCL is slightly higher level than CUDA in terms of concurrency and asynchronous programming. In your program you state which events a memory transfer or a kernel call depends on. Once all of those events have occurred (or if there are no dependencies), the runtime is free to issue that memory transfer or kernel call. The point is that scheduling is left to the runtime. Whether certain things can be overlapped (e.g. a memory transfer and the execution of a kernel) is up to the runtime to decide, and it depends on hardware support (e.g. the availability of DMA engines).


          That being said, on recent AMD hardware and OpenCL runtimes, points 1 through 6 are available. The NVIDIA OpenCL runtime has them as well. I don't think 7 is available for the AMD hardware/runtime.


          I'm not completely sure what you mean by recursion. If you're talking about kernels launching new kernels ("dynamic parallelism" in CUDA 5), that is not supported in OpenCL 1.2.


          Hope that helps.



          • Re: Availability of some features in OpenCL 1.2

            >Overlapping Multiple Kernel Executions / host stuff


            All I know is that you can overlap two independent kernels (with memory transfers of a few MB/s) perfectly with OpenCL.

            Some guidelines I've discovered:

            - Alternate two long kernels (execution time around 250..700 ms) with approx. 10% overlap on every GPU device (core). Add more overlap if there are memory transfers.

            - Use a different context for each of the alternated kernels. One context is not enough, even when you use the out_of_order flag: it leaves a gap between the two kernels and tries to do things sequentially.

            - This scales perfectly across multiple GPUs (you need two contexts on every GPU device).

            - This way (2 contexts per GPU) the CPU will basically sleep while the queues of the GPU compute units are always filled with tasks.


            >DMA transfer

            Yes. There is pinned memory too. There's a long story about this in the OpenCL programming guide.


            >Disabling certain no. of cores in GPU

            You choose which core you use. Also, with an environment variable you can restrict which cores an OpenCL application can see.


            >Recursion, and if yes, then what is the maximum depth

            If I search for this on Google, the answer is NO for the OpenCL standard.

            But on the HD 7970 it's not impossible (I mean, the hardware can do it: it finally has instructions to get/set the program counter).

            • Re: Availability of some features in OpenCL 1.2

              The author of this code, http://www.gamedev.net/blog/1241/entry-2254210-realtime-raytracing-with-opencl-ii/, uses recursion: you can clearly see raytrace() being called from inside raytrace(). OpenCL doesn't allow recursion, but it works here because the compiler statically unrolls it, as the recursion depth can be determined at compile time.