3 Replies Latest reply on Apr 10, 2015 4:22 AM by dipak

    OpenCL Questions (mostly 2.0)


      Hi, being self-taught in OpenCL, I have a lot of questions whose answers I cannot find in the documentation or via a Google search (maybe I'm just typing the wrong terms). Can someone help me with any of these questions?



      1) When creating SVM memory with clSVMAlloc(...), it returns a CPU pointer that points to GPU memory. Am I correct?



      typedef struct Node { struct Node *node; } Node;

      2) I want to use my Node structure in my kernel using SVM (1000 values). I don't know how to initialize the node values.  I want each kernel instance to access the 0th element (myLinkedList pointer) and traverse from there.  The code below shows how I am thinking.  Please help me change this code to be useful:



        Node *myLinkedList = (Node *)clSVMAlloc(myContext, CL_MEM_READ_WRITE, 1000 * sizeof(Node), 0);

        Node *current = myLinkedList;

        for (int index = 0; index < 1000; index++)
        {
            //current->node does not point to anything yet.... we must point to something!
            current->node = new Node(); //I'm doing something wrong here!!! Doesn't this allocate on the CPU?

            current = current->node;
        }





      3) What does clEnqueueSVMMap(...) ACTUALLY do? Is it a mutex lock? Does it copy memory? Is it fast (can I use it a lot)?



      4) OpenCL 2.0 introduces pipes, and everyone seems to think they are a great thing. My understanding is that they let two kernels talk to each other without having to go through the host (CPU). Do these kernels have to run on the same device? Also, how do I get a pointer (to the kernel) to the other end of the pipe? Can I pass a value into a pipe and, after my kernel finishes, run another kernel and retrieve the piped value (same gid) from the pipe (or are buffers easier)?



      5) Can device-side enqueue call a kernel in another *.cl file? In other words, if I create the cl_kernel for kernel#2 on the host, can I pass that cl_kernel to kernel#1 as an argument and have kernel#1 then call kernel#2, or would the CPU queue be used for that? If I used the CPU queue, wouldn't that cause more overhead?



      6) General OpenCL question (I don't know how I even got this far without knowing this): When I enqueue a kernel, I assume that it runs in the background while my CPU keeps going. If I enqueue another one immediately after, it is put into a queue and executed next. How do I have OpenCL call one callback function when kernel#1 is complete, and a different function when kernel#2 is complete? Can the callbacks be member functions (void myClass::MyFunction()) without using Boost libraries (C++11 is okay)?



      Note) A HUGE thanks to the OpenCL guys for adding non-uniform work-group sizes! I had to use "if (gid < numThreads)" in every kernel... now I don't have to!!! I know there are better reasons to use 2.0 out there, but this is the #1 reason that I will be moving to OpenCL 2.0 (and therefore using AMD graphics cards in the future when I want cutting-edge computing).
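For readers who have not hit this: before OpenCL 2.0, when a local size was given, the global size had to be a multiple of it, so host code typically padded the global size and each kernel guarded against the extra work-items. A minimal C++ sketch of that host-side arithmetic follows. The helper names `padded_global_size` and `is_active` are made up for illustration and are not part of OpenCL.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: smallest multiple of local_size that is >= work_items.
// This is the value one would pass as the global size before OpenCL 2.0.
// Assumes local_size > 0.
std::size_t padded_global_size(std::size_t work_items, std::size_t local_size) {
    return ((work_items + local_size - 1) / local_size) * local_size;
}

// Hypothetical helper mirroring the in-kernel guard "if (gid < numThreads)":
// work-items that only exist because of the padding must do nothing.
bool is_active(std::size_t gid, std::size_t work_items) {
    return gid < work_items;
}
```

For example, 1000 work-items with a local size of 64 pad to a global size of 1024, and the 24 extra work-items fail the `gid < numThreads` check.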





        • Re: OpenCL Questions (mostly 2.0)


          Thanks for asking! I'll keep an eye on this, but since it's a general "how do I" question I'm leaving this for the community to answer.

          • Re: OpenCL Questions (mostly 2.0)

            For those reading this, I thought I should share my findings:



            1) Correct.

            2) https://software.intel.com/en-us/articles/opencl-20-shared-virtual-memory-overview

              This Intel article gives a great overview of how to use certain SVM components.

              It turns out that if I want to use a linked structure like the one I asked about, then instead of using the 'new' keyword I allocate each linked node using clSVMAlloc(...). The root node can be passed using clSetKernelArgSVMPointer, but each child node has to be made known to the kernel using a clSetKernelExecInfo command... I'm still unsure whether this has to be called every time you call the kernel.

              In my own coding, I have not yet worked on this method, and SVM seems to crash my system when I send it to a kernel.  To get fine-grain memory (on the gpu), I allocate memory like so:


              void *mem = clSVMAlloc(_context, CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER, size, 0);

              cl_mem buffer = clCreateBuffer(_context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, size, mem, NULL);

              return mem;


              ... and I send "buffer" as a kernel argument using the well-known clSetKernelArg command. It works like a charm.

              • Re: OpenCL Questions (mostly 2.0)

                Hi zypo,


                Please find my replies below. I hope these answers are helpful to you.


                1) clSVMAlloc() allocates shared virtual memory (SVM) and returns a pointer to it. SVM lets the host and the devices see a unified virtual address space, so any SVM pointer can be accessed directly from both the host and the devices. The physical placement of the SVM allocation depends on the particular system setup. As mentioned in the Programming Guide:

                "Support for SVM does not imply or require that the host and the OpenCL devices in an OpenCL 2.0 compliant architecture share actual physical memory. The OpenCL runtime manages the transfer of data between the host and the OpenCL devices; the process is transparent to the programmer, who sees a unified address space."

                2) To see how SVM is used in practice, you may also check the SVM-related APP SDK samples (e.g. SVMBinaryTreeSearch).
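One way to sidestep per-node allocations, offered as a sketch rather than as what the SDK sample does: carve all nodes out of a single allocation and link them in place, so that only one pointer has to be handed to the kernel. The code below is plain C++ so it can be compiled without the OpenCL headers. The functions `link_nodes` and `list_length` are hypothetical helpers, and the `block` argument stands in for the pointer that `clSVMAlloc` would return. With a coarse-grained SVM buffer, the host-side linking would additionally have to happen between a map and an unmap call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Same shape as the Node in the question: each node only holds a link.
struct Node {
    Node *next;
};

// Hypothetical helper: link `count` nodes stored back to back in `block`.
// In real host code `block` would come from something like
//   clSVMAlloc(context, CL_MEM_READ_WRITE, count * sizeof(Node), 0)
// Note that the size argument is in bytes, not in nodes.
Node *link_nodes(void *block, std::size_t count) {
    if (block == nullptr || count == 0) return nullptr;
    Node *nodes = static_cast<Node *>(block);
    for (std::size_t i = 0; i + 1 < count; ++i) {
        nodes[i].next = &nodes[i + 1];  // pointer stays inside the same block
    }
    nodes[count - 1].next = nullptr;    // terminate the list
    return nodes;
}

// Hypothetical helper: walk the list the way a kernel would traverse it.
std::size_t list_length(const Node *head) {
    std::size_t n = 0;
    for (const Node *cur = head; cur != nullptr; cur = cur->next) ++n;
    return n;
}
```

Because every `next` pointer refers to memory inside the one allocation, the head pointer passed as the kernel argument is enough to reach all nodes. That is the design reason for preferring one block over 1000 separate allocations here.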

                3) clEnqueueSVMMap() acts as a synchronization point. It is more relevant for coarse-grained SVM buffers. No implicit locking is done: the programmer must ensure that a mapped SVM buffer is not modified by other devices while the host is updating it. The behavior of concurrent updates to an SVM buffer without proper synchronization is undefined.

                As for data movement: it depends on the actual physical placement of the SVM allocation. If the SVM is allocated in physically shared memory, no data movement may be required at all, whereas with a discrete GPU the runtime may need to move the data during the map/unmap calls.


                4) i) I don't think so. I believe the kernels can run on any device associated with the context in which the pipe was created.

                ii) There is no API to get a kernel pointer to the other end of the pipe.

                iii) Yes, it is possible.

                5) I didn't fully understand your point, especially the *.cl part.

                Device-side enqueuing requires the kernel code to be expressed in block syntax. To reuse existing kernel code, you can wrap the kernel call inside a block and then enqueue that block on the device. For example:

                void kernel kernel1(...) {...}

                void kernel kernel2(...)
                {
                    // Wrap the call to kernel1 in a block so it can be enqueued from the device.
                    void (^block)(void) = ^{ kernel1(...); };

                    enqueue_kernel(..., block);
                }




                Currently, OpenCL 2.0 requires that a child kernel be enqueued to the same device as its parent.

                6) You may use OpenCL event objects for this purpose. You can build your own callback mechanism on top of the execution status of the command identified by an event, which you can query using the clGetEventInfo API. The event objects returned by the enqueue calls can also be used to synchronize other enqueued commands in the pipeline.
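On the member-function part of the question: OpenCL also offers `clSetEventCallback`, which takes a plain function pointer plus a `void *user_data`. A common C++11 pattern, sketched here without Boost, is a static "trampoline" that casts `user_data` back to the object and calls the member function. To keep the sketch compilable without the OpenCL headers, `event_t`, `status_t`, and `fake_fire_callback` are stand-ins invented for this example. In real code they would be `cl_event`, `cl_int`, and the runtime itself, and the trampoline would carry the `CL_CALLBACK` calling convention.

```cpp
#include <cassert>

// Stand-in types (assumptions): cl_event and cl_int in real OpenCL code.
using event_t = void *;
using status_t = int;
using notify_fn = void (*)(event_t, status_t, void *);

class KernelWatcher {
public:
    // The member function we actually want to run when a kernel completes.
    void on_complete(status_t status) {
        last_status_ = status;
        ++calls_;
    }

    // Static trampoline: has a plain function signature, so it can be
    // passed where a C callback is expected. user_data carries `this`.
    static void trampoline(event_t /*event*/, status_t status, void *user_data) {
        static_cast<KernelWatcher *>(user_data)->on_complete(status);
    }

    int calls() const { return calls_; }
    status_t last_status() const { return last_status_; }

private:
    int calls_ = 0;
    status_t last_status_ = -1;
};

// Hypothetical stand-in for the runtime: invokes the callback right away.
// In real code the registration would be roughly
//   clSetEventCallback(event, CL_COMPLETE, &KernelWatcher::trampoline, &watcher);
void fake_fire_callback(notify_fn fn, void *user_data, status_t status) {
    fn(nullptr, status, user_data);
}
```

Registering a different object (or a different trampoline) on the event returned for kernel#1 and on the one returned for kernel#2 gives each kernel its own completion handler. The object passed as `user_data` must outlive the callback.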