15 Replies Latest reply on Jan 24, 2012 9:55 AM by evk8888

    Concurrent kernel execution + some other questions

    Wibowit
      Does the AMD APP SDK fully support out-of-order queues and concurrent kernel execution?

      Hi,

       

      I couldn't find any information about the level of task parallelism in AMD's OpenCL implementation.

       

      OpenCL provides two ways to achieve task level parallelism:

      - multiple in-order queues,

      - one out-of-order queue,
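      As a rough illustration of the first option, here is a toy model in plain Python. It is not OpenCL and makes no claim about how any SDK schedules work: each hypothetical `InOrderQueue` runs its commands one at a time in submission order, while separate queues are free to overlap. The class, the queue names and the kernel names are all made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


class InOrderQueue:
    """Toy stand-in for an in-order command queue (not an OpenCL API)."""

    def __init__(self, name):
        self.name = name
        # A single worker thread means commands execute one at a time,
        # in exactly the order they were submitted.
        self._worker = ThreadPoolExecutor(max_workers=1)
        self.log = []

    def enqueue(self, kernel_name):
        # "Running" a kernel here just records its name in the queue's log.
        return self._worker.submit(self.log.append, kernel_name)

    def finish(self):
        # Block until every command submitted to this queue has run.
        self._worker.shutdown(wait=True)


# Two independent queues: ordering is guaranteed within each queue,
# but nothing orders queue_a's commands relative to queue_b's.
queue_a = InOrderQueue("a")
queue_b = InOrderQueue("b")
for i in range(3):
    queue_a.enqueue(f"kernel_a_{i}")
    queue_b.enqueue(f"kernel_b_{i}")

queue_a.finish()
queue_b.finish()
print(queue_a.log)
print(queue_b.log)
```

      A single out-of-order queue would instead leave the ordering to explicit event dependencies between commands, which this sketch does not model.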

       

      I am curious which of these are fully implemented. I've read somewhere that the APP SDK ignores events and executes kernel invocations in order. I've also read that the APP SDK supports multiple queues, and that with multiple queues one can execute different kernels concurrently. But maybe when I create multiple queues, each one is bound to a different set of compute units?

       

      I'm designing a sorting algorithm (BWT) plus compression, and I plan to have many kernels: one heavy on LDS (initial sorting of small blocks), some heavy on global bandwidth (global sorting), and some heavy on ALU and registers (compression). I want my GPU to be fully utilized. BWT is a block transformation, so one block could be sorted while another block is compressed in parallel.

       

      Other questions:

      Why does OpenCL have halved LDS bandwidth? I've read that in AMD's documentation.

       

      Does the APP SDK compiler accept parameters to choose the optimization level, similar to -O2 or -O3 in GCC?

          himanshu.gautam

          Wibowit,

          Out-of-order queues are not supported, but multiple in-order queues are supported. I am not aware of any event handling issues.

           

          Can you please point out where it says that LDS bandwidth has been halved?

           

            MicahVillmow
            Concurrent kernel execution is not supported in SDK 2.4 and is not planned for 2.5 either.
                s58000

                Bad news!

                I don't really want to stick with CUDA; I hope this will be done in some future release.

                Thank you anyway.

                    KenDomino

                    Sorry for reviving this thread. I've also looked at this a bit, and you may be interested in what I found.

                    I wrote a program that executes a kernel using clEnqueueTask, and it shows that concurrent kernels do not work, at least on my GPU (HD 6450, SDK v2.5).  However, concurrency via multiple queues works on my quad-core OpenCL CPU device.  Unfortunately, there seems to be a maximum of about 44 queues that can be created.  (That should be fine until we have a 44-core CPU.)  CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE does not work for either GPU or CPU.  The code is here.

                    I also have an NVIDIA Fermi card, and there is no way to get concurrent kernels on it using their OpenCL (driver v280.19 with OpenCL 1.1), with either a single out-of-order queue or multiple queues.  You can get concurrent kernels using the CUDA Runtime API, but then it's not OpenCL.

                    Ken

                        genaganna

                         

                        Originally posted by: KenDomino  I wrote a program that executes a kernel using clEnqueueTask, and it shows that concurrent kernels do not work, at least on my GPU (HD 6450, SDK v2.5).

                        Concurrent kernel execution is not yet supported on GPUs.

                         However, concurrency via multiple queues works for my quad-core OpenCL CPU device.  Unfortunately, there seems to be a max of about 44 queues that can be created.  (That should be fine until we have a 44-core CPU.)

                        Are you getting an appropriate error message if more than 44 command queues are created?

                         CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE does not work for either GPU or CPU.  The code is here.

                         

                        CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE is not yet supported on either CPU or GPU.

                            Tristan23

                             

                            Originally posted by: genaganna

                             

                            CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE is not yet supported on either CPU or GPU.

                             

                             

                            That's strange. I thought out-of-order execution was a mandatory feature in the OpenCL spec ...

                              KenDomino

                              Yep, I do get an error return (out of resources) when creating the command queue.  That's fine.

                              I wasn't sure about AMD OpenCL on GPUs, but the CUDA API itself does support concurrent kernels.

                                  evk8888

                                  Hi, this thread was very useful for checking concurrent execution of tasks in OpenCL. In my experiments I create many queues and use them to enqueue different kernels. I am not able to see full utilization of the machine's cores (that is, a number of busy cores equal to the number of queues declared). It's an Intel processor with 24 cores, and it shows no more than 3 to 6 cores in use at a time with about 10 queues. Has anyone checked CPU utilization when running with many queues? Or is there some way to get full utilization without using device fission?

                                   

                                  Some values for 10 queues:

                                    13.510468 seconds elapsed for concurrent (utilization around 4 cores (stable))

                                    20.554451 seconds elapsed for sequential (1 core)
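                                  To put those two timings in perspective, here is a small Python sketch of the arithmetic. The figure of 4 busy cores is the approximate utilization quoted above, not a measured value, and the variable names are only illustrative.

```python
# Timings quoted above for 10 queues, in seconds.
t_sequential = 20.554451  # sequential run, 1 core
t_concurrent = 13.510468  # concurrent run, roughly 4 cores busy

# Speedup of the concurrent run relative to the sequential run.
speedup = t_sequential / t_concurrent

# Parallel efficiency, ASSUMING about 4 cores were actually in use
# (an approximate utilization figure, not a measurement).
cores_used = 4
efficiency = speedup / cores_used

print(f"speedup:    {speedup:.2f}x")
print(f"efficiency: {efficiency:.0%} of {cores_used} cores")
```

                                  Under that assumption the speedup is only about 1.5x, well short of what 4 busy cores (let alone 10 queues) would ideally give.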

                                  Thanks