7 Replies Latest reply on Apr 4, 2011 8:00 PM by jeff_golds

    OpenCL Concurrent Kernel Execution

    sir.um

      There have been several posts on the subject of OpenCL support for Concurrent Kernel Execution (CKE) on ATI Cards. The consensus seems to be that the Radeon HD 5xxx Hardware supports it but it is not yet supported in the OpenCL driver.

      Is there any indication of when or if this will be fixed in the drivers? ATI Stream SDK 2.3?

      I only ask because, aside from speeding up normal SIMD kernels, CKE is what makes task-parallel computation [queue.enqueueTask(), i.e. kernels with a workgroup of size 1] worthwhile. Without it, tasks get ZERO performance improvement on a system with a single OpenCL device: no 2 tasks can run in parallel, so they must run one after the other, even if the OpenCL device has more than enough resources to run both kernels.

      Additionally, as far as I'm concerned, the ONLY benefit of NVidia over ATI is CKE. ATI consistently has better/faster hardware, and especially considering the fact that the hardware already supports CKE, it's a no brainer to implement.

      I need CKE!!

      -Chris

        • OpenCL Concurrent Kernel Execution
          LeeHowes

          Micah might be able to give you a better answer about when it will be feasible. You definitely wouldn't want to issue workgroups of size 1 anyway; EVERY task must be at least 1 hardware thread, i.e. a wavefront. What CKE would allow is issuing long-running workgroups of multiple wavefronts that overlap in the hardware. They'd have to be long running to get any advantage.

          Being able to split the device using the device fission extension would give you more control to issue multiple grids that run independently, but I don't think that will be feasible for the next few SDK versions.

            • OpenCL Concurrent Kernel Execution
              sir.um

              Taken from Khronos OpenCL v1.1 Spec. (opencl-1.1.pdf)

               

              Section 3.4.2 Task Parallel Programming Model

              The OpenCL task parallel programming model defines a model in which a single instance of a kernel is executed independent of any index space. It is logically equivalent to executing a kernel on a compute unit with a work-group containing a single work-item.



              Perhaps this is a byproduct of teaching myself OpenCL, but it was my understanding that the above section refers to enqueueing several independent and unrelated "tasks" (single-instance kernels, workgroup size = 1) to the same device, allowing non-SIMD code to execute in parallel.

              As I understand it, the GPU architecture on the Radeon 5870, for example, is composed of 20 compute units, each of which contains 80 stream processors. While all of the stream processors of a single compute unit must run the same kernel, each compute unit can independently run different kernels. That being the case, task parallelism/single work-item tasks could still benefit from running up to 20 tasks in parallel. Correct?

              What is the device fission extension? I have never heard of that. Where can I learn more about that?

              -Chris



                • OpenCL Concurrent Kernel Execution
                  LeeHowes

                  Yes, that's what it means. It's a bad thing to do on the GPU, though, and as far as I know enqueued tasks will only run on the host.

                  Remember that a hardware thread on the GPU is a 64-wide SIMD issue with a single program counter. If you write a task that is narrower than 64 work items, you are losing efficiency.
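To put a number on that efficiency loss, here is a minimal sketch that assumes the 64-wide wavefront mentioned above and nothing else about the hardware. The function name `lane_utilization` is made up for illustration; it only counts how many SIMD lanes carry a real work-item.

```python
import math

# Wavefront width quoted in the post above; other devices may differ.
WAVEFRONT_SIZE = 64


def lane_utilization(work_items: int, wavefront_size: int = WAVEFRONT_SIZE) -> float:
    """Fraction of SIMD lanes that carry a real work-item for one dispatch."""
    if work_items <= 0:
        raise ValueError("work_items must be positive")
    # A dispatch always occupies a whole number of wavefronts.
    wavefronts = math.ceil(work_items / wavefront_size)
    return work_items / (wavefronts * wavefront_size)


for n in (1, 16, 64, 65, 256):
    print(f"{n:4d} work-items -> {lane_utilization(n):.1%} of lanes busy")
```

Under this simple model a strict single work-item task keeps 1 of 64 lanes busy, and a dispatch of 65 work-items spills into a second, almost empty wavefront.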

                  You are right that it could benefit from 20 tasks in parallel, but the hardware isn't capable of that; it has state management able to run 5 or so in parallel. So what you really want is tasks that are multiple waves wide, or more generally to issue multiple instances of a 1-wave-wide task in each launch. Theoretically the hardware can do that, and it works in DX. However, if the kernels are not running for long, the overhead of enqueuing those tasks will be substantial.

                  The device fission extension allows you to split a single CL device into multiple subdevices. On the CPU this means we can have a separate device, and hence a separate queue, per CPU core. On the GPU we cannot currently do this (it's too dynamic in hardware at the moment).

                  If you're interested there is going to be a webinar on the subject next week:

                  http://developer.amd.com/zones/OpenCLZone/Events/pages/OpenCLWebinars.aspx

                   

                  What I would say is that any kernel dispatch is a task. A task can be 1 or more waves wide. There is little meaningful distinction between a strict CL task and a CL data-parallel kernel; it's a naming trick. On a CPU, currently, a "wavefront" (there is no official OpenCL term) is 1 work-item wide because we do not automatically compact work-items into SSE vectors.

                  • OpenCL Concurrent Kernel Execution
                    jeff_golds

                     

                    Originally posted by: sir.um

                    Taken from Khronos OpenCL v1.1 Spec. (opencl-1.1.pdf)

                     

                     

                    Section 3.4.2 Task Parallel Programming Model

                    The OpenCL task parallel programming model defines a model in which a single instance of a kernel is executed independent of any index space. It is logically equivalent to executing a kernel on a compute unit with a work-group containing a single work-item.


                     

                    Perhaps this is a byproduct of teaching myself OpenCL, but it was my understanding that the above section refers to enqueueing several independent and unrelated "tasks" (single-instance kernels, workgroup size = 1) to the same device, allowing non-SIMD code to execute in parallel.

                    As I understand it, the GPU architecture on the Radeon 5870, for example, is composed of 20 compute units, each of which contains 80 stream processors. While all of the stream processors of a single compute unit must run the same kernel, each compute unit can independently run different kernels. That being the case, task parallelism/single work-item tasks could still benefit from running up to 20 tasks in parallel. Correct?



                    No.  That would require storing 20 different programs (pointers to programs really) and related state on the ASIC at once and it's not physically possible.  If you look around you can find the register spec for Evergreen-class parts.  You'll find that there is a maximum of 8 copies of the context registers.  This limits you to a maximum of 8 unique programs that could be running at once.  In practice, you might not get 8 because we may reserve some of the contexts.
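As a rough illustration of that limit, the sketch below computes a lower bound on how many sequential batches are needed to run a set of distinct kernels. The only figure taken from this post is the maximum of 8 context copies; the `reserved_contexts` parameter is hypothetical (the post does not say how many are reserved), and real scheduling is not necessarily batch-based.

```python
import math

# Maximum context-register copies quoted above for Evergreen-class parts.
MAX_CONTEXTS = 8


def min_rounds(unique_kernels: int, reserved_contexts: int = 0,
               max_contexts: int = MAX_CONTEXTS) -> int:
    """Lower bound on sequential batches needed to run `unique_kernels`
    distinct programs when only (max_contexts - reserved_contexts) can be
    resident at once. `reserved_contexts` is an assumed, illustrative value."""
    if unique_kernels < 0:
        raise ValueError("unique_kernels must be non-negative")
    usable = max_contexts - reserved_contexts
    if usable < 1:
        raise ValueError("no usable contexts left")
    return math.ceil(unique_kernels / usable)


print(min_rounds(20))                       # all 8 contexts usable
print(min_rounds(20, reserved_contexts=3))  # hypothetical: 3 contexts reserved
```

So 20 different single-task kernels could not all be resident at once even in the best case; they would need at least three batches, and more if some contexts are reserved.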

                    We're making improvements to the software stack that will make it possible to run more than one kernel at a time, but I can't give a timeframe as many changes need to be done.

                    Also, please don't get in the habit of doing dispatches of a single thread.  The basic work unit for the SIMDs is a wavefront, and if you only submit a single thread at a time, most of the wavefront sits there idle.  The GPU is happiest with lots of work, so submit as many wavefronts as you can.

                    Jeff

                      • OpenCL Concurrent Kernel Execution
                        edward_yang

                         

                        Originally posted by: jeff_golds

                         

                        No.  That would require storing 20 different programs (pointers to programs really) and related state on the ASIC at once and it's not physically possible.  If you look around you can find the register spec for Evergreen-class parts.  You'll find that there is a maximum of 8 copies of the context registers.  This limits you to a maximum of 8 unique programs that could be running at once.  In practice, you might not get 8 because we may reserve some of the contexts.

                         

                        We're making improvements to the software stack that will make it possible to run more than one kernel at a time, but I can't give a timeframe as many changes need to be done.

                         

                        Also, please don't get in the habit of doing dispatches of a single thread.  The basic work unit for the SIMDs is a wavefront, and if you only submit a single thread at a time, most of the wavefront sits there idle.  The GPU is happiest with lots of work, so submit as many wavefronts as you can.

                         

                        Jeff

                         

                        @Jeff: I was looking for info about OpenCL device fission and got to this page. Thanks for the explanation. I have a few related questions, though.

                        - What do you mean by "reserve some of the context"? For normal gpu-display processing?

                        - As I understand it, a workgroup should not have so many wavefronts that it causes register overflow?

                        Is there clear documentation or a simple benchmark that shows the register file size of each AMD/ATI GPU? I think that'd be very useful for OpenCL optimization. Thanks.

                          • OpenCL Concurrent Kernel Execution
                            himanshu.gautam

                            edward,

                            2. Having more wavefronts in a workgroup is generally a good idea. It is recommended to have at least 2 wavefronts per workgroup.

                            3. You can refer to Appendix D, "Device Parameters", of the OpenCL Programming Guide for the register file size of all the GPUs. You can check the GPRs used by your kernel using SKA or the Profiler and optimize for the number of wavefronts that best suits your problem.
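Once you have the GPR count from SKA or the Profiler, a simplified estimate of how many wavefronts can be resident on one SIMD looks like the sketch below. It assumes registers are the only limiting resource besides a hard cap; `gpr_pool_per_lane` and `max_wavefronts` are parameters you would look up for your device, and the values in the example calls are placeholders, not figures from Appendix D.

```python
def wavefronts_per_simd(gprs_per_work_item: int, gpr_pool_per_lane: int,
                        max_wavefronts: int) -> int:
    """Simplified estimate of resident wavefronts on one SIMD.

    gprs_per_work_item: GPRs the kernel uses (as reported by a profiler).
    gpr_pool_per_lane:  registers available to each SIMD lane (device-specific).
    max_wavefronts:     hardware cap on resident wavefronts (device-specific).
    """
    if gprs_per_work_item < 1 or gpr_pool_per_lane < 1 or max_wavefronts < 1:
        raise ValueError("all arguments must be positive")
    # Each resident wavefront needs its own slice of the per-lane pool.
    return min(max_wavefronts, gpr_pool_per_lane // gprs_per_work_item)


# Placeholder device values for illustration only.
POOL, CAP = 256, 32
for gprs in (4, 10, 128):
    print(f"{gprs:3d} GPRs/work-item -> {wavefronts_per_simd(gprs, POOL, CAP)} wavefronts")
```

The point of the model is the trend: the more registers each work-item uses, the fewer wavefronts are available to hide latency.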

                            • OpenCL Concurrent Kernel Execution
                              jeff_golds

                               

                              Originally posted by: edward_yang

                               

                              - As I understand a workgroup should not have too many wavefronts to cause register overflow?

                               

                              Is there a clear documentation or simple benchmark to show the register file size of each AMD/ATI GPU? I think that'd be very useful for opencl optimization. Thanks.

                               

                              Register file overflow mainly occurs when the group size for your dispatch is large.  If you have a group size equal to the wavefront size (e.g. 64 threads on most AMD GPUs), then you will have the best chance to avoid spilling.  Scheduling more work to a SIMD doesn't cause spilling; any wavefronts not able to be scheduled will just wait until earlier wavefronts are complete.
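A minimal sketch of why group size matters, assuming every wavefront of a work-group has to hold its registers on the same SIMD at the same time and that the register pool is expressed per lane. The function name and the pool size of 256 are placeholders for illustration, not device data.

```python
import math


def gpr_budget_per_work_item(group_size: int, gpr_pool_per_lane: int,
                             wavefront_size: int = 64) -> int:
    """Registers each work-item can use before the group no longer fits,
    under the assumption that the whole group must be co-resident."""
    if group_size < 1 or gpr_pool_per_lane < 1 or wavefront_size < 1:
        raise ValueError("all arguments must be positive")
    # A group of this size occupies this many wavefronts at once.
    wavefronts_in_group = math.ceil(group_size / wavefront_size)
    return gpr_pool_per_lane // wavefronts_in_group


POOL = 256  # placeholder per-lane register pool
for size in (64, 256, 1024):
    print(f"group size {size:4d} -> {gpr_budget_per_work_item(size, POOL)} GPRs per work-item")
```

With a group size equal to the wavefront size, one wavefront gets the whole pool; larger groups divide it, which is where spilling becomes more likely.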

                              Jeff