9 Replies Latest reply on May 28, 2011 9:42 AM by Dr.Haribo

    How do I execute kernels without 100% CPU busy-wait?

    Dr.Haribo

      OS: 64-bit Windows 7

      CPU: Intel Core 2 Duo E8400

      GPU 1: AMD Radeon 6990 (dual Cayman) with AMD_Catalyst_11.5a_Hotfix_8.85.6RC2_Win7_May13

      GPU 2: nVidia GeForce GTX 580 with 270.61 drivers (2011.04.18)

      An admittedly aging CPU with two cores driving three top-of-the-line GPUs. But still, I don't see why it would take 100% CPU power to run a SHA-256 hashing kernel - a compute intensive task with very little data to transfer to/from the CPU. Actual CPU usage should be more like 0.01%.

      I am having problems keeping all three GPUs running full speed and no matter what I try I cannot get the CPU load down.

      As any blocking OpenCL call triggers 100% CPU usage, I tried polling an event on the running kernel with thread sleeps in between. I ran into two problems:

      1. As soon as I enqueue a kernel it starts to run on the GeForce, but on the Radeon it just sits there with state CL_QUEUED indefinitely, unless I call some blocking operation. clWaitForEvents, clFinish, clFlush will get things moving, but they block with a busy-wait.

      This is odd, because on page 27 (1-13) I find this:

      "Unless the GPU compute device is busy, commands are executed immediately."

      in the following document:

      http://developer.amd.com/gpu/AMDAPPSDK/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf

      Is it a bug?

      2. Even with all my threads asleep, there is no improvement. There's always a thread in amdocl.dll and nvcuda.dll eating all available CPU cycles. Do both drivers always busy-wait to detect events from the GPUs? I see this in all OpenCL programs, not just my own.

      After realizing that both drivers insist on using 100% CPU cycles as long as the GPUs are working, I thought I could at least make the OpenCL kernels run smoothly even if it lags everything else on the computer. The idea was to queue several kernels at once. This way, even if the thread running a GPU didn't get any CPU cycles for a while, the GPU would still have queued kernels to run until the CPU thread could catch up and enqueue more.

      I tried this with one thread per OpenCL device (3 in all), first with several kernel invocations on the same queue, later with several queues with only one kernel on each queue.

      It works for the GeForce. As long as the CPU thread gets to run every once in a while it can keep plenty of work stacked up for the GPU to run at full speed.

      For the dual Radeons I was less lucky. There seems to be no way to get a kernel to start executing without also blocking my thread until the kernel finishes. Maybe someone can point me to a way, if there is one? With many kernels enqueued in the same queue, calling clFlush on the queue blocks until all of them finish.

      The answer seems to be to keep several CPU threads for each GPU. Each of the threads with its own queue, pushing one kernel invocation at a time at the GPU. I guess I will try that next.

      I already spent too much time on this, and I'm thinking that this is too silly - surely it must be me doing it wrong. What's the proper way to deal with this?

       

        • maximmoroz

          > but on the Radeon it just sits there with state CL_QUEUED indefinitely, unless I call some blocking operation. clWaitForEvents, clFinish, clFlush will get things moving, but they block with a busy-wait

          This is how it should work. Enqueued commands are not sent to the device immediately. But there is one error in your sentence: clFlush is actually a non-blocking call. That is how it differs from clFinish.

          But I think there is one case where the call to clFlush becomes blocking: when you enqueue blocking buffer operations. Do you? Enqueueing a blocking buffer operation effectively flushes the queue and waits for the whole batch to complete.

          > "Unless the GPU compute device is busy, commands are executed immediately."

          Well, this sentence is valid if the command queue is flushed.

          > The idea was to queue several kernels at once

          It makes sense to implement this idea even if you don't have any problems with high CPU load.

          Speaking about busy-waits... It is a good idea. I guess it helps reduce kernel launch time. Actually if the wait time is long enough the thread goes to sleep state.

          P.S. I suggest you read "The OpenCL Specification version 1.1" document. Everything I said is covered there.
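To make the flush behaviour described above concrete, here is a minimal simulation in Python. It does not call OpenCL: `FakeQueue`, `FakeEvent` and `wait_with_sleep` are hypothetical stand-ins invented for this sketch. It only models the point made in this reply, that enqueued commands stay queued until the queue is flushed, and that a non-blocking flush followed by sleep-based polling lets the host thread stay idle while the work runs. In real host code the corresponding calls would be clEnqueueNDRangeKernel, clFlush, and clGetEventInfo with CL_EVENT_COMMAND_EXECUTION_STATUS.

```python
import threading
import time

QUEUED, SUBMITTED, COMPLETE = "QUEUED", "SUBMITTED", "COMPLETE"


class FakeEvent:
    """Hypothetical stand-in for a cl_event; only carries a status."""

    def __init__(self):
        self.status = QUEUED


class FakeQueue:
    """Toy command queue: commands are batched until flush() is called."""

    def __init__(self):
        self._pending = []

    def enqueue(self, run_seconds):
        event = FakeEvent()
        self._pending.append((event, run_seconds))
        return event  # returns at once; nothing has started running

    def flush(self):
        # Non-blocking: hand the batch to a "device" thread and return.
        batch, self._pending = self._pending, []
        for event, _ in batch:
            event.status = SUBMITTED
        threading.Thread(target=self._run, args=(batch,), daemon=True).start()

    @staticmethod
    def _run(batch):
        for event, run_seconds in batch:
            time.sleep(run_seconds)  # pretend the kernel is executing
            event.status = COMPLETE


def wait_with_sleep(event, poll_interval=0.005):
    """Poll the event status, sleeping between checks instead of spinning."""
    polls = 0
    while event.status != COMPLETE:
        time.sleep(poll_interval)
        polls += 1
    return polls


queue = FakeQueue()
event = queue.enqueue(run_seconds=0.05)
status_before_flush = event.status  # still QUEUED: nothing was submitted yet
queue.flush()
polls = wait_with_sleep(event)
status_after_wait = event.status
```

Whether a real driver spins inside its own worker thread while the kernel runs is a separate matter that this sketch cannot show; it only covers what the host thread does.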

            • Dr.Haribo

               

              Originally posted by: maximmoroz

               

              But I think there is one case where the call to clFlush becomes blocking: when you enqueue blocking buffer operations. Do you? Enqueueing a blocking buffer operation effectively flushes the queue and waits for the whole batch to complete.

               

              I enqueue the kernel non-blocking, then, shortly after, I enqueue a blocking operation to read the data processed by the kernel. I put in a clFlush() after enqueuing the kernel, though, to make sure it gets going right away.

               

              Speaking about busy-waits... It is a good idea. I guess it helps reduce kernel launch time. Actually if the wait time is long enough the thread goes to sleep state.

               

               

               

              In my opinion busy-waits are rarely a good idea. They waste CPU cycles, obviously, and probably some bus bandwidth as well, since the loop has to read the finished/not-finished state from somewhere (main memory? the PCIe bus?). Usually this sort of thing is solved with hardware interrupts.

              However, you may be right that it can get kernels launched a little bit faster. I wish it was possible for the programmer to choose a different solution with lower CPU-usage, though.

                • maximmoroz

                   

                  Originally posted by: Dr.Haribo

                  I enqueue the kernel non-blocking, then, shortly after, I enqueue a blocking operation to read the data processed by the kernel. I put in a clFlush() after enqueuing the kernel, though, to make sure it gets going right away.

                  It doesn't look like the most efficient way to distribute work across 3 different devices. While you are waiting for the blocking read to finish, the other 2 devices might already have finished their kernels and be sitting idle.

                  You might reduce the problem by enqueueing a kernel launch AND a subsequent non-blocking read on all 3 devices (the non-blocking reads should obviously transfer data to 3 different host memory buffers), calling clFlush on all 3 command queues, and then, once all 3 devices are loaded with work (kernel + read), calling clFinish on each of them in turn. That way you get a speed of (speed of the slowest device) * 3.
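The "(speed of the slowest device) * 3" estimate can be checked with a little arithmetic, sketched below in Python. It assumes every device receives the same batch size each round and the host waits for all of them before starting the next round; the hash rates and the batch size are made-up numbers, not measurements from this thread.

```python
def lockstep_throughput(device_speeds, batch_size):
    """Hashes per second when all devices get equal batches and the host
    waits for every device before starting the next round."""
    # A round lasts as long as the slowest device needs for its batch.
    round_time = max(batch_size / speed for speed in device_speeds)
    return len(device_speeds) * batch_size / round_time


def independent_throughput(device_speeds):
    """Upper bound when every device is kept busy on its own schedule."""
    return sum(device_speeds)


# Hypothetical hash rates in hashes per second for three devices.
speeds = [300e6, 300e6, 150e6]

lockstep = lockstep_throughput(speeds, batch_size=1e9)
independent = independent_throughput(speeds)
```

With these assumed numbers the lockstep scheme works out to about 3 * 150e6 hashes per second, while independent scheduling could reach the sum of the three rates, so the gap grows with the difference between the devices.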

                  As far as I remember, NVIDIA implements OpenCL 1.0, so you cannot use clSetEventCallback with the NVIDIA platform. You can still use this function with the AMD platform (OpenCL 1.1) to implement any type of "wait for completion" lock, either spinning or sleeping, where you wait for any one of the events to complete.
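As a rough illustration of the sleeping "wait for completion" idea, the Python sketch below simulates three devices with threads. Nothing here is OpenCL: `FakeEvent`, `set_callback` and `fake_kernel` are hypothetical names, with `set_callback` playing the role that clSetEventCallback with CL_COMPLETE would play in real host code. Each callback posts the device name to a blocking queue, so the host thread sleeps until any one of the events completes.

```python
import queue
import threading
import time


class FakeEvent:
    """Hypothetical stand-in for a cl_event with a completion callback."""

    def __init__(self):
        self._lock = threading.Lock()
        self._done = False
        self._callbacks = []

    def set_callback(self, fn):
        # Register fn to run on completion, or run it now if already done.
        with self._lock:
            if not self._done:
                self._callbacks.append(fn)
                return
        fn()

    def complete(self):
        with self._lock:
            self._done = True
            callbacks, self._callbacks = self._callbacks, []
        for fn in callbacks:
            fn()


def fake_kernel(event, run_seconds):
    time.sleep(run_seconds)  # pretend the device is busy
    event.complete()


finished = queue.Queue()
run_times = {"gpu0": 0.06, "gpu1": 0.02, "gpu2": 0.04}  # made-up durations

for name, seconds in run_times.items():
    ev = FakeEvent()
    # The callback only posts a message; the host thread does the real work.
    ev.set_callback(lambda name=name: finished.put(name))
    threading.Thread(target=fake_kernel, args=(ev, seconds), daemon=True).start()

# get() blocks without spinning until some device reports completion.
completion_order = [finished.get(timeout=5) for _ in run_times]
```

Keeping the callback tiny and doing the follow-up work on the host thread is a design choice of this sketch, made so that the callback never has to call back into the API from the driver's thread.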

                    • himanshu.gautam

                      Dr. Haribo,

                      I do not see high CPU usage while running my kernels on single or multiple GPUs.

                      I am using 2 Barts GPUs on 64-bit Vista.

                      You said you see high CPU usage while running something on a single device of the 6990. Do you see the same with the SDK samples?

                      • Dr.Haribo

                         

                        Originally posted by: maximmoroz

                         

                        It doesn't look like the most efficient way to distribute work across 3 different devices. While you are waiting for the blocking read to finish, the other 2 devices might already have finished their kernels and be sitting idle.

                         

                         

                        I'm using one thread on the host for each OpenCL device. I may have to try using 2 or 3 threads per device.

                        But in theory, if clFlush() doesn't block, then I should be able to queue several kernels in the same queue, then in a loop wait for the oldest to finish and queue a new kernel every time that happens. I'm not sure why I had problems doing that - could be a bug in my code.
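The loop described above (keep several kernels queued, wait for the oldest, then enqueue a replacement) can be sketched as a simulation in Python. `FakeDevice` and `run_pipelined` are hypothetical names for this sketch, and the worker thread stands in for a GPU that executes submitted jobs in order; in real host code the submit step would be an enqueue followed by clFlush, and the wait would be on the oldest kernel's event.

```python
import threading
import time
from collections import deque


class FakeDevice:
    """Hypothetical stand-in for one GPU queue: runs jobs in submit order."""

    def __init__(self):
        self._jobs = deque()
        self._cv = threading.Condition()
        self.completed = []
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, job_id, run_seconds):
        done = threading.Event()
        with self._cv:
            self._jobs.append((job_id, run_seconds, done))
            self._cv.notify()
        return done  # returns immediately, like a non-blocking enqueue

    def _worker(self):
        while True:
            with self._cv:
                while not self._jobs:
                    self._cv.wait()
                job_id, run_seconds, done = self._jobs.popleft()
            time.sleep(run_seconds)  # pretend the kernel is executing
            self.completed.append(job_id)
            done.set()


def run_pipelined(device, total_jobs, depth, run_seconds=0.01):
    """Keep up to `depth` jobs in flight; returns the peak in-flight count."""
    in_flight = deque()
    peak = 0
    next_job = 0
    while next_job < total_jobs or in_flight:
        # Top up the window so the device always has work queued.
        while next_job < total_jobs and len(in_flight) < depth:
            in_flight.append(device.submit(next_job, run_seconds))
            next_job += 1
        peak = max(peak, len(in_flight))
        # Sleep (no spinning) until the oldest job has finished.
        in_flight.popleft().wait()
    return peak


device = FakeDevice()
peak = run_pipelined(device, total_jobs=8, depth=3)
```

The window depth of 3 is arbitrary; the point is only that the host thread can be descheduled for roughly `depth` kernel durations before the simulated device runs dry.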

                         

                        Originally posted by: himanshu.gautam

                        Dr. Haribo,

                        I do not see high CPU usage while running my kernels on single or multiple GPUs.

                        I am using 2 Barts GPUs on 64-bit Vista.

                        You said you see high CPU usage while running something on a single device of the 6990. Do you see the same with the SDK samples?

                        Hmm, that gives me hope, the fact that you don't see the same on Vista.

                        But I have tried running a few different OpenCL programs that behave differently and are written in different languages, and they all exhibit the same busy-wait behavior.

                        I have not tried running the SDK samples. Maybe I will give that a shot, but I don't expect it will be any different.

                  • himanshu.gautam

                     

                    Originally posted by: Dr.Haribo

                    1. As soon as I enqueue a kernel it starts to run on the GeForce, but on the Radeon it just sits there with state CL_QUEUED indefinitely, unless I call some blocking operation. clWaitForEvents, clFinish, clFlush will get things moving, but they block with a busy-wait.

                     

                    This is odd, because on page 27 (1-13) I find this:

                     

                    "Unless the GPU compute device is busy, commands are executed immediately."

                     

                    in the following document:

                     

                    http://developer.amd.com/gpu/AMDAPPSDK/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf

                     

                    Is it a bug?



                    This is not a bug. An OpenCL implementation always tries to batch many enqueued commands together before sending them to the GPU. So it is always advisable to call clFlush/clFinish after any command that you want to start immediately.

                    For the dual Radeons I was less lucky. There seems to be no way to get a kernel to start executing without also blocking my thread until the kernel finishes. Maybe someone can point me to a way, if there is one? With many kernels enqueued in the same queue, calling clFlush on the queue blocks until all of them finish.

                     

                    I think you are trying to run kernels on the NVIDIA device and on both devices of the 6990. Do you see the same thing when you run a kernel on only the first device of the 6990? The second device of the 6990 is not officially supported.

                     

                     

                     

                      • Dr.Haribo

                         

                        Originally posted by: himanshu.gautam

                         

                        This is not a bug. An OpenCL implementation always tries to batch many enqueued commands together before sending them to the GPU. So it is always advisable to call clFlush/clFinish after any command that you want to start immediately.

                         

                        I was getting the impression that clFlush() was waiting until the kernel finished executing. But I guess the kernel just happened to finish quickly.

                         

                        I think you are trying to run kernels on the NVIDIA device and on both devices of the 6990. Do you see the same thing when you run a kernel on only the first device of the 6990? The second device of the 6990 is not officially supported.

                         

                         

                         

                        Yes, I'm trying to run kernels on the GTX 580 and both AMD Cayman GPUs. I see the same thing when only running on the first Cayman from the 6990.

                        I tried disabling the GTX 580 in device manager, but it made no difference. Then I tried disabling one GPU from the 6990 in device manager, but that caused both of them to disappear from OpenCL. I can't find any option in Catalyst Control Center to disable one Cayman, or turn off crossfire, or anything like that. But even if it would help, I have to keep in mind that if I write a program others will use, the ones who own a 6990 will want to use both GPUs.

                        Meanwhile, someone said on the NVIDIA forums that NVIDIA uses busy-waiting in their OpenCL implementation and there's no way around it. They also said if you use CUDA directly you can choose between CPU busy-wait and hardware interrupts to react to GPU events.

                        Is this how it works on AMD too? (always busy-wait under OpenCL) Or am I just one of the unlucky ones seeing a CPU core go to 100% when I start OpenCL work on a GPU?

                         

                          • ED1980

                            For me, the same program runs at 1~5% CPU load when there is only 1 GPU (HD6950) in the system. After adding a second HD6950, every copy of the program loads one CPU core to 100%, even when only one copy of the program is running. On top of that, if I manually pin all the programs to a single CPU core, their performance does not drop. I think the problem is in the AMD driver and shows up when several GPUs are present...

                            The problem occurs only under Windows; in Linux (Ubuntu) everything works normally...