How do I execute kernels without 100% CPU busy-wait?

Discussion created by Dr.Haribo on May 21, 2011
Latest reply on May 28, 2011 by Dr.Haribo

OS: 64-bit Windows 7

CPU: Intel Core 2 Duo E8400

GPU 1: AMD Radeon 6990 (dual Cayman) with AMD_Catalyst_11.5a_Hotfix_8.85.6RC2_Win7_May13

GPU 2: nVidia GeForce GTX 580 with 270.61 drivers (2011.04.18)

That's an admittedly aging dual-core CPU driving three top-of-the-line GPUs. Still, I don't see why it should take 100% CPU to run a SHA-256 hashing kernel, a compute-intensive task with very little data to transfer to or from the CPU. Actual CPU usage should be more like 0.01%.

I am having problems keeping all three GPUs running full speed and no matter what I try I cannot get the CPU load down.

Since any blocking OpenCL call triggers 100% CPU usage, I tried polling an event on the running kernel, with thread sleeps between polls. I ran into two problems:

1. As soon as I enqueue a kernel it starts to run on the GeForce, but on the Radeon it just sits there in state CL_QUEUED indefinitely, unless I call some blocking operation. clWaitForEvents, clFinish, or clFlush will get things moving, but each of them blocks with a busy-wait.

This is odd, because on page 27 (1-13) I find this:

"Unless the GPU compute device is busy, commands are executed immediately."

in the following document:


Is it a bug?

2. Even with all my threads asleep, there is no improvement. There's always a thread in amdocl.dll and nvcuda.dll eating all available CPU cycles. Do both drivers always busy-wait to detect events from the GPUs? I see this in all OpenCL programs, not just my own.
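
For reference, the poll-and-sleep loop described above can be sketched roughly like this. It is plain Python with the OpenCL part replaced by a hypothetical `get_status` callable, which stands in for a clGetEventInfo query of CL_EVENT_COMMAND_EXECUTION_STATUS; the function name, the backoff policy, and the intervals are assumptions for illustration, not anything the drivers prescribe. It only shows the host-side pattern and does nothing about a busy-wait inside the driver's own thread.

```python
import time

# Execution status values as defined in the OpenCL headers.
CL_COMPLETE, CL_RUNNING, CL_SUBMITTED, CL_QUEUED = 0, 1, 2, 3


def wait_by_polling(get_status, interval=0.001, max_interval=0.05, sleep=time.sleep):
    """Poll get_status() until it reports CL_COMPLETE, sleeping between polls.

    get_status is a hypothetical callable (assumption) standing in for an
    event-status query. Returns the number of polls that were needed.
    """
    polls = 0
    while True:
        status = get_status()
        polls += 1
        if status < 0:
            # Negative values signal that the command terminated abnormally.
            raise RuntimeError(f"command failed with status {status}")
        if status == CL_COMPLETE:
            return polls
        sleep(interval)  # give the CPU back instead of spinning
        interval = min(interval * 2, max_interval)  # exponential backoff, capped
```

On the Radeon the status would presumably stay at CL_QUEUED in this loop until something flushes the queue, which is exactly problem 1.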

After realizing that both drivers insist on using 100% CPU cycles as long as the GPUs are working, I thought I could at least make the OpenCL kernels run smoothly even if it lags everything else on the computer. The idea was to queue several kernels at once. This way, even if the thread running a GPU didn't get any CPU cycles for a while, the GPU would still have queued kernels to run until the CPU thread could catch up and enqueue more.

I tried this with one thread per OpenCL device (3 in all), first with several kernel invocations on the same queue, later with several queues with only one kernel on each queue.

It works for the GeForce. As long as the CPU thread gets to run every once in a while it can keep plenty of work stacked up for the GPU to run at full speed.
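
The queue-ahead idea looks roughly like the sketch below, again in plain Python with the device simulated. `enqueue` and `is_done` are hypothetical callables standing in for clEnqueueNDRangeKernel and an event-status check; `depth`, `pause`, and the function name are likewise assumptions made for the example.

```python
import time
from collections import deque


def keep_queue_full(enqueue, is_done, total, depth=4, pause=lambda: time.sleep(0.001)):
    """Issue `total` kernel invocations, keeping up to `depth` in flight.

    enqueue(i) -> handle and is_done(handle) -> bool are hypothetical
    stand-ins (assumptions) for the real enqueue and status-query calls.
    Returns the largest number of invocations that were in flight at once.
    """
    in_flight = deque()
    issued = 0
    peak = 0
    while issued < total or in_flight:
        # Top up, so the GPU still has work if this thread is starved for a while.
        while issued < total and len(in_flight) < depth:
            in_flight.append(enqueue(issued))
            issued += 1
        peak = max(peak, len(in_flight))
        # Retire finished invocations from the front (in-order queue assumed).
        while in_flight and is_done(in_flight[0]):
            in_flight.popleft()
        pause()  # sleep rather than spin before checking again
    return peak
```

A larger `depth` buys more slack against the host thread being descheduled, at the cost of results arriving later.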

For the dual Radeons I was less lucky. There seems to be no way to get a kernel to start executing without also blocking my thread until the kernel finishes. Maybe someone can point me to a way, if there is one? With many kernels enqueued in the same queue, calling clFlush on the queue blocks until all of them finish.

The answer seems to be to keep several CPU threads per GPU, each with its own queue, pushing one kernel invocation at a time to the GPU. I guess I will try that next.
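
As a rough sketch of that plan, assuming a hypothetical blocking `run_one(job)` that enqueues one kernel on the calling thread's own queue and waits for it (the name, the job list, and the thread count are all made up for illustration):

```python
import threading


def run_workers(run_one, jobs, threads_per_device=2):
    """Drain `jobs` with several host threads, each blocking on one kernel at a time.

    run_one(job) is a hypothetical blocking call (assumption): enqueue one
    kernel on this thread's own command queue and wait for it to finish.
    Returns the collected results in completion order.
    """
    lock = threading.Lock()
    pending = list(jobs)
    results = []

    def worker():
        while True:
            with lock:
                if not pending:
                    return
                job = pending.pop(0)
            out = run_one(job)  # blocks only this thread; the others keep feeding the GPU
            with lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(threads_per_device)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

While one thread is stuck inside a blocking call, another can already have its kernel submitted, so the GPU should not sit idle between invocations. Whether each blocked thread still burns a full core depends on the driver.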

I've already spent too much time on this, and it all seems too silly; surely I must be doing something wrong. What's the proper way to deal with this?