Regarding the GPU app's increased sensitivity to CPU load: try pinning your GPU app's affinity to a single core. In our experiments it helps a lot.
Very interesting observation regarding the number of enqueued kernels and the increase in CPU load.
Could you clarify: do you call clFlush() between the enqueues, or only make the enqueue calls themselves?
With the latest drivers (13.11beta9v2), the sensitivity to high CPU load has dropped significantly, at least from what I see now. However, I had to change my kernels quite a bit to make them work on this driver (I don't use atomics anymore; they kept killing the driver or causing blue screens). So I can't say for sure that this part is fully OK now, but a test program loses less than 1% performance when all 4 CPU cores are fully loaded, compared to an idle system.
BTW, when the cores are already fully loaded with something else, the high number of enqueued kernels causes only 1-3% CPU load. It only consumes CPU time when some is available.
The high CPU load occurs whether or not I use clFlush. I have two interlocked kernels, and I call clFlush after the first one (and clFinish after the loop). Enqueueing all kernels at once and calling clFinish only once behaves the same.
Did you have a chance to test this on a discrete GPU? While experimenting with longer-running kernels I noticed that the high CPU load already kicks in with 8 kernels "in flight".
I see, thanks. And regarding increased CPU consumption when idle - yes, I see the same on a C-60-based netbook with 12.8 (?) Mobility drivers (hard to tell which drivers are actually installed; GPU-Z fails to report the Catalyst version).
With an idle CPU, the benchmark's CPU times are much higher than on a loaded system (though the elapsed time is lower).
P.S. I haven't checked the latest beta so far, but with the released 13.4 drivers my app experiences a slowdown compared with the 12.8 drivers. This slowdown seems to come entirely from a worse OpenCL compiler, though.
When I use pre-compiled binaries that were generated under Cat 12.8 with Cat 13.4, there is no performance drop. Strange compiler degradation.
Have you tried that dumb but effective workaround where you:
- pack the work into one very long kernel (let's say 500 ms),
- set up 2 contexts,
- run only one job per context?
First you start a job on ctx1, then sleep until that job is almost finished, then start the next job on ctx2, after which you can read back the results from ctx1. Then you sleep again until ctx2 finishes, and start a new job on ctx1 before processing ctx2's results.
99 percent GPU utilization and 1 percent CPU usage. It works even on the old CAL drivers, where 100 percent CPU was guaranteed if you had more than one job per context.
But unfortunately it requires simple kernels with predictable running times.
This would work, but it doesn't fit my use case very well (I have difficulty predicting the kernel run time). The workaround I have implemented now attaches an event to every 3rd pair of kernels and then waits for completion before scheduling the next 3 pairs. With this, the total performance drops by ~1%, but the CPU is kept free. I will try attaching the event to the second of the 3 pairs, waiting for it, and then again scheduling 3 pairs with the event on the second. This should generate enough overlap to win back the 1% performance 😉