Recently I changed an algorithm in my app to keep more data directly on the GPU, which considerably decreased the number of buffer mappings (and each buffer mapping was also a sync point).
I expected an improvement in run time due to increased GPU load, and a decrease in CPU time as well, since the CPU now has less work to do.
But almost all I got was a sharp increase in CPU time. Before the change, CPU time was only a small fraction of elapsed time; now CPU time is almost equal to elapsed time. That is, nearly 100% CPU usage for the whole app run.
I tried to avoid this CPU usage by putting the working thread to sleep before sync points, with no success. Although the logs show that the app reads the event status and sleeps until it reaches the corresponding state, CPU time did not decrease.
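For reference, the poll-and-sleep loop amounts to something like this (a minimal sketch, not the app's exact code; `evt` is assumed to be a marker event enqueued after the kernel chain, and error handling is omitted):

```c
#include <CL/cl.h>
#include <windows.h>

/* Wait for `evt` without a blocking OpenCL call: flush the queue so the
   work is actually submitted, then poll the event status, sleeping 1 ms
   between checks so the worker thread yields the CPU instead of spinning. */
static void wait_with_sleep(cl_command_queue queue, cl_event evt)
{
    cl_int status = CL_QUEUED;
    clFlush(queue);
    while (status != CL_COMPLETE && status >= 0) {  /* negative = error code */
        clGetEventInfo(evt, CL_EVENT_COMMAND_EXECUTION_STATUS,
                       sizeof(status), &status, NULL);
        if (status != CL_COMPLETE)
            Sleep(1);
    }
}
```

Even with the worker thread sleeping like this, the CPU time reported for the process did not drop, because the consumption came from the driver's own thread, not this one.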
So I used Process Explorer to find out who was consuming the CPU. It was an AMD driver thread with the following stack:
The picture clearly shows that this thread is the main CPU consumer in the app's process:
So the question is: how can this behavior be avoided? It seems that without a large number of sync points between the GPU and the host code, the AMD driver goes mad and starts using a whole CPU core for its own needs.
EDIT: Here are CodeXL timeline screenshots showing how the timeline looked before
and with a marker event plus a Sleep(1) loop until the event is reached.
As one can see (profiling was done on a C-60 APU, but a discrete HD 6950 under different drivers shows the same CPU usage pattern), there is a fairly long interval of ~80 s where the GPU works on its own without synching with the host. And that is exactly where the AMD driver starts to consume a whole CPU core.
EDIT2: it's very similar to the issue described here: Re: Cat13.4: How to avoid the high CPU load for GPU kernels? by .Bdot
Any new cures since 2013?..
I'm not sure about the issue; I need to check with the driver/runtime team. Could you please provide a reproducible test case? Or may I take the test case from Re: Cat13.4: How to avoid the high CPU load for GPU kernels?, if it is still applicable to you? Please specify whether I need any particular hardware or driver to reproduce the issue.
Since that post I have implemented something similar to the workaround mentioned in the two-year-old thread, just a simpler one: call clFinish after every Nth algorithm iteration.
And it worked as expected:
WU : PG1327_v7.wu
Elapsed 72.004 secs
CPU 68.360 secs
Elapsed 72.868 secs
CPU 17.457 secs
MB7_win_x86_SSE3_OpenCL_ATi_APU_r2935_sync_iter.exe -sync_iter 1 :
Elapsed 73.258 secs
CPU 18.159 secs
MB7_win_x86_SSE3_OpenCL_ATi_APU_r2935_sync_iter.exe -sync_iter 2 :
Elapsed 73.070 secs
CPU 17.613 secs
MB7_win_x86_SSE3_OpenCL_ATi_APU_r2935_sync_iter.exe -sync_iter 3 :
Elapsed 73.071 secs
CPU 20.764 secs
MB7_win_x86_SSE3_OpenCL_ATi_APU_r2935_sync_iter.exe -sync_iter 4 :
Elapsed 73.086 secs
CPU 37.908 secs
MB7_win_x86_SSE3_OpenCL_ATi_APU_r2935_sync_iter.exe -sync_iter 5 :
Elapsed 72.977 secs
CPU 42.713 secs
As one can see, with clFinish() after every 2 iterations (each iteration includes a few kernels, including FFT ones, so not just 1-2 kernels), CPU consumption returned to the original level (the "non_SoG" build).
But there is also no speedup for the new build under these conditions.
So my little test just validated the claim of the old thread: if the queue grows long enough, the driver starts to consume CPU in enormous quantities.
I think the test case from the old thread will demonstrate this just as well as my app does.
I also tested this behavior with Windows 10 x64 and the AMD drivers it installed. I suppose these should be the "latest available", right?
So the issue exists in the latest AMD drivers.
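For clarity, the sync_iter workaround amounts to something like this (a sketch with hypothetical names; `enqueue_iteration` stands in for the app's real per-iteration kernel enqueues):

```c
#include <CL/cl.h>

/* Hypothetical per-iteration enqueue; in the real app this enqueues a
   few kernels, including FFT ones. */
void enqueue_iteration(cl_command_queue q, int i);

/* Run `n_iters` iterations, forcing a full sync every `sync_iter`
   iterations so the in-flight queue never grows long enough to trigger
   the driver's busy-wait behavior. */
void run_chain(cl_command_queue q, int n_iters, int sync_iter)
{
    for (int i = 0; i < n_iters; ++i) {
        enqueue_iteration(q, i);
        if (sync_iter > 0 && (i + 1) % sync_iter == 0)
            clFinish(q);  /* blocks until all enqueued work completes */
    }
    clFinish(q);          /* final sync before reading results back */
}
```

The trade-off in the numbers above: a small sync_iter keeps CPU time low but throws away any potential overlap, which is presumably why the new build shows no speedup under these conditions.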
When I ran that test case, I got a similar observation. However, the number of enqueued kernels and the CPU percentage seem to vary with the hardware configuration. I'm not sure whether there is any limitation or not. I'll check with the driver team and, if needed, file a bug report against this issue.
Thanks for pointing to the old thread.
Do a read-back of exactly the data you're waiting for now.
Except that instead of blocking on a map or polling for the event, sleep on the event, possibly doing a flush (but never a finish!).
Also keep in mind that some hardware/driver/OS combinations have issues with very long tasks. I have difficulty understanding how a properly written kernel can take 80 s to run.
Those drivers/OSes will trip a watchdog and kill the CL context. Events will become invalid and wait calls on them will error out.
In my experience this shouldn't be happening on Win7, let alone 10, and since, as I understand it, you're not waiting on an event, that shouldn't be happening.
But since I had the same symptom in the past, I'm interested in checking whether there could be a connection.
Please look at the pictures attached to the initial post to better understand the execution chain in question.
There is no 80 s kernel; there is an 80 s chain of many kernels. And the CPU load occurs long before any synching occurs.
As I stated in the initial post, the app's worker thread is put to sleep, but the driver thread uses the CPU.
Addressing the actually described situation, instead of some other model one, would help a lot, thanks.
Another workaround for this issue (only a "workaround" because its implementation did not improve overall performance) is to use two queues for that chain of kernels and link them via events where needed. This way no blocking sync with the host is required, only some ordering on the device. Nevertheless, this decreased CPU consumption, once more proving that the excessive CPU usage is some buggy condition in the runtime/driver.
CPU time consumption dropped in this case to a level only slightly above the initial one (where blocking mapping of flags was used).
As I understand it, using two queues could allow kernel pairing on the device, which could improve performance in theory.
In reality I don't see any performance improvement on the HD 6950 GPU, unfortunately. Maybe results on GCN would differ?..
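For reference, the two-queue ordering described above can be sketched like this (hypothetical kernel names; error handling omitted):

```c
#include <CL/cl.h>

/* Two in-order queues on the same device; the dependency between them is
   expressed with an event instead of a blocking host-side sync. */
void run_two_queues(cl_command_queue q1, cl_command_queue q2,
                    cl_kernel producer, cl_kernel consumer, size_t gws)
{
    cl_event produced;

    /* Stage 1 on queue 1. */
    clEnqueueNDRangeKernel(q1, producer, 1, NULL, &gws, NULL,
                           0, NULL, &produced);

    /* Stage 2 on queue 2 waits on the stage-1 event: the ordering happens
       entirely on the device, with no blocking call on the host. */
    clEnqueueNDRangeKernel(q2, consumer, 1, NULL, &gws, NULL,
                           1, &produced, NULL);

    /* Submit both queues without waiting for completion. */
    clFlush(q1);
    clFlush(q2);

    clReleaseEvent(produced);
}
```

In principle this also lets the runtime overlap independent kernels from the two queues, though as noted above no measurable speedup appeared on the HD 6950.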