There is a part of the app I'm working on where a large number of data arrays are folded (reduced) in different ways, the data in the resulting arrays is compared against a threshold, and the outcome of this comparison is reported from the GPU back to the host.
The sequence was: transform the N arrays, compare against the threshold, then read the results back to the host (one element per thread/array). This sequence was repeated, let's say, M times.
Now I have changed the algorithm to perform all M transforms on the N arrays without reporting back to the CPU. Results are accumulated on the GPU, and only after the full sequence completes are they reported to the host (again via a blocking ReadBuffer).
Because the second approach has far fewer sync points, I expected a performance increase. And indeed, there is one on high-end and mid-range GPUs.
But low-end GPUs show a very strange picture: a huge increase in CPU time. Profiling under CodeAnalyst showed that this increase can be attributed to more time spent in OS functions; the app process's own CPU usage stays the same.
So, what we have: on a low-end GPU, switching from many "short" sync points (a sync point is always implemented as a blocking ReadBuffer) to a single "long" sync point leads to a big increase in the CPU time the app requires. The amount of work performed inside the kernels on the GPU stays almost the same in both cases; only the way of synchronizing with the host changed considerably. The second method has much better GPU utilization, of course, but this CPU time increase makes the overall performance gain questionable.
I would like to hear an explanation of this observation from AMD specialists. Maybe it would be possible to share some details of how synchronization is implemented inside the runtime/drivers, so we can write more effective applications? Reducing the number of GPU<->host transfers is the number-one suggestion in every GPU computing optimization guide, but here it may even result in a performance drop (because of the increased CPU consumption).
And of course I tried a different kind of synchronization in the second case: Sleep(1) plus polling an event for completion. This decreased CPU consumption a little, but it is still higher than in the first case... (it also increased elapsed time to the point where overall performance became worse than with the first approach).