Though this driver solves the bug described here:
Kernel with local memory usage gives different results on some hardware
it introduces another issue: an almost 2-fold increase in CPU consumption for my app.
It was shown that CPU consumption can be considerably decreased by adding Sleep() calls in some places inside the app. But such a strategy leads to a vast increase in elapsed time (and lowers GPU usage, of course).
That is, most probably this driver adds new synchronization (which fixes the previous bugs), but that synchronization is implemented as CPU busy-wait loops, which show up whenever the app calls certain blocking OpenCL functions. If a delay is artificially inserted between the kernel/memory-transfer enqueue and the blocking call (as a Sleep(1) call does, for example), the busy-wait loop runs for much less time and CPU consumption drops.
The questions are:
1) Is the AMD driver team aware of the CPU consumption that its current approach to internal synchronization incurs?
2) Can we hope that this CPU overhead will be reduced in the release driver version?
3) What are the recommended ways to reduce this CPU overhead at the application level (besides the one I already found experimentally, which leads to GPU performance degradation, especially for top cards)?