It looks like the events are either not reporting timings correctly, or they are not being used correctly. Can you provide a testcase in which this timing discrepancy can be reproduced?
Also mention your SDK version, Catalyst driver version, OS, etc.
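For a testcase, one common way to time a kernel is to read the event's timestamps with clGetEventProfilingInfo (CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END) on a queue created with CL_QUEUE_PROFILING_ENABLE. Those timestamps are in nanoseconds. Below is a minimal sketch of the conversion step only. The helper name elapsed_ms is hypothetical, and the OpenCL calls are left out so the snippet compiles on its own.

```c
#include <stdint.h>

/* Convert two profiling timestamps (in nanoseconds) into elapsed
 * milliseconds. The values are assumed to come from
 * CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END of the
 * same event. Returns -1.0 if end_ns is earlier than start_ns,
 * which usually means the timestamps were read from different
 * events or before the command finished. */
double elapsed_ms(uint64_t start_ns, uint64_t end_ns)
{
    if (end_ns < start_ns)
        return -1.0;
    return (double)(end_ns - start_ns) / 1.0e6;
}
```

If the result is negative or wildly different from a host-side timer wrapped around clFinish, that is worth including in the testcase.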
I guess memory-bound kernels may be able to get better performance on APUs, since the memory bandwidth available is higher here.
I assume you are testing on a Llano-based APU. Keep in mind that while system memory bandwidth is likely lower than a discrete GPU's local memory bandwidth, if your algorithm makes good use of the caches in the APU, then performance can still be quite good.
I don't have any insight into what your particular test is doing, but there's more to performance than just external memory bandwidth, even if your test is bandwidth limited.
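To check how close a bandwidth-limited kernel gets to the memory system's limit, it can help to compute the effective bandwidth from the measured kernel time. This is only a sketch under assumptions: effective_gb_per_s is a hypothetical helper, bytes_moved is whatever you count as bytes read plus bytes written by the kernel, and elapsed_ns is the kernel time in nanoseconds.

```c
#include <stdint.h>

/* Effective bandwidth in GB/s (1 GB = 1e9 bytes).
 * bytes per nanosecond is numerically equal to GB per second,
 * so no extra scaling factor is needed.
 * Returns 0.0 when elapsed_ns is 0 to avoid dividing by zero. */
double effective_gb_per_s(uint64_t bytes_moved, uint64_t elapsed_ns)
{
    if (elapsed_ns == 0)
        return 0.0;
    return (double)bytes_moved / (double)elapsed_ns;
}
```

A figure above what system memory can deliver would suggest the kernel is being served largely from cache, since bytes_moved counts every access the kernel makes, not just the ones that reach external memory.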