I've been doing some performance measurements on a Fusion GPU using OpenCL events. The chip is an A8-3850 APU with Catalyst 11.10 and APP SDK 2.5-RC2 on OpenSUSE Linux.
I collect all four profiling timestamps (queued, submit, start, end) for a single OpenCL kernel with different input sizes. "queued-to-submit" takes about 60-80 microseconds in all cases (which seems normal). However, "submit-to-start" grows with the input size and is always roughly equal to "start-to-end" (both between 0.3 and 4.2 seconds).
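To make the interval names unambiguous, here is a minimal sketch of how the three intervals follow from the four timestamps. It assumes the values are the nanosecond counters that clGetEventProfilingInfo returns for CL_PROFILING_COMMAND_QUEUED, CL_PROFILING_COMMAND_SUBMIT, CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END. The function name and the sample numbers are made up for illustration and are not measured data.

```python
def profiling_intervals(queued_ns, submit_ns, start_ns, end_ns):
    """Return the three intervals between the four event timestamps, in microseconds.

    All inputs are assumed to be nanosecond timestamps from the same event.
    """
    return {
        "queued_to_submit_us": (submit_ns - queued_ns) / 1000.0,
        "submit_to_start_us": (start_ns - submit_ns) / 1000.0,
        "start_to_end_us": (end_ns - start_ns) / 1000.0,
    }


# Hypothetical timestamps, chosen only to illustrate the arithmetic.
example = profiling_intervals(0, 70_000, 570_000, 200_570_000)
for name, value in example.items():
    print(f"{name}: {value:.1f}")
```

Note that the timestamps are only available if the command queue was created with CL_QUEUE_PROFILING_ENABLE and the event has completed.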
I did exactly the same experiment on a machine with a Radeon HD5970 (same software setup). This time "submit-to-start" only takes around 500 microseconds for all inputs and "start-to-end" grows from 0.2 to 2.1 seconds.
The results on the Radeon GPU make sense, but the profiling on the APU seems broken... Or has anyone got another explanation?