I have a few questions about these profiler results:
1) How is it possible to get simultaneous writes in 2 queues to the same device? There are 3 writes to the same device that overlap in time.
2) Why is there such a big gap between the 2 kernel executions? What prevents the next kernel from executing? And, as a result of that gap, why is the ReadBuffer API call time so long?
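To make both questions concrete, here is a minimal sketch of how the per-command timestamps could be split into phases and checked for overlap. It assumes the four nanosecond counters that OpenCL reports through clGetEventProfilingInfo (CL_PROFILING_COMMAND_QUEUED, SUBMIT, START, END) have already been read out. The `EventTimes` type, the helper names and all numbers are made up for illustration and are not taken from my session.

```python
from typing import NamedTuple


class EventTimes(NamedTuple):
    """Hypothetical container for one command's profiling counters, in ns."""
    queued: int
    submit: int
    start: int
    end: int


def phases_ms(t: EventTimes) -> dict:
    """Split one command's lifetime into its three phases, in milliseconds."""
    ns_per_ms = 1_000_000
    return {
        "queued_to_submit": (t.submit - t.queued) / ns_per_ms,
        "submit_to_start": (t.start - t.submit) / ns_per_ms,
        "execution": (t.end - t.start) / ns_per_ms,
    }


def overlaps(a: EventTimes, b: EventTimes) -> bool:
    """True if the execution intervals [start, end) of two commands intersect."""
    return a.start < b.end and b.start < a.end


# Made-up timestamps for two writes issued from two different queues.
write_a = EventTimes(queued=0, submit=100_000, start=200_000, end=900_000)
write_b = EventTimes(queued=50_000, submit=150_000, start=500_000, end=1_200_000)

print(phases_ms(write_a))
print("writes overlap:", overlaps(write_a, write_b))
```

A large submit_to_start value would suggest the command was waiting on the device rather than executing, which is the kind of gap I am asking about in 2).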
EDIT: here the write overlap can be seen more clearly:
And another picture that is not quite clear:
There are a buffer read and a buffer map, both 4 KB in size.
The read is performed on a usual read-write buffer, while the mapping uses host pinned memory. Previously (via the command-line sprofile 2.3) I saw a considerable speedup for such a map operation because it is zero-copy. But here both the read and the map take about the same (and very large) amount of time...
Can you share your .atp file through the AMD help desk (http://developer.amd.com/support/KnowledgeBase/pages/HelpdeskTicketForm.aspx?Category=7) and let us know once you have submitted it?
Ok, I will try to find that session and upload it.
For now, here is another "funny" picture from my poor C-60:
The speed of the GPU->GPU copies (the same single GPU in the system) is just unbelievable. My kernel is simply lost between these 8 KB copies. ~37 KB per second, the speed of light.
What could cause this? No other OpenCL programs were running in the background. Could some Flash plugin in the background create such an effect? Could it be a hugely overloaded bus (and how could it become so overloaded)? Is it possible that I am seeing the effect of GPU memory swapping under the Win7 WDM driver?
Please note, this is an "inside GPU" transfer, not a host<->GPU transfer. So it is not an overloaded PCI-e bus but the memory bus in the C-60 APU...
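To put the ~37 KB/s figure in perspective, here is the back-of-the-envelope arithmetic for one 8 KB copy. It is only a sketch: the rate is the approximate value I read off the picture, and 1 KB is assumed to be 1024 bytes. The helper name `copy_time_seconds` is made up for illustration.

```python
KB = 1024  # assumption: 1 KB = 1024 bytes


def copy_time_seconds(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Time needed to move size_bytes at a constant transfer rate."""
    return size_bytes / rate_bytes_per_s


# One 8 KB device-to-device copy at roughly 37 KB/s.
t = copy_time_seconds(8 * KB, 37 * KB)
print(f"{t * 1000:.0f} ms per 8 KB copy")  # prints "216 ms per 8 KB copy"
```

So each copy takes on the order of a fifth of a second, which is why the kernel is barely visible next to them.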
For comparison, here is the same fragment of code under normal conditions:
Here the memory copies take just a small fraction of the elapsed time; most of the time is taken by the kernels...