And another picture I don't quite understand:
there is a buffer read and a buffer map, both 4 KB in size.
The read is performed on an ordinary read-write buffer, while the map uses host pinned memory. Previously (via the command-line sprofile 2.3) I saw a considerable speedup for such a map operation because it is zero-copy. But here both the read and the map take about the same (and very large) amount of time...
Maybe you are using two or more graphics cards. Could you please provide some more information about this, such as the APP Profiler session file?
The logs were acquired on a C-60 APU with APP Profiler 2.4.
Can you share your atp file through the AMD help desk (http://developer.amd.com/support/KnowledgeBase/pages/HelpdeskTicketForm.aspx?Category=7) and let us know once you have submitted it?
Ok, I will try to find that session and upload it.
For now, here is another "funny" picture from my poor C-60:
The speed of the GPU->GPU copy (the same, single GPU in the system) is just unbelievable. My kernel is simply hidden between these 8 KB copies. ~37 KB per second, the speed of light.
What could cause this? There are no other OpenCL programs running in the background. Could some Flash plugin in the background create such an effect? Could it be a hugely overloaded bus (and how could it become so overloaded?) Is it possible that I am seeing the effect of GPU memory swapping under the Win7 WDM driver?
Please note that this is an "inside GPU" transfer, not a host<->GPU transfer. So it is not an overloaded PCIe bus but the memory bus in the C-60 APU...
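To put the numbers above in perspective, here is a rough sanity check, only a sketch. It assumes "KB" means 1024 bytes and that the start/end timestamps are in nanoseconds, as OpenCL event profiling reports them. The function names are made up for illustration and are not part of APP Profiler or OpenCL.

```python
def throughput_kib_per_s(num_bytes, start_ns, end_ns):
    """Effective throughput of one transfer, in KiB per second.

    start_ns / end_ns are assumed to be nanosecond timestamps, e.g. the
    start and end times a profiler reports for a single copy command.
    """
    elapsed_ns = end_ns - start_ns
    if elapsed_ns <= 0:
        raise ValueError("end timestamp must be later than start timestamp")
    return (num_bytes / 1024) / (elapsed_ns / 1e9)


def implied_duration_s(num_bytes, kib_per_s):
    """Time one transfer must have taken to yield the given throughput."""
    return (num_bytes / 1024) / kib_per_s


# Figures quoted in the post: 8 KB copies at roughly 37 KB per second.
copy_bytes = 8 * 1024
duration = implied_duration_s(copy_bytes, 37.0)
print(f"{duration:.3f} s per 8 KiB copy")
```

At that rate a single 8 KB copy would occupy about a fifth of a second, which is why the kernel disappears between the copies in the timeline.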
For comparison, the same fragment of code under normal conditions:
Here the memory copy takes just a small fraction of the elapsed time; most of the time is taken by the kernels...