1. The kernel gets deleted only when the application exits or the brook::destroyAllDevices() method is called. Stream::read calls are asynchronous, and a kernel has to wait for Stream::read to finish if there is any dependency. You might be seeing that slowdown because your kernel invocation is waiting on the stream read. You should call Stream::finish() after Stream::read() to measure correct timings.
2. A kernel call is asynchronous: it returns after sending the kernel execution command to the GPU. The GPU finishes the first kernel call before executing the second, even if the same kernel is called twice. Also, Brook+ will wait for the first kernel to finish if the second kernel depends on any output stream of the first.
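To illustrate why the timings look wrong without an explicit wait, here is a minimal sketch in plain C++. It does not use the Brook+ API: `launch_async`, `measure`, and `Timings` are hypothetical names, the "kernel" is just a sleep on another thread via `std::async`, and `done.wait()` only stands in for the role Stream::finish() plays. The point is that stopping the timer right after an asynchronous call measures submission cost, not execution time.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

using Clock = std::chrono::steady_clock;

// Hypothetical stand-in for an asynchronous kernel launch (NOT Brook+ API):
// returns immediately while the "work" runs on another thread.
std::future<void> launch_async(std::chrono::milliseconds work) {
    return std::async(std::launch::async,
                      [work] { std::this_thread::sleep_for(work); });
}

// Milliseconds elapsed since t0.
double ms_since(Clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(Clock::now() - t0).count();
}

struct Timings {
    double submit_ms;    // time until the asynchronous call returned
    double complete_ms;  // time until the work actually finished
};

Timings measure(std::chrono::milliseconds work) {
    Clock::time_point t0 = Clock::now();
    std::future<void> done = launch_async(work);
    double submit = ms_since(t0);    // what you get if you stop the timer here
    done.wait();                     // assumed analogue of waiting for completion
    double complete = ms_since(t0);  // the timing you actually want
    assert(submit <= complete);
    return Timings{submit, complete};
}
```

Calling `measure(std::chrono::milliseconds(50))` should typically give a `submit_ms` far below 50 and a `complete_ms` of at least roughly 50. If the wait is left out, the missing time shows up in whatever later call has to block on the result, which matches the slowdown described in point 1.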
Thanks gaurav, that answers a lot of things.
For number 2, about the kernel call: if a kernel is called twice, will the calls be spooled in the hardware? Or does the CPU keep waiting until the device gives a ready signal? Or does the CPU just issue the calls rapidly, and when the device is ready it executes the command and then tells the CPU that the call has been executed?
As for the near-full memory performance, I think being near full is exactly the culprit. I've read that OMM gets around 540 GFLOPS at 4096x4096 on a 4870 (dunno how many gigs of memory).
2. Kernel call is asynchronous and it returns after sending kernel execution command to GPU.
Will it depend on the domain size then? I see very big variation in kernel call times when the domain size changes...