This is odd: I have a few relatively small intermediate buffers that I used to create on the device only.
I will now need to process these buffers on the host, so I tried creating them with pinned host memory,
in preparation for mapping them.
For some strange reason, performance has improved significantly.
I'm not one to look a gift horse in the mouth, but can anyone think of an explanation for this phenomenon?
Well, this turned out to be a red herring. Perf increase was not related to using pinned memory, which makes sense since
I was not actually transferring memory to host.
If you have a kernel with very high arithmetic intensity that writes to the result buffer only once, AND
you were previously forcing a full stop-and-wait synchronization between CPU and GPU, then the limiting factor was possibly the transfer/address-translation latency rather than bandwidth.
In that case, placing the buffers directly in host memory could improve performance.
If you didn't map the buffers previously, the only other explanation I can think of is exceeding VRAM (forcing the driver to page buffers out), but that doesn't seem very plausible here.
See also: Mapping device memory for more discussion on the diff between pinned and mapped memory.
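For reference, since the thread doesn't name the API or show code, here is a minimal sketch of the "kernel writes directly into a mapped pinned host buffer" pattern described above, assuming CUDA; the `scale` kernel and buffer names are hypothetical:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical kernel standing in for a high-arithmetic-intensity
// kernel that writes each result element exactly once.
__global__ void scale(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

int main(void) {
    const int n = 1 << 16;
    float *h_buf, *d_buf;

    // Pinned (page-locked) allocation that is also mapped into the
    // device address space: the kernel writes it directly, and the
    // host reads it in place after a sync, with no explicit cudaMemcpy.
    cudaHostAlloc(&h_buf, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_buf, h_buf, 0);

    for (int i = 0; i < n; ++i) h_buf[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(d_buf, n);
    cudaDeviceSynchronize();   // the single CPU-GPU sync point

    printf("%f\n", h_buf[0]);  // host processes the buffer in place
    cudaFreeHost(h_buf);
    return 0;
}
```

Whether this wins over a device-only buffer plus a copy depends on how often the buffer is touched: zero-copy access goes over the bus on every device read/write, so it tends to pay off only for small buffers written once, as in this thread's scenario.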