If your kernel has very high arithmetic intensity and writes to the result buffer only once, AND
you were forcing a full stop-and-wait sync between the CPU and GPU, then the limiting factor was possibly the latency of transfer and address translation between the two address spaces.
In that case, having the buffers in host memory directly could improve performance.
If you weren't mapping the buffers before, the only explanation could be that you were exceeding VRAM, but that does not sound very plausible.
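
For what it's worth, here is a minimal sketch of what "buffers in host memory directly" could look like, assuming CUDA (the thread does not say which API is in use, so treat that as my assumption). The kernel `heavy`, the sizes, and the iteration count are hypothetical placeholders, not anything from the original code. The result buffer is allocated as pinned, device-mapped host memory, so the kernel's single write lands in host memory and no explicit copy back is needed after the sync.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical compute-heavy kernel: many FLOPs per element, one write to the result.
__global__ void heavy(float* out, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = 1.0f + i * 1e-6f;
    for (int k = 0; k < iters; ++k) x = x * 0.999f + 0.001f;
    out[i] = x;  // the only global write; it targets mapped host memory
}

int main() {
    const int n = 1 << 16;    // illustrative size only
    float* h_out = nullptr;   // pointer the CPU uses
    float* d_out = nullptr;   // device-side alias of the same allocation

    // Pinned host allocation that is also mapped into the device address space.
    if (cudaHostAlloc((void**)&h_out, n * sizeof(float), cudaHostAllocMapped) != cudaSuccess) return 1;
    if (cudaHostGetDevicePointer((void**)&d_out, h_out, 0) != cudaSuccess) return 1;

    heavy<<<(n + 255) / 256, 256>>>(d_out, n, 1000);

    // Stop-and-wait style sync; no cudaMemcpy follows because the data is already on the host.
    if (cudaDeviceSynchronize() != cudaSuccess) return 1;

    printf("out[0] = %f\n", h_out[0]);
    cudaFreeHost(h_out);
    return 0;
}
```

This only has a chance of helping when the kernel touches the buffer rarely, as in the single-write case above; a kernel that reads or writes the buffer repeatedly would pay the interconnect latency on every access.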
Well, this turned out to be a red herring. The perf increase was not related to using pinned memory, which makes sense since
I was not actually transferring any memory to the host.