Did you warm up the runtime before doing that timing?
I.e. do something first: another copy, or maybe enqueue a kernel. Call clFinish, then do the image copy, and see if that makes any difference.
No, I didn't, but I have now.
I uploaded the image and wrote to the buffer 10 times before profiling.
These are the results:
Start: upload image
Stop: upload image (0.223629s)
Start: upload buffer
Stop: upload buffer (0.060787s)
Start: copy buffer to image
Stop: copy buffer to image (0.301985s)
The code now looks like this:
for ( ... i < 10 ...) // warmup
p.start("copy buffer to image");
Image: 3168*4752*4 / 0.223629s = 0.269273413 GB/s
Buffer: 3168*4752*4 / 0.060787s = 0.990628654 GB/s
Copy Buffer -> Image: 3168*4752*4 / 0.301985s = 0.199405083 GB/s
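For anyone who wants to redo the arithmetic above with their own timings, here is a minimal sketch in C. The function names are made up for illustration; the 4 bytes per pixel and the decimal GB (1e9 bytes) are taken from the figures above.

```c
#include <stddef.h>

/* Size in bytes of a width x height image with the given bytes per pixel. */
double image_bytes(size_t width, size_t height, size_t bytes_per_pixel) {
    return (double)width * (double)height * (double)bytes_per_pixel;
}

/* Throughput in GB/s, using decimal gigabytes (1 GB = 1e9 bytes). */
double throughput_gb_per_s(double bytes, double seconds) {
    return bytes / seconds / 1e9;
}
```

Calling throughput_gb_per_s(image_bytes(3168, 4752, 4), 0.223629) gives roughly 0.269 GB/s, matching the image upload figure.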
Are those enqueues at the top blocking? The warm-up needs to actually push work through the queue rather than leave it sitting there waiting. It's best to call clFinish and wait right after the warm-up, before the timer starts, in code like this.
Other than that, maybe there is a problem with image upload performance under Linux. I will enquire.
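To make the ordering concrete, here is a rough sketch of "warm up, finish, then time" in plain C. It is not the poster's code: the callback type and all names are hypothetical, and it uses callbacks so it can run without a device. In real host code, `enqueue` would wrap something like clEnqueueCopyBufferToImage and `finish` would wrap clFinish on the same queue.

```c
#include <time.h>

/* Hypothetical callback type. In OpenCL host code, `enqueue` would wrap a
   call such as clEnqueueCopyBufferToImage, and `finish` would wrap
   clFinish(queue). */
typedef void (*step_fn)(void *ctx);

/* Wall-clock time in seconds (C11 timespec_get). */
static double now_seconds(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Run warmup_iters untimed enqueues, drain the queue, then time one
   enqueue followed by a finish. Returns the elapsed seconds. */
double time_after_warmup(step_fn enqueue, step_fn finish, void *ctx,
                         int warmup_iters) {
    for (int i = 0; i < warmup_iters; ++i)
        enqueue(ctx);
    finish(ctx);              /* warm-up work must complete before timing */

    double t0 = now_seconds();
    enqueue(ctx);
    finish(ctx);              /* include completion, not just submission */
    return now_seconds() - t0;
}

/* Stand-ins that only count calls, so the harness can be tried without an
   OpenCL device. */
struct call_counts { int enqueued; int finished; };
void count_enqueue(void *ctx) { ((struct call_counts *)ctx)->enqueued++; }
void count_finish(void *ctx)  { ((struct call_counts *)ctx)->finished++; }
```

The point of the first finish is that the timed region then measures only the one copy, not leftover warm-up work still sitting in the queue.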