I'm transferring data from CPU memory to the GPU. I'm using pre-pinned buffers to avoid the intermediate copy to temporary pinned buffers. The buffer is created by calling clCreateBuffer with CL_MEM_USE_HOST_PTR on a pre-allocated, 4096-byte-aligned memory block. After the buffer is created, it is mapped with clEnqueueMapBuffer. The transfer itself is done with clEnqueueWriteBuffer. This all follows section 18.104.22.168 of the OpenCL Programming Guide (Option 1).
The transfer is fast only when the data I copy starts at the beginning of the memory block that was passed to clCreateBuffer. However, when I need to copy only part of that block (say 4 MB is allocated and only 1 MB needs to be transferred, starting at an offset of 500 KB), the transfer is slow: about the same speed as clEnqueueWriteBuffer from a non-pinned region, or even a bit slower. Transfers in the opposite direction (GPU -> CPU) work well with both zero and non-zero offsets.
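To make the slow case concrete, here is a minimal sketch of the host-side arithmetic for a sub-region transfer. It is plain C with no OpenCL calls, so it only shows which pointer and byte count would be handed to clEnqueueWriteBuffer. The names region_t and make_region are hypothetical, and the 4096-byte alignment check is just a diagnostic assumption, not a documented cause of the slowdown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_ALIGN 4096u  /* alignment of the pre-allocated host block */

/* Hypothetical helper type: one sub-region of a larger pinned host block. */
typedef struct {
    void  *src;         /* host pointer to pass as the `ptr` argument */
    size_t size;        /* number of bytes to transfer */
    int    src_aligned; /* 1 if src is PAGE_ALIGN-aligned, else 0 */
} region_t;

/* Fills *out for the range [offset, offset + size) inside the block.
 * Returns 0 on success, -1 if the range does not fit in the block. */
int make_region(void *block, size_t block_size,
                size_t offset, size_t size, region_t *out)
{
    assert(block != NULL && out != NULL);
    if (offset > block_size || size > block_size - offset)
        return -1;
    out->src = (unsigned char *)block + offset;
    out->size = size;
    out->src_aligned = ((uintptr_t)out->src % PAGE_ALIGN) == 0;
    return 0;
}

/* The result would then be used roughly as:
 *   clEnqueueWriteBuffer(queue, device_buffer, CL_TRUE,
 *                        0, r.size, r.src, 0, NULL, NULL);
 * where queue and device_buffer are assumed to exist already. */
```

With the numbers above (4 MB block, 1 MB transfer, 500 KB offset, taking 1 KB as 1024 bytes), the offset is 512000 bytes, which is 125 * 4096, so the source pointer is still page-aligned in that case.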
An alternative approach, using clEnqueueCopyBuffer as described in Option 2 of section 22.214.171.124, performs well regardless of whether the offset is zero, but it's a bit less convenient to use.
Is this the expected behaviour? If not, is it going to be fixed?