I've written a library that creates abstractions for buffers to reduce the complexity of using memory on the GPU in CAL. Specifically, this is done by creating a CALbuffer class whose constructor calls either calResAllocLocal1D or calResAllocLocal2D, depending on which dimensionality you specify. Once the buffer has been constructed, you call foo->readBuffer(memory) to copy data into it on the GPU.
readBuffer(void* buffer) is implemented by comparing the pitch of the mapped resource with the x dimension of the buffer. If they are equal, it essentially uses a single call to memcpy to move the data into the PCI zone. If the x dimension does not match the pitch, it makes several calls to memcpy instead.
void calutil::CALbuffer::readBuffer(void* buffer, CALuint flag)
calResMap((CALvoid**)&this->cpuBuffer, &this->pitch, this->resource,
if(this->pitch == this->dim1)
copyBitsAlignedPitch(this->cpuBuffer, buffer, this->dim1, this->dim2,
copyBitsUnalignedPitch(this->cpuBuffer, buffer, this->dim1, this->dim2,