I've written a library that creates abstractions for buffers to reduce the complexity of using memory on the GPU in CAL. Specifically, this is done by creating a CALbuffer class whose constructor calls either calResAllocLocal1D or calResAllocLocal2D, depending on which dimensionality you specify. Once the constructor has been called, you call foo->readBuffer(memory) to copy data to that construct on the GPU.
readBuffer(void* buffer) is implemented by checking the resource's pitch against the x dimension of the buffer. If they are the same, it uses a single call to memcpy to move the data into the PCI zone; if the x dimension is not pitch-aligned, it makes several calls to memcpy.
void calutil::CALbuffer::readBuffer(void* buffer, CALuint flag)
{
    calResMap((CALvoid**)&this->cpuBuffer, &this->pitch, this->resource, flag);
    if(this->pitch == this->dim1)
    {
        copyBitsAlignedPitch(this->cpuBuffer, buffer, this->dim1, this->dim2,
                             this->pitch, this->elementSize);
    }
    else
    {
        copyBitsUnalignedPitch(this->cpuBuffer, buffer, this->dim1, this->dim2,
                               this->pitch, this->elementSize);
    }
    calResUnmap(this->resource);
}
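For reference, the two copy helpers above aren't shown, so here is one plausible way they could look (the real signatures in my library may differ; dimensions and pitch are in elements, elementSize in bytes):

```cpp
#include <cstring>
#include <cstddef>

// When pitch == dim1, rows are contiguous in the mapped resource,
// so one large memcpy moves the whole buffer.
static void copyBitsAlignedPitch(const void* src, void* dst,
                                 size_t dim1, size_t dim2,
                                 size_t /*pitch*/, size_t elementSize)
{
    std::memcpy(dst, src, dim1 * dim2 * elementSize);
}

// When pitch > dim1, each row of dim1 elements is padded out to pitch
// elements in the mapped resource, so the copy is done row by row.
static void copyBitsUnalignedPitch(const void* src, void* dst,
                                   size_t dim1, size_t dim2,
                                   size_t pitch, size_t elementSize)
{
    const char* s = static_cast<const char*>(src);
    char* d = static_cast<char*>(dst);
    for (size_t row = 0; row < dim2; ++row) {
        std::memcpy(d + row * dim1 * elementSize,
                    s + row * pitch * elementSize,
                    dim1 * elementSize);
    }
}
```

The row-by-row path is exactly where the extra memcpy calls come from when the x dimension isn't pitch-aligned.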
1) I've written high-level abstractions that use CAL API calls.
2) The performance bottleneck is the call to memcpy(). Basically, the high-level abstraction calls calResMap, followed by memcpy, followed by calResUnmap.
calResUnmap takes 1/4 the amount of time that the memcpy does, so I believe the bottleneck is the copy from userspace to PCI space. I guess what I wanted to know is how Brook+ attains high-performance data transfers. Does it have a custom memcpy function?
Thanks for your replies. So, I'm trying to make a generalized abstraction for handling data on the GPU, and I'm trying to make it efficient. Unfortunately, in the application this library targets there is a huge amount of data to be transferred, so I can't really use remote memory, since that is limited to around 16MB or so on my platform.
You can do a tile-by-tile copy using a copy shader from a remote cacheable resource to a GPU local resource. Take a look inside the Brook+ source code to see how it is done.
Would another solution be to use the calCreateRes*D() extension functions to create a resource mapped to the buffer where the user actually wants their data to appear, and then call calMemCopy() to move the data from the GPU local buffer to that buffer (with the alignment and size assumptions that calCreateRes*D requires)? Doing this should, in theory, place data in the buffer they want and avoid calling memcpy entirely.
But you still face the same problem of not being able to allocate a big pinned resource. IIRC, the amount of pinned memory available is also about 16 MB.
Yes, I recalled that about halfway through implementing it as the intermediary between the CPU and GPU. It was mentioned that you all were working with the drivers team to raise that limit. Is that coming anytime soon, or still a ways off? Either way, there may be a way around this. I looked at how Brook+ implements stream writes and saw that it uses a shader to do the copying. So, I could map to a buffer and run a shader to copy part of the local stream, unmap it, map it again to the same buffer plus an offset, and run the shader over the rest of the buffer.
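The loop structure for that chunked transfer could look roughly like this. This is only a sketch of the offset arithmetic: the map / copy-shader / unmap steps are stood in for by a plain memcpy, and the names (stageChunk, transferLarge, kStageBytes) are hypothetical, not part of my library or of CAL:

```cpp
#include <cstring>
#include <cstddef>
#include <algorithm>

// Fixed-size staging window, standing in for the ~16 MB pinned-resource limit.
static const size_t kStageBytes = 16u * 1024u * 1024u;

// In the real code this step would map the resource at the given offset,
// run the copy shader over the chunk, and unmap again. Here it is a memcpy
// so the chunking logic can be exercised on the host.
static void stageChunk(char* dst, const char* src, size_t bytes)
{
    std::memcpy(dst, src, bytes);
}

// Walk the large buffer in window-sized chunks, advancing the offset
// until everything has been transferred.
static void transferLarge(char* dst, const char* src, size_t totalBytes)
{
    for (size_t offset = 0; offset < totalBytes; offset += kStageBytes) {
        size_t chunk = std::min(kStageBytes, totalBytes - offset);
        stageChunk(dst + offset, src + offset, chunk);
    }
}
```

The last iteration handles the tail that doesn't fill a whole window, which is the only fiddly part of remapping "the same buffer plus an offset".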