Archives Discussions

rick_weber · ‎01-20-2009

I've written a library that creates abstractions for buffers to reduce the complexity of using memory on the GPU in CAL. Specifically, this is done by creating a CALbuffer class whose constructor calls either calResAllocLocal1D or calResAllocLocal2D, depending on which dimensionality you specify. Once the constructor has been called, you call foo->readBuffer(memory) to copy data to that construct on the GPU.

readBuffer(void* buffer) is implemented by checking pitch alignment versus the x dimension of the buffer. If they are the same, it essentially uses a single call to memcpy to move data into the PCI zone. If the x dimension is not pitch aligned, it makes several calls to memcpy.

void calutil::CALbuffer::readBuffer(void* buffer, CALuint flag)
{
    calResMap((CALvoid**)&this->cpuBuffer, &this->pitch, this->resource,
              flag);
    if(this->pitch == this->dim1)
    {
        copyBitsAlignedPitch(this->cpuBuffer, buffer, this->dim1, this->dim2,
                             this->pitch, this->elementSize);
    }
    else
    {
        copyBitsUnalignedPitch(this->cpuBuffer, buffer, this->dim1, this->dim2,
                               this->pitch, this->elementSize);
    }
    calResUnmap(this->resource);
}
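The two copy helpers called above are not shown in the post. As a rough sketch only, they might look something like the following, assuming that the width and pitch are both counted in elements (as the pitch == dim1 comparison suggests) and that the source buffer is tightly packed. The parameter names are hypothetical; only the function names come from the post.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical reconstruction of the helpers named in the post.
// Assumption: width and pitch are measured in elements, not bytes.

// Pitch equals width, so rows are contiguous in both buffers and a
// single memcpy covers the whole surface.
void copyBitsAlignedPitch(void* dst, const void* src, std::size_t width,
                          std::size_t height, std::size_t /*pitch*/,
                          std::size_t elementSize)
{
    std::memcpy(dst, src, width * height * elementSize);
}

// Destination rows are padded out to `pitch` elements, so each row of the
// tightly packed source is copied separately to the start of a padded row.
void copyBitsUnalignedPitch(void* dst, const void* src, std::size_t width,
                            std::size_t height, std::size_t pitch,
                            std::size_t elementSize)
{
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    const std::size_t rowBytes = width * elementSize;
    const std::size_t pitchBytes = pitch * elementSize;
    for (std::size_t row = 0; row < height; ++row)
        std::memcpy(d + row * pitchBytes, s + row * rowBytes, rowBytes);
}
```

In the unaligned case the number of memcpy calls equals the number of rows, so a buffer with many short rows pays far more per-call overhead than one with a few long rows.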

So, to get to the point, this method yields terrible throughput (~100MB/s). Is there something I can do to speed up this transfer? The performance is anecdotally about half of transferring data in Brook+. In my case, there are 32 buffers of size 512x8192 being transferred. To improve performance, do I need to map 8 buffers simultaneously?

MicahVillmow · ‎01-20-2009

Two questions:
1) Is this done at the CAL level or the Brook level?
2) Is the performance bottleneck between user space and PCI space, or is it between PCI space and GPU space?

rick_weber · ‎01-21-2009

1) I've written high level abstractions that use CAL API calls.

2) The performance bottleneck is the call to memcpy(). Basically, the high level abstraction calls calResMap followed by memcpy followed by calResUnmap.

calResUnmap takes about a quarter of the time that the memcpy does, so I think the bottleneck is user space to PCI space. I guess what I wanted to know is how Brook+ attains high-performance data transfers. Does it have a custom memcpy function?

rahulgarg · ‎01-21-2009

Try another approach. Let's say you have a local resource L. Allocate a CAL remote resource R with the CPU cacheable flag, map R to the CPU (which is an instant operation), memcpy to R, unmap R (also an instant operation), then use calMemCopy to copy R to L.

rahulgarg · ‎01-21-2009

Also, memcpy speed depends on your OS, C library, etc. Check the speed of memcpy by first doing a memcpy between two regular C arrays.
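A minimal sketch of such a baseline measurement follows. The function name measureMemcpyMBps and the repetition count are assumptions for illustration, not anything from CAL or Brook+. It times memcpy between two ordinary heap buffers so the result can be compared against the ~100MB/s seen when copying into the mapped resource.

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: returns host-to-host memcpy throughput in MB/s
// (1 MB = 1e6 bytes) for `repetitions` copies of `bytes` bytes each.
double measureMemcpyMBps(std::size_t bytes, int repetitions)
{
    std::vector<unsigned char> src(bytes, 1), dst(bytes, 0);
    std::memcpy(dst.data(), src.data(), bytes);  // warm-up: touch every page

    volatile unsigned char sink = 0;  // keeps the copies observable
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repetitions; ++i) {
        src[0] = static_cast<unsigned char>(i);  // vary the input each pass
        std::memcpy(dst.data(), src.data(), bytes);
        sink = dst[0];  // read back to discourage the compiler from eliding the copy
    }
    const auto stop = std::chrono::steady_clock::now();
    (void)sink;

    const double seconds = std::chrono::duration<double>(stop - start).count();
    if (seconds <= 0.0)
        return 0.0;  // timer too coarse for this workload
    return static_cast<double>(bytes) * repetitions / seconds / 1e6;
}
```

Calling it with a size close to one of the real buffers, for example measureMemcpyMBps(16u << 20, 8), gives a number to compare against. If plain host memory is also slow, the C library's memcpy is the likely culprit; if it is fast, the slowdown is more likely tied to writing into the mapped PCI region.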

MicahVillmow · ‎01-21-2009

rick,
In tests that we have done, the memcpy implementation and PCI chipset greatly affect the performance of memory transfers. We have seen some memcpy implementations run an order of magnitude slower than others. The only real solution to this is to use calUserRes and bypass that initial memcpy altogether. Depending on the data size and other factors, you can see roughly a 10% to 100% improvement in overall system performance.

rick_weber · ‎01-21-2009

Thanks for your replies. I'm trying to make a generalized abstraction for handling data on the GPU, and I'm trying to make it efficient. Unfortunately, in the application this library targets, there is a huge amount of data to be transferred, so I can't really use remote memory, since that is limited to around 16MB or so on my platform.

gaurav_garg · ‎01-21-2009

You can do a tile-by-tile copy using a copy shader from a remote cacheable resource to the GPU local resource. Take a look inside the Brook+ source code to see how it is done.
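The host side of a tile-by-tile copy can be sketched roughly as below. This is not the Brook+ implementation: copyThroughStaging and the UploadFn callback are hypothetical names, the staging vector stands in for a mapped remote resource of limited size, and the callback stands in for whatever moves each tile on to the GPU (a copy shader or calMemCopy).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <functional>
#include <vector>

// Hypothetical callback standing in for the GPU-side step. It receives the
// staged bytes, the byte offset of this tile in the destination, and its size.
using UploadFn = std::function<void(const unsigned char* staged,
                                    std::size_t offset, std::size_t bytes)>;

// Moves `totalBytes` from `src` through a fixed-size staging buffer, one
// tile at a time, so the staging area never needs to hold the whole payload.
void copyThroughStaging(const void* src, std::size_t totalBytes,
                        std::vector<unsigned char>& staging,
                        const UploadFn& upload)
{
    if (staging.empty())
        return;  // nothing to stage through; avoids an endless loop
    const auto* s = static_cast<const unsigned char*>(src);
    for (std::size_t offset = 0; offset < totalBytes; offset += staging.size()) {
        const std::size_t chunk = std::min(staging.size(), totalBytes - offset);
        std::memcpy(staging.data(), s + offset, chunk);  // CPU -> staging
        upload(staging.data(), offset, chunk);           // staging -> GPU (stand-in)
    }
}
```

With a staging buffer sized to the roughly 16MB limit mentioned in the thread, a larger payload would simply take more iterations. Using two staging buffers alternately could overlap the memcpy with the upload, but that is beyond this sketch.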

rick_weber · ‎01-21-2009

Would another solution be to use the calCreateRes*D() extension functions to create a resource mapped to the buffer where the user actually wants their data to appear, and then call calMemCopy() to move the data from the GPU local buffer to that buffer (with the alignment and size assumptions that calCreateRes*D requires)? In theory, this should place the data in the buffer they want and avoid calling memcpy entirely.

gaurav_garg · ‎01-22-2009

But you still face the same problem of not being able to allocate a big pinned resource. IIRC, the amount of pinned memory available is also about 16 MB.

rick_weber · ‎01-22-2009

Yes, I recalled that about halfway through implementing it as the intermediary between CPU and GPU. It was mentioned that you all were working with the driver team to raise that limit. Is that coming anytime soon, or is it still a ways off? Either way, there may be a way around this. I looked at how Brook+ implements stream writes and saw that it uses a shader to do the copying. So, I could map to a buffer and run a shader to copy part of the local stream, unmap it, map it again to the same buffer plus an offset, and run the shader over the rest of the buffer.

Poor PCI-X utilization in CAL memory transfers