Poor PCI-X utilization in CAL memory transfers

Discussion created by rick.weber on Jan 20, 2009
Latest reply on Jan 22, 2009 by rick.weber

I've written a library that wraps GPU buffers in an abstraction to reduce the complexity of using GPU memory in CAL. Specifically, this is done by a CALbuffer class whose constructor calls either calResAllocLocal1D or calResAllocLocal2D, depending on which dimensionality you specify. Once the buffer has been constructed, you call foo->readBuffer(memory) to copy data from host memory into that buffer on the GPU.

readBuffer(void* buffer) is implemented by mapping the resource and comparing the pitch that the mapping reports with the x dimension of the buffer. If they are the same, it essentially uses a single call to memcpy to move the data into the mapped PCI region. If the x dimension does not match the pitch, it makes several calls to memcpy.

 

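void calutil::CALbuffer::readBuffer(void* buffer, CALuint flag)
{
  // Map the GPU resource into host address space; this also yields the pitch.
  calResMap((CALvoid**)&this->cpuBuffer, &this->pitch, this->resource,
    flag);
  if(this->pitch == this->dim1)
  {
    // Pitch matches the x dimension: a single memcpy moves everything.
    copyBitsAlignedPitch(this->cpuBuffer, buffer, this->dim1, this->dim2,
      this->pitch, this->elementSize);
  }
  else
  {
    // Pitch differs from the x dimension: several memcpy calls are needed.
    copyBitsUnalignedPitch(this->cpuBuffer, buffer, this->dim1, this->dim2,
      this->pitch, this->elementSize);
  }
  calResUnmap(this->resource);
}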

So, to get to the point, this method yields terrible throughput (~100 MB/s). Is there something I can do to speed up this transfer? Anecdotally, that is about half the throughput I get when transferring data in Brook+. In my case, there are 32 buffers of size 512x8192 being transferred. To improve performance, do I need to map several buffers (say, 8) simultaneously?
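To put the numbers above in perspective, here is a small sketch of the arithmetic. The element size is not stated in this post, so the 4 bytes per element used in the example (a single float) is an assumption, and both function names are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Total payload in bytes for numBuffers equally sized 2D buffers.
// width and height are in elements; elementSize is in bytes.
std::uint64_t totalBytes(std::uint64_t numBuffers, std::uint64_t width,
                         std::uint64_t height, std::uint64_t elementSize)
{
  return numBuffers * width * height * elementSize;
}

// Time in seconds to move `bytes` at a sustained rate of bytesPerSecond.
double transferSeconds(std::uint64_t bytes, double bytesPerSecond)
{
  return static_cast<double>(bytes) / bytesPerSecond;
}
```

Under that assumption, totalBytes(32, 512, 8192, 4) is 536,870,912 bytes (512 MiB), and transferSeconds of that amount at 100e6 bytes per second comes to roughly 5.4 seconds for the full set of buffers.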

