10 Replies Latest reply on Jan 22, 2009 3:27 AM by rick.weber

    Poor PCI-X utilization in CAL memory transfers

    rick.weber

      I've written a library that provides buffer abstractions to reduce the complexity of using GPU memory in CAL. Specifically, this is done by a CALbuffer class whose constructor calls either calResAllocLocal1D or calResAllocLocal2D, depending on which dimensionality you specify. Once the buffer has been constructed, you call foo->readBuffer(memory) to copy data into it on the GPU.

      readBuffer(void* buffer) is implemented by checking pitch alignment versus the x dimension of the buffer. If they are the same, it essentially uses a single call to memcpy to move data into the PCI zone. If the x dimension is not pitch aligned, it makes several calls to memcpy.

       

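      void calutil::CALbuffer::readBuffer(void* buffer, CALuint flag)
      {
        // Map the GPU resource into host address space; CAL reports the pitch.
        calResMap((CALvoid**)&this->cpuBuffer, &this->pitch, this->resource,
          flag);

        if(this->pitch == this->dim1)
        {
          // Rows are contiguous: a single memcpy covers the whole buffer.
          copyBitsAlignedPitch(this->cpuBuffer, buffer, this->dim1, this->dim2,
            this->pitch, this->elementSize);
        }
        else
        {
          // Rows are padded out to the pitch: one memcpy per row.
          copyBitsUnalignedPitch(this->cpuBuffer, buffer, this->dim1, this->dim2,
            this->pitch, this->elementSize);
        }

        // Unmap so the GPU can use the resource again.
        calResUnmap(this->resource);
      }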
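
      The two copy helpers are not shown here, so for anyone following along, this is a minimal sketch of what a pitch-aware copy along those lines might look like. The name copyRowsWithPitch is hypothetical (it is not part of the library or of CAL), and it assumes that width, height and pitch are counted in elements while elementSize is in bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper for illustration; not the library's copyBits* code.
// Assumption: width, height and pitch are in elements, elementSize in bytes.
inline void copyRowsWithPitch(void* dst, const void* src,
                              std::size_t width, std::size_t height,
                              std::size_t pitch, std::size_t elementSize)
{
    assert(pitch >= width);  // a row can never be wider than its pitch

    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    const std::size_t rowBytes = width * elementSize;
    const std::size_t pitchBytes = pitch * elementSize;

    if (pitch == width) {
        // No padding between rows: one large copy moves everything.
        std::memcpy(d, s, rowBytes * height);
        return;
    }

    // Destination rows are padded: copy each tightly packed source row
    // to the start of its padded destination row.
    for (std::size_t y = 0; y < height; ++y) {
        std::memcpy(d + y * pitchBytes, s + y * rowBytes, rowBytes);
    }
}
```

      In the padded case the bytes between the end of a row and the next pitch boundary are left untouched, so the number of memcpy calls equals the buffer height.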

      To get to the point: this method yields terrible throughput (~100MB/s). Is there something I can do to speed up the transfer? Anecdotally, this is about half the throughput I see when transferring data in Brook+. In my case, 32 buffers of size 512x8192 are being transferred. To improve performance, do I need to map 8 buffers simultaneously?
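
      To put those numbers in perspective, here is a back-of-the-envelope sketch. It assumes 4-byte elements (the element size is not stated above) and takes 1 MB as 10^6 bytes; the function names are made up for the illustration.

```cpp
// Total bytes moved for `count` 2D buffers of width x height elements.
// Assumption: elementSize is in bytes (4 for a single float).
constexpr unsigned long long totalBytes(unsigned long long count,
                                        unsigned long long width,
                                        unsigned long long height,
                                        unsigned long long elementSize)
{
    return count * width * height * elementSize;
}

// Seconds needed to move `bytes` at a sustained rate of bytesPerSecond.
constexpr double transferSeconds(unsigned long long bytes,
                                 double bytesPerSecond)
{
    return static_cast<double>(bytes) / bytesPerSecond;
}
```

      Under those assumptions, totalBytes(32, 512, 8192, 4) comes to 536,870,912 bytes (512 MiB), and transferSeconds of that amount at 100e6 bytes per second is roughly 5.4 seconds for the full set of buffers.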