chevydevil
Adept II

Uploading small chunks of data very slow on Radeon 7970

Hello.

I have two buffers created with CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR.

Then I upload chunks of 1, 2, or 4 KB of data in a while loop. The chunks can't be bigger for several reasons. In total there are up to 1024 chunks. I'm using clEnqueueWriteBuffer() with an offset to put the data in the right place. On a Radeon 7970 connected via PCIe 2.0 this takes 10 ms per chunk! The same code runs about 40 times faster on a GTX 580 in the same system, and even on a GTX 280M it is much faster. What is going wrong here? I'm using Ubuntu 12.04 with the latest beta driver and AMD APP 2.8. The upload code is attached below.

P.S.: The wrapper does nothing more than call a non-blocking clEnqueueWriteBuffer() with the offset.

void
TreeletMemoryManagerCl::updateDeviceMemory()
{
  // collect incore buffer ranges to upload
  gloost::bencl::ClBuffer* incoreClBuffer = _clContext->getClBuffer(_svoClBufferGid);
  gloost::bencl::ClDevice* device         = _clContext->getDevice(0);

  std::set<gloost::gloostId>::iterator slotGidIt    = _incoreSlotsToUpload.begin();
  std::set<gloost::gloostId>::iterator slotGidEndIt = _incoreSlotsToUpload.end();

  if (!_incoreSlotsToUpload.size())
  {
    return;
  }

  while (slotGidIt!=slotGidEndIt)
  {
    // svo data
    unsigned srcIndex         = (*slotGidIt)*_numNodesPerTreelet;
    unsigned destOffsetInByte = (*slotGidIt)*getTreeletSizeInByte();

    int status = incoreClBuffer->enqueueWrite( device->getClCommandQueue(),
                                                false,
                                                destOffsetInByte,
                                                getTreeletSizeInByte(),
                                                (const char*)&(_incoreBuffer[srcIndex]));

    // attrib data
    unsigned attribSrcIndex         = (*slotGidIt)*_numNodesPerTreelet*_incoreAttributeBuffer->getNumElementsPerPackage();
    unsigned attribDestOffsetInByte = (*slotGidIt)*_attributeBuffers[0]->getVector().size()*sizeof(float);

    gloost::bencl::ClBuffer* incoreAttributeClBuffer = _clContext->getClBuffer(_attributeClBufferGid);

    status = incoreAttributeClBuffer->enqueueWrite(device->getClCommandQueue(),
                                                   false,
                                                   attribDestOffsetInByte,
                                                   _attributeBuffers[0]->getVector().size()*sizeof(float),
                                                   (const char*)&(_incoreAttributeBuffer->getVector()[attribSrcIndex]));

    ++slotGidIt;
  }

  clFinish( device->getClCommandQueue() );

  _incoreSlotsToUpload.erase(_incoreSlotsToUpload.begin(), slotGidIt);
}
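
In the loop above, both the source index and the destination offset grow linearly with the slot id, so neighbouring slot ids appear to be adjacent in host and device memory. If merging neighbours is acceptable in a given setup (the post says the chunks themselves cannot be larger, so this may not apply), the number of enqueue calls could be reduced by coalescing consecutive slot ids into one write each. The following is only a sketch of that idea: WriteRange and coalesceSlots are hypothetical names, not part of gloost or OpenCL, and the slot ids are assumed to be plain unsigned integers.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical helper type: one contiguous write into the device buffer.
struct WriteRange
{
  std::size_t offsetInByte;  // destination offset in the device buffer
  std::size_t sizeInByte;    // number of bytes to transfer
};

// Merges runs of consecutive slot ids into single ranges.
// std::set iterates in ascending order, so one pass is enough.
std::vector<WriteRange> coalesceSlots(const std::set<unsigned>& slotGids,
                                      std::size_t slotSizeInByte)
{
  std::vector<WriteRange> ranges;
  bool     havePrevious = false;
  unsigned previousGid  = 0;

  for (unsigned gid : slotGids)
  {
    if (havePrevious && gid == previousGid + 1)
    {
      // slot directly follows the previous one: grow the last range
      ranges.back().sizeInByte += slotSizeInByte;
    }
    else
    {
      // gap (or first slot): start a new range at this slot's offset
      ranges.push_back({gid * slotSizeInByte, slotSizeInByte});
    }
    previousGid  = gid;
    havePrevious = true;
  }
  return ranges;
}
```

Each resulting range would then map to a single enqueueWrite() call, with the source pointer taken at the first slot of the range. Whether this helps at all depends on how often the slots to upload are actually neighbours.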

1 Solution
chevydevil
Adept II

I got it! I had to use CL_MEM_COPY_HOST_PTR in combination with CL_MEM_ALLOC_HOST_PTR.

I think this enabled pinned memory, and now it's lightning fast.


Refer to the AMD OpenCL Programming Guide, which describes which flags enable which behavior.