
frankas
Journeyman III

Deadlock? (hang) when reading from pinned memory

I am trying to improve the performance of a currently working stream application by moving to pinned-memory streams. But after a short while, the thread that handles Brook calls hangs forever in a mutex lock, like this:

Thread 2 (Thread 0xb7ab6b90 (LWP 21941)):
#0  0xb7f4642e in __kernel_vsyscall ()
#1  0xb7f22cf9 in __lll_lock_wait () from /lib/tls/i686/cmov/libpthread.so.0
#2  0xb7f1e129 in _L_lock_89 () from /lib/tls/i686/cmov/libpthread.so.0
#3  0xb7f1da32 in pthread_mutex_lock () from /lib/tls/i686/cmov/libpthread.so.0
#4  0xb7268d2b in brook::ThreadLock::lock () from /usr/lib/libbrook.so
#5  0xb72a80c6 in CALBuffer::initializePinnedBuffer () from /usr/lib/libbrook_cal.so
#6  0xb729ac64 in CALBufferMgr::_createPinnedBuffer () from /usr/lib/libbrook_cal.so
#7  0xb729bf07 in CALBufferMgr::setBufferData () from /usr/lib/libbrook_cal.so
#8  0xb725a093 in StreamImpl::read () from /usr/lib/libbrook.so
#9  0xb7c0b20c in brook::StreamData::read () from /usr/lib/libbrook_d.so
#10 0xb7c5dce9 in brook::Stream<uint4>::read (this=0x9e43960, ptr=0x9e54900, flags=0xb7c71c99 "nocopy")
    at /usr/local/atibrook/sdk/include/brook/StreamDef.h:160
#11 0xb7c5b49c in A5Slice::tick (this=0x9b223c8) at A5Slice.cpp:366
#12 0xb7c4b5c2 in BrookA5::process (this=0x9b25870) at A5Brook.cpp:139
#13 0xb7c4b637 in BrookA5::thread_stub (arg=0x9b25870) at A5Brook.cpp:52
#14 0xb7f1c4ff in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
#15 0xb7e5249e in clone () from /lib/tls/i686/cmov/libc.so.6

When this first happened, I was issuing 18 async read calls. I tried serializing the read operations with isSync calls, but the result is the same. It also does not appear to be a general race condition, as the hang occurs after the exact same number of kernel invocations.

Since this behaviour is highly reproducible, I managed to set a breakpoint in pthread_mutex_lock just prior to the read call that I know will fail (trying to see who else takes the lock). What I observe instead is a large number of buffer destructors being called, like this:

#11 0xb7303da7 in calResFree () from /usr/lib/libaticalrt.so
#12 0xb7344c01 in CALBuffer::~CALBuffer () from /usr/lib/libbrook_cal.so
#13 0xb7337c21 in CALBufferMgr::_createPinnedBuffer () from /usr/lib/libbrook_cal.so
#14 0xb7338f07 in CALBufferMgr::setBufferData () from /usr/lib/libbrook_cal.so
#15 0xb7c97093 in StreamImpl::read () from /usr/lib/libbrook.so
#16 0xb7ca820c in brook::StreamData::read () from /usr/lib/libbrook.so
#17 0xb7cfa929 in brook::Stream<uint4>::read (this=0x9349368, ptr=0x935a300, flags=0xb7d0e8d8 "nocopy")
    at /usr/local/atibrook/sdk/include/brook/StreamDef.h:160
#18 0xb7cf819e in A5Slice::tick (this=0x90283c8) at A5Slice.cpp:369

This seems to indicate that the pinned buffers accumulate in GPU memory and are only occasionally flushed. When this flushing occurs, someone forgets to release the mutex, and the next create call hangs indefinitely.

Where can I find the libbrook sources? I tried installing 1.4.1, but it fails to build on Ubuntu (one of the legacy samples has a dependency on an old libpthread). The shared library, however, is the same as that found in 1.4.0 (I checked the md5 sums).

Frank

0 Likes
18 Replies
gaurav_garg
Adept I

Yes, this was a bug. It was reproducible when multiple streamRead or streamWrite calls with the "nocopy" option cause a pinned-memory allocation failure, which in turn hits one particular code path where the pthread lock is not released.

I have just checked the fix in to the SourceForge version. The fix is in the CALBuffer.cpp file; you only need to rebuild the CALRuntime library.

0 Likes

Thanks for the prompt reply and fix. I managed to build a version that doesn't exhibit this hang, but sundry other problems appear. Flushing the data buffers takes quite a long time and should, in my opinion, not be necessary. I am constantly reusing the same 20 stream objects, and would expect memory to be reused on the GPU side as well. Instead it runs out, causing noticeable lag spikes and degrading performance to a lower level than what I get with non-pinned buffers.

In addition, it looks like the flush may have corrupted buffers that were in the process of being read (not 100% sure about this, though).

Frank

0 Likes

One easy solution is to edit CALBufferMgr.cpp: in CALBufferMgr::_createPinnedBuffer, comment out line 1161, if(!tmpBuffer->initializePinnedBuffer(cpuPtr, funcPtr)), and rebuild CALRuntime.

0 Likes

Originally posted by: gaurav.garg One easy solution is to edit CALBufferMgr.cpp: in CALBufferMgr::_createPinnedBuffer, comment out line 1161, if(!tmpBuffer->initializePinnedBuffer(cpuPtr, funcPtr)), and rebuild CALRuntime.

I am not convinced that this is the answer. It seems that _createPinnedBuffer() tries to maintain a cache, _pinnedBufferCache, of available buffers. But whereas CALBufferMgr::_createHostBuffer() and CALBufferMgr::_createPCIeHostBuffer() actually (re)use buffers from the cache, _createPinnedBuffer() never looks for available space in the cached list of buffers, and always ends up creating a new GPU resource, even when one is available. I think what is needed is a simple cache lookup that returns and reuses matching buffers.

0 Likes

The Brook+ runtime cannot reuse pinned cached buffers, because these buffers are specific to the pinned pointer passed to streamRead. So every time you need a pinned buffer, it has to be created for that specific host pointer.

0 Likes

Originally posted by: gaurav.garg The Brook+ runtime cannot reuse pinned cached buffers, because these buffers are specific to the pinned pointer passed to streamRead. So every time you need a pinned buffer, it has to be created for that specific host pointer.

I have pretty well arrived at the conclusion that pinned streams don't really work at all in Brook+. I believe there may be another bug here: when the cache is flushed, it deletes all buffers, regardless of whether they are in use or not. I suspect this is the cause of the data corruption that I see.

One solution for me would be to use CAL directly, but I keep thinking that part of the problem in Brook+ is that the buffers aren't explicitly tied to streams. Instead, each time you read into a buffer, a (slow) sequential search is performed to locate a suitable temporary buffer (a binary search on a sorted cache would be faster). But if the GPU buffer were persistently tied to the lifetime of the stream object, none of this would be needed. You would have to manage your streams more carefully to avoid unnecessarily gobbling GPU memory.

Without such a scheme, pinned streams simply don't deliver the promised speed advantages.

regards, Frank

0 Likes

These pinned buffers are not tied to streams; they are just temporary buffers on the host side for data transfer between the host and the GPU.

The buffers that are associated with streams are the local buffers created in BufferMgr::getBuffer(), and those are tied to the Stream's lifetime.

0 Likes

Originally posted by: gaurav.garg These pinned buffers are not tied to streams; they are just temporary buffers on the host side for data transfer between the host and the GPU.

The buffers that are associated with streams are the local buffers created in BufferMgr::getBuffer(), and those are tied to the Stream's lifetime.

I don't suppose this is possible using just the ordinary Brook+ API? If it is, could you please give a code example showing how to create a pinned Stream object with a persistent host buffer?

Frank

0 Likes

Brook+ doesn't expose host buffers the way CAL does. Using the "nocopy" option in Brook+ means that pinned memory is used for data transfer between host and GPU, which is fast because it avoids an unnecessary memcpy between the host pointer and a CAL host buffer.

0 Likes

Originally posted by: gaurav.garg The Brook+ runtime cannot reuse pinned cached buffers, because these buffers are specific to the pinned pointer passed to streamRead. So every time you need a pinned buffer, it has to be created for that specific host pointer.

This is not what I am seeing. I call streamRead with the same source pointers every time, but new temporary buffers are still created, and memory is eventually exhausted, causing lag spikes, buffer corruption, and hangs.

0 Likes

Originally posted by: gaurav.garg The Brook+ runtime cannot reuse pinned cached buffers, because these buffers are specific to the pinned pointer passed to streamRead. So every time you need a pinned buffer, it has to be created for that specific host pointer.

Sorry for harping on about this, but I simply can't find any line of code that tries to match the buffers' source pointers to those in the cache. I have tried adding some code to do this myself, but it hasn't worked very well...

I have attached a patch that 1) speeds up the string matching and 2) tries to match pinned buffers with existing ones in the cache. NB: this patch is still work in progress; it is only included here to illustrate what I believe to be missing:

Index: runtime/CAL/Managers/CALBufferMgr.cpp
===================================================================
--- runtime/CAL/Managers/CALBufferMgr.cpp (revision 340)
+++ runtime/CAL/Managers/CALBufferMgr.cpp (working copy)
@@ -41,6 +41,7 @@
 #include "CALDevice.h"
 #include "StreamImpl.h"
 #include "Utility.h"
+#include <string.h>
 
 ////////////////////////////////////////////////////////////////////////////////
 //!
@@ -598,9 +599,7 @@
     bool nocopy = false;
     if(flags)
     {
-        std::string flagString(flags);
-        size_t found = flagString.find("nocopy");
-        if(found != std::string::npos)
+        if(strstr(flags, "nocopy"))
         {
             nocopy = true;
         }
@@ -1157,18 +1156,54 @@
 {
     CALBuffer* tmpBuffer = new CALBuffer(rank, dimensions, format, BUFFER_HOST, 0, _device);
 
+    // Look in the cache for a free resource with the same size and source
+    std::map<const void*, CALBuffer*>::iterator it = _pinnedBufferCache.find(cpuPtr);
+
+    if(it != _pinnedBufferCache.end())
+    {
+        // Are size and format the same?
+        if(*(it->second) == *tmpBuffer)
+        {
+            // If the size matches, check whether it is being used for another data transfer
+            CALBuffer* sameSizedBuffer = it->second;
+            CALevent copyEvent = sameSizedBuffer->getCopyEvent();
+            CALevent inputEvent = sameSizedBuffer->getInputEvent();
+            CALevent outputEvent = sameSizedBuffer->getOutputEvent();
+            unsigned int id = sameSizedBuffer->getWriteThreadID();
+
+            if(copyEvent == 0 && inputEvent == 0 && outputEvent == 0 && id == 0)
+            {
+                delete tmpBuffer;
+                return sameSizedBuffer;
+            }
+        }
+        else
+        {
+            /* Buffer has changed, and must be recreated */
+            CALBuffer* buffer = it->second;
+            buffer->waitCopyEvent();
+            buffer->waitWriteEvent();
+
+            delete buffer;
+            _pinnedBufferCache.erase(it);
+        }
+    }
+
+    // Create if no free buffer is in the cache, and push it into the cache
     if(!tmpBuffer->initializePinnedBuffer(cpuPtr, funcPtr))
     {
         // If there is no same sized buffer, wait for all the events
         // associated to host resources to finish and delete them.
-        for(unsigned int i = 0; i < _pinnedBufferCache.size(); ++i)
+        it = _pinnedBufferCache.begin();
+        while(it != _pinnedBufferCache.end())
         {
-            CALBuffer* buffer = _pinnedBufferCache[i];
+            /* BUG: How do we know that this buffer isn't finished
+               with its copy, but in use and awaiting kernel invocation? */
+            CALBuffer* buffer = it->second;
             buffer->waitCopyEvent();
             buffer->waitWriteEvent();
 
             delete buffer;
+            ++it;
         }
 
         _pinnedBufferCache.clear();
@@ -1183,7 +1218,7 @@
     }
 
     // Keep it in a cache after creation
-    _pinnedBufferCache.push_back(tmpBuffer);
+    _pinnedBufferCache[cpuPtr] = tmpBuffer;
 
     return tmpBuffer;
 }
@@ -1537,9 +1572,11 @@
 {
     // Destroy all the host side, pinned and constant buffers available in cache
-    for(unsigned int i = 0; i < _pinnedBufferCache.size(); ++i)
+    std::map<const void*, CALBuffer*>::iterator it = _pinnedBufferCache.begin();
+    while(it != _pinnedBufferCache.end())
     {
-        delete _pinnedBufferCache[i];
+        delete it->second;
+        ++it;
     }
 
     for(unsigned int i = 0; i < _hostBufferCache.size(); ++i)

Index: runtime/CAL/Managers/CALBufferMgr.h
===================================================================
--- runtime/CAL/Managers/CALBufferMgr.h (revision 340)
+++ runtime/CAL/Managers/CALBufferMgr.h (working copy)
@@ -164,7 +164,7 @@
     std::vector<CALBuffer*> _hostPCIeBufferCache;
 
     //! \brief Cache for host memory data management.
-    std::vector<CALBuffer*> _pinnedBufferCache;
+    std::map<const void*, CALBuffer*> _pinnedBufferCache;
 
     //! \brief Cache containing constant buffers
     std::vector<CALConstBuffer*> _constBufferCache;

0 Likes

I don't quite understand what you're referring to by "moving to pinned memory streams".

As far as I know, Brook+ does not allow creation of pinned streams, just pinned stream reads/writes. In that case, there is no intermediate buffer used for copying between your host pointer and the CAL resource; it's a direct copy between your C pointer and the CAL resource.

I call stream read with the same source pointers every time, but still new temporary buffers are created.

That would be because the read/write is not actually pinned, meaning the C pointer you pass to the read/write does not conform to the CAL memory requirements for pinned memory (surface and pitch alignment).



0 Likes

Originally posted by: youplaboom

I don't quite understand what you're referring to by "moving to pinned memory streams".

As far as I know, Brook+ does not allow creation of pinned streams, just pinned stream reads/writes. In that case, there is no intermediate buffer used for copying between your host pointer and the CAL resource; it's a direct copy between your C pointer and the CAL resource.

I call stream read with the same source pointers every time, but still new temporary buffers are created.

That would be because the read/write is not actually pinned, meaning the C pointer you pass to the read/write does not conform to the CAL memory requirements for pinned memory (surface and pitch alignment).

"Moving to" is just a figure of speech. My application uses regular streams, and I wanted to speed up copying from main memory to GPU memory by using the "nocopy" flag, as described in the "Stream Computing User Guide", section 2.14.

But if you study what happens here:

http://brookplus.svn.sourceforge.net/viewvc/brookplus/platform/runtime/CAL/Managers/CALBufferMgr.cpp?revision=334&view=markup

Lines 1154-1189: you will see that a new CAL resource is created for every transfer, so memory is quickly exhausted, causing a complete flush, which is actually time consuming. The net effect is that "nocopy" copying is most likely slower than an ordinary read().

My patch illustrates how this can be solved: keep the temporary buffers in a map and reuse them when the size and cpuPtr are the same. This eliminates the need to regularly delete a large number of buffers, speeding things up.

You may have a point about the pitch; a DMA transfer would copy memory raw and perhaps miss out on subtleties arising from pitch and tiling. But as for my pointers not conforming to the pinning requirements, I beg to differ: the call stack in my original post shows that they do, otherwise I wouldn't have encountered these problems 🙂

F

0 Likes

Without knowing exactly how it is implemented in Brook+, I would assume that a stream read() or write() with the "pinned" flag would:

1/ create a CAL resource using calResCreate1D/2D()

This resource is pinned to the C pointer provided, so no temporary memory is allocated; it assumes memory is allocated by the host program, with alignment and pitch considerations.

2/ execute a DMA transfer from/to this temporary resource to/from the GPU resource bound to the stream (I assume Brook+ streams are always allocated on the GPU?)

3/ delete the temporary pinned resource.

From my tests using CAL, the overhead of creating/deleting pinned CAL resources is minimal.

0 Likes

Originally posted by: youplaboom Without knowing exactly how it is implemented in brook+, I would assume...

If you had taken the time to read my previous posts, you would see that your assumptions are wrong and that pinned streams just don't work as advertised.

Not much help in this forum; I might as well keep my patches to myself. So rather than trying to fix Brook+, I will use the CAL APIs instead. Thanks for not taking the time to read my posts.

Your assumptions fail at 3): the temp stream is not deleted, and a new "temp" stream is created for every read(.., "nocopy") that you make. Eventually GPU memory runs out, at which point your application hangs. gaurav.garg has been very helpful in fixing the hang, but the problem remains that temp buffers are not reused as they should be. I am sorry to sound like a broken record on this matter, but that's what happens when people post without properly reading the previous thread.

Frank

0 Likes

Your assumptions fail at 3): the temp stream is not deleted, and a new "temp" stream is created for every read(.., "nocopy") that you make. Eventually GPU memory runs out, at which point your application hangs. gaurav.garg has been very helpful in fixing the hang, but the problem remains that temp buffers are not reused as they should be.


What Brook+ does right now is: if pinned memory is exhausted, it deletes the previously allocated buffers and tries the allocation again.

Now, if you see the application hang when memory is exhausted, that is a CAL issue. One way to resolve this is to just delete the temporary stream, as I suggested in my previous post. You can also implement caching and reuse pinned memory (keyed on the pointer and the CAL buffer size), but this approach too would exhaust pinned memory if multiple streamReads are called with different host pointers.

0 Likes

Originally posted by: gaurav.garg
Your assumptions fail at 3): the temp stream is not deleted, and a new "temp" stream is created for every read(.., "nocopy") that you make. Eventually GPU memory runs out, at which point your application hangs. gaurav.garg has been very helpful in fixing the hang, but the problem remains that temp buffers are not reused as they should be.


What Brook+ does right now is: if pinned memory is exhausted, it deletes the previously allocated buffers and tries the allocation again.

Now, if you see the application hang when memory is exhausted, that is a CAL issue. One way to resolve this is to just delete the temporary stream, as I suggested in my previous post. You can also implement caching and reuse pinned memory (keyed on the pointer and the CAL buffer size), but this approach too would exhaust pinned memory if multiple streamReads are called with different host pointers.

My patch above implements such an approach; it will in any case give better performance than the current SVN code. Unfortunately, I must be having build problems of some sort. Do I have to recompile my Brook+ sources when switching from the 1.4.0 distribution to SVN head? (I am getting incorrect results when using the Brook+ library that I built myself, so I can see that my patch speeds things up, but I am not able to verify correctness.)

What is the proper procedure for submitting patches to Brook+?

F

0 Likes

You can file a bug and attach your patch file to it. You only need to build CALRuntime.

0 Likes