It has been a while, but my problem still exists. My responses above weren't accurate because the remote session didn't use the GPU and only found the CPU: the classic headless problem. I can now access the GPU remotely, but my "stopping" problem is back. I believe a deadlock occurs when a memory object is released in the scenario where multiple command queues are used by multiple threads. Here is part of my debug log, taken when execution stops:
[debug]#0 0x00007ffff582d420 in sem_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
[debug]#1 0x00007fffef1f9ba0 in amd::Semaphore::wait() () from /usr/lib/libamdocl64.so
[debug]#2 0x00007fffef1f6162 in amd::Monitor::finishLock() () from /usr/lib/libamdocl64.so
[debug]#3 0x00007fffef21f6fc in gpu::Device::ScopedLockVgpus::ScopedLockVgpus(gpu::Device const&) () from /usr/lib/libamdocl64.so
[debug]#4 0x00007fffef242c3e in gpu::Resource::free() () from /usr/lib/libamdocl64.so
[debug]#5 0x00007fffef243207 in gpu::Resource::~Resource() () from /usr/lib/libamdocl64.so
[debug]#6 0x00007fffef22fd3d in gpu::Memory::~Memory() () from /usr/lib/libamdocl64.so
[debug]#7 0x00007fffef23123f in gpu::Buffer::~Buffer() () from /usr/lib/libamdocl64.so
[debug]#8 0x00007fffef1e8998 in amd::Memory::~Memory() () from /usr/lib/libamdocl64.so
[debug]#9 0x00007fffef1e9607 in amd::Buffer::~Buffer() () from /usr/lib/libamdocl64.so
[debug]#10 0x00007fffef1f41eb in amd::ReferenceCountedObject::release() () from /usr/lib/libamdocl64.so
[debug]#11 0x00007fffef1c5a37 in clReleaseMemObject () from /usr/lib/libamdocl64.so
I will finally try to reproduce this by focusing on threaded allocation and release of memory in a minimal example. Hopefully this leads somewhere. It would be nice to solve this so I can convince my boss to buy some 7990 cards for our computing.
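A minimal sketch of what such a threaded allocate/release stress test could look like. This is not the poster's code: `stress_alloc_release` and its parameters are hypothetical names, and the allocation and release steps are passed in as callables so the harness itself does not depend on any OpenCL headers. In a real repro one would presumably have `alloc` wrap `clCreateBuffer` and `release` wrap `clReleaseMemObject`, with one command queue per thread.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stress harness: every worker thread repeatedly allocates a
// handle and releases it again through the caller-supplied callables.
// Returns the total number of completed allocate/release pairs.
// If the runtime deadlocks inside `release`, join() below never returns,
// which is the hang described in this thread.
template <typename Handle>
std::size_t stress_alloc_release(unsigned num_threads,
                                 unsigned iterations,
                                 std::function<Handle()> alloc,
                                 std::function<void(Handle)> release) {
    assert(alloc && release);  // both callables must be set

    std::atomic<std::size_t> completed{0};
    std::vector<std::thread> workers;
    workers.reserve(num_threads);

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (unsigned i = 0; i < iterations; ++i) {
                Handle h = alloc();   // assumed: wraps clCreateBuffer
                release(h);           // assumed: wraps clReleaseMemObject
                completed.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }

    for (auto& w : workers) {
        w.join();
    }
    return completed.load();
}
```

Called as, say, `stress_alloc_release<cl_mem>(10, 100000, make_buffer, free_buffer)` with two hypothetical wrappers, a run that never returns would point at the release path, while a run that completes would suggest the hang needs more than allocation and release alone.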
Thanks for the update. We look forward to your test case.
I would suggest going through Appendix A of the OpenCL programming guide for some guidance.
Did you get any further with this? I'm pretty sure I have the same problem (I don't think it's happening on CPU)
10 queues (one per thread)
20 kernels (one cl_kernel is instanced every use for each thread so they're not shared)
I'm blocking all writes and reads, and blocking my executions with clWaitForEvents immediately after clEnqueueNDRangeKernel.
(Things seem to hang much earlier if I don't block everything, but I'm not sure yet if it's the same issue)
The faster my code works, and the more work I throw at it, the quicker it hangs. (more memory object allocation/deallocation)
Whenever it stops (just as described above) one thread ALWAYS just happens to be releasing a memory object (the others are usually reading/writing)
I understand the object release is threadsafe... (I'm doing it VERY regularly, say, 10 times per kernel, per thread)
In my case, should I have *any* mutexes? I currently have none, other than for some management on the host side.
Windows 7, driver version in device manager is 126.96.36.199. (I think I'm still using beta drivers)
For my problem, I think this is the source:
So far I have not been able to reproduce the problem in a simple example, but I also don't have much time to invest in this. Anyway, since the problem only occurs with the AMD GPU runtime, it seems to be driver related. It happens both when I have one context created by the main thread and accessed by multiple different threads, and when I have multiple contexts created by the main thread and accessed by multiple different threads. Note also that in the latter case no shared memory objects or kernels are used at all.
I realised my image memory objects weren't using the correct queue (all were using a "default" queue which the kernels weren't using). I'm not sure why the system still worked, but this may be the cause, even though the hang/deadlock was not related to any memory objects or kernels that were using the image objects at the time.
I added a host-side mutex when releasing memory objects, no help.
I then used that mutex when reading/writing to any memory object, where I then discovered my issue with image-memory-objects.
I'll update shortly if my problem has gone away, but currently my driver crashes before it hangs (though it's running for a lot longer), which I think is an out-of-bounds (OOB) memory access, as it gives me a memory violation when I execute on the CPU instead of the GPU.
Buffers are automatically copied between devices, but the OpenCL runtime will place a buffer on the device that the queue is associated with.
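For reference, the host-side mutex experiment mentioned above could be sketched roughly as follows. The names `SerializedReleaser` and `max_concurrent_releases` are hypothetical, and the wrapped callable stands in for whatever actually calls `clReleaseMemObject`; this is only an illustration of the idea, not the code used in the experiment. As reported above, guarding only the release calls did not make the hang go away.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper: funnels every release call through one host-side
// mutex, so at most one thread is inside the wrapped call at any time.
template <typename Handle>
class SerializedReleaser {
public:
    explicit SerializedReleaser(std::function<void(Handle)> release)
        : release_(std::move(release)) {
        assert(release_);  // the wrapped callable must be set
    }

    void operator()(Handle h) {
        std::lock_guard<std::mutex> lock(mutex_);
        release_(h);  // assumed: wraps clReleaseMemObject
    }

private:
    std::mutex mutex_;
    std::function<void(Handle)> release_;
};

// Demo: records the largest number of threads ever observed inside the
// wrapped call at the same moment while several threads release handles.
int max_concurrent_releases(unsigned num_threads, unsigned iterations) {
    std::atomic<int> inside{0};
    std::atomic<int> max_inside{0};

    SerializedReleaser<int> releaser([&](int) {
        int now = inside.fetch_add(1) + 1;
        int prev = max_inside.load();
        // Raise the recorded maximum if this thread saw a higher count.
        while (prev < now && !max_inside.compare_exchange_weak(prev, now)) {
        }
        inside.fetch_sub(1);
    });

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (unsigned i = 0; i < iterations; ++i) {
                releaser(static_cast<int>(i));
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return max_inside.load();
}
```

A mutex like this only serializes the calls it wraps. Reads, writes and kernel launches issued by other threads still enter the runtime concurrently, which is why the same mutex was then extended to cover reads and writes as well.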
Thanks for posting back and the quick experiments.
We will await a nice repro case so that we can start working on this.