25 Replies Latest reply on Feb 9, 2014 8:58 AM by nou

    Multiple contexts parallel allocating or writing to memory of a single device


      Hello, I have a program which uses OpenMP to schedule work in parallel to one OpenCL device, i.e. a GPU. This is currently done by using multiple contexts, each of which has its own queues and buffers. The program stops after some iteration steps. I mean it just stops, without exiting, a segmentation fault, or anything. Could it be that allocation from multiple contexts is not thread-safe? Do I have to use one context and a queue for each thread (which is my plan for the future anyway)? Btw. this only happens on a GPU device; CPU devices work fine.


      Thx in advance.

        • Re: Multiple contexts parallel allocating or writing to memory of a single device

          Multiple threads operating on a context are supported from OpenCL 1.1 onwards. All OpenCL calls are thread-safe except "clSetKernelArg". Even with this API, multiple threads can still work with unique cl_kernel objects; however, they cannot work with the same cl_kernel object at the same time. So, per-thread allocation of "cl_kernel" objects will help overcome this issue.

          Check Appendix A.2 of the OpenCL spec. So, as long as your platform is OpenCL 1.1 or later, you can use just one context and let all your OpenMP threads work with it.


          However, if multiple threads are reading/writing shared "cl_mem" objects across multiple command queues, this can result in undefined behaviour. Check Appendix A.1 of the OpenCL spec; that should resolve your doubts.


          Now, coming to the issue you are facing:

          I am not sure what you mean by the program stopping without a seg-fault. You may want to first find out up to which point the application runs, or post your sources as a standalone zip file which we can use to reproduce the problem here.

          You need to also specify the following:

          1. Platform - win32 / win64 / lin32 / lin64 or some other?

              Win7, Vista, or Win8; similarly for Linux, your distribution

          2. Version of driver

          3. CPU or GPU Target?

          4. CPU/GPU details of your hardware


          • Re: Multiple contexts parallel allocating or writing to memory of a single device

            It has been a while, but my problem still exists. My earlier responses were inaccurate because the remote session didn't use the GPU and only found the CPU, the classic headless problem. I am now able to access the GPU remotely, but then my "stopping" problem appears again. I believe a deadlock is happening when a memory object is released in the multiple-command-queue, multiple-thread scenario. Here is part of my debug log, taken when execution stops:

            [debug]#0  0x00007ffff582d420 in sem_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
            [debug]#1  0x00007fffef1f9ba0 in amd::Semaphore::wait() () from /usr/lib/libamdocl64.so
            [debug]#2  0x00007fffef1f6162 in amd::Monitor::finishLock() () from /usr/lib/libamdocl64.so
            [debug]#3  0x00007fffef21f6fc in gpu::Device::ScopedLockVgpus::ScopedLockVgpus(gpu::Device const&) () from /usr/lib/libamdocl64.so
            [debug]#4  0x00007fffef242c3e in gpu::Resource::free() () from /usr/lib/libamdocl64.so
            [debug]#5  0x00007fffef243207 in gpu::Resource::~Resource() () from /usr/lib/libamdocl64.so
            [debug]#6  0x00007fffef22fd3d in gpu::Memory::~Memory() () from /usr/lib/libamdocl64.so
            [debug]#7  0x00007fffef23123f in gpu::Buffer::~Buffer() () from /usr/lib/libamdocl64.so
            [debug]#8  0x00007fffef1e8998 in amd::Memory::~Memory() () from /usr/lib/libamdocl64.so
            [debug]#9  0x00007fffef1e9607 in amd::Buffer::~Buffer() () from /usr/lib/libamdocl64.so
            [debug]#10 0x00007fffef1f41eb in amd::ReferenceCountedObject::release() () from /usr/lib/libamdocl64.so
            [debug]#11 0x00007fffef1c5a37 in clReleaseMemObject () from /usr/lib/libamdocl64.so

            I will try to reproduce this by focusing on threaded allocation and release of memory in a minimal example. Hopefully this leads somewhere. It would be nice to solve this so I can convince my boss to buy some of the 7990 cards for our computing.