5 Replies Latest reply on Sep 10, 2009 9:50 AM by Raistmer

    How does the Brook+ runtime handle multiple processes using the GPU?

    Raistmer
      Some questions related to process preemption

      Let's consider two processes that use the GPU, A and B.
      Process A starts, allocates some buffers (global streams) on the GPU, executes a few kernels, and is then preempted by process B for some time.

      1) Will process B be able to allocate almost all of the GPU memory, or only the part left free after process A's allocations?

      After control returns to process A:
      2) Will process A's GPU buffers be refilled from host memory, or does its data stay in GPU memory while process A is preempted?
      3) Do process A's kernels need to be recompiled?
      4) Do process A's kernels need to be loaded onto the GPU again?

      The reasons I ask these questions:
      1) calDeviceGetStatus() returns the same amount of memory (480MB of free GPU RAM for my HD4870 512MB) no matter how many other GPU-using apps are running.
      2) While running concurrently with another GPU-using app, I see an increase of a few orders of magnitude (up to seconds!) in the maximum (and mean) run time of some blocks (a block includes a stream read and a kernel call). Each block is wrapped in a mutex shared by all running GPU-related apps, so there are no GPU context switches inside such a block.

      It looks like, after receiving control of the GPU again, the app has to wait while the Brook runtime somehow restores the GPU state, and this restoration includes a memory copy or kernel recompilation. Otherwise I don't see why the app should have to wait seconds before continuing.
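One way to narrow this down is to time the wait for the shared mutex separately from the work done while holding it. The sketch below is a minimal Python illustration, not Brook+ code: `gpu_lock` stands in for the cross-process mutex, and `run_block` is a hypothetical placeholder for the stream read plus kernel call. If the long times show up in the wait, the delay is contention for the lock; if they show up in the held section, something inside the block itself is slow.

```python
import threading
import time

# Stand-in for the mutex shared by all GPU-using apps (assumption: a real
# setup would need a cross-process lock, not an in-process one).
gpu_lock = threading.Lock()


def timed_block(run_block, lock=gpu_lock):
    """Run run_block() under the lock; return (wait_seconds, held_seconds)."""
    t0 = time.perf_counter()
    with lock:
        t1 = time.perf_counter()
        run_block()  # hypothetical: stream read + kernel call
        t2 = time.perf_counter()
    return t1 - t0, t2 - t1


def summarize(samples):
    """Return (mean, max) of a list of durations in seconds."""
    return sum(samples) / len(samples), max(samples)


if __name__ == "__main__":
    waits, helds = [], []
    for _ in range(5):
        wait, held = timed_block(lambda: time.sleep(0.01))
        waits.append(wait)
        helds.append(held)
    print("wait mean/max:", summarize(waits))
    print("held mean/max:", summarize(helds))
```

Logging both numbers per block, rather than a single total, makes it possible to tell lock contention apart from slow GPU work.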

      EDIT: And what is the recommended practice for GPU sharing?
      Should I use a fine-grained GPU lock (as I do now), where only a small part of the whole task is completed before the GPU is released to other apps? Or should the lock be coarse-grained, where the same process holds the GPU until its allocated buffers are no longer needed, so that the app can reallocate all of them the next time it acquires the GPU?
        • How does the Brook+ runtime handle multiple processes using the GPU?
          Gipsel

           



          In my experience, different CAL contexts (opened by several processes using Brook, for example) are more or less independent. That means data loaded into GPU memory in one context stays there while another context is active. The same goes for compiled kernels: they don't need to be recompiled after each context switch.

          Otherwise my stuff wouldn't work as it does (reaching 175 double-precision GFlops on average on an HD4870). In my case context switches happen every 30 milliseconds and each context allocates about 70MB of GPU memory. The speed is independent of the number of contexts, as long as they fit in memory (two contexts are even slightly faster than a single one without switches).

          If you see severe delays after context switches, my first guess would be a synchronization problem, i.e. some operation is still running and blocks execution in the other context. This may also lead to stability issues with VPU recovers and such. At least that is what I have seen. If one submits kernel calls simultaneously from different contexts, there appears to be a possible race condition that blocks kernel execution altogether, resulting in a driver reset after some seconds. In other words, the whole thing is obviously not thread safe, which is the reason a mutex is required to synchronize the kernel calls.

          I would implement such a GPU sharing feature only if it is absolutely necessary. It makes some sense if one can't sustain a high GPU load from a single process. Otherwise it only complicates the application, which shouldn't be necessary now that BOINC is starting to support ATI GPUs.

          Edit: One could think about adding some kind of CUDA stream equivalent to Brook. Being able to queue kernel calls (and also stream reads and writes), with the runtime managing the queue, would be quite user friendly.
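The queueing idea can be sketched in a few lines. This is only an illustration in Python, not the CUDA stream API and not anything Brook+ provides: `GpuQueue`, `submit` and `close` are hypothetical names, and the submitted callables stand in for kernel calls and stream reads or writes. A single worker thread drains a FIFO queue, so operations run in submission order and callers never touch the device concurrently.

```python
import queue
import threading
from concurrent.futures import Future


class GpuQueue:
    """Hypothetical in-order command queue: one worker runs submitted calls."""

    def __init__(self):
        self._q = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, fn, *args):
        """Enqueue fn(*args); return a Future for its result."""
        fut = Future()
        self._q.put((fn, args, fut))
        return fut

    def _run(self):
        while True:
            item = self._q.get()
            if item is None:  # shutdown sentinel
                break
            fn, args, fut = item
            try:
                fut.set_result(fn(*args))
            except Exception as exc:
                fut.set_exception(exc)

    def close(self):
        """Stop the worker after all previously queued operations have run."""
        self._q.put(None)
        self._worker.join()


if __name__ == "__main__":
    q = GpuQueue()
    log = []
    for step in ("stream write", "kernel call", "stream read"):
        q.submit(log.append, step)
    q.close()
    print(log)
```

Because only the worker thread issues operations, the caller needs no explicit mutex around each call; waiting on the returned `Future` takes the place of a blocking call.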

            • How does the Brook+ runtime handle multiple processes using the GPU?
              Raistmer
              Thanks!
              Yes, in my case the GPU is used for only one of the searches that the app performs (I'm still waiting for a good GPU FFT library from ATI to go further). So GPU sharing is required, but it can certainly be made more coarse-grained than it is now. I will look into possible synchronization issues too.
                • How does the Brook+ runtime handle multiple processes using the GPU?
                  Raistmer
                  Well, it seems the Brook+ runtime adds an additional layer that can behave quite differently from the CAL runtime.

                  You see fast context switches as long as the total memory allocated by the apps fits in the GPU RAM available.
                  But what happens if it doesn't?

                  I would expect the memory allocation routine to fail with some error code.

                  But in my Brook+ app I see something very different.
                  Currently four instances of my app run simultaneously without any memory allocation failure.

                  Most of the memory is allocated at the very beginning of the app and stays allocated for the app's whole lifetime.

                  Each instance requires >150MB of GPU RAM, and the card has 512MB.
                  4x150=600>512 - how is this possible?
                  The only explanation I see is that the Brook runtime swaps memory!

                  Any ideas?

                    • How does the Brook+ runtime handle multiple processes using the GPU?
                      Gipsel

                       

                      Originally posted by: Raistmer Well, it seems the Brook+ runtime adds an additional layer that can behave quite differently from the CAL runtime.


                      It acts like a wrapper and caches some stuff. You can see what it is doing as the source code is delivered with the SDK.

                       

                      Originally posted by: Raistmer You see fast context switches as long as the total memory allocated by the apps fits in the GPU RAM available. But what happens if it doesn't?


                      It gets slower.

                       

                      Originally posted by: Raistmer I would expect the memory allocation routine to fail with some error code. But in my Brook+ app I see something very different. Currently four instances of my app run simultaneously without any memory allocation failure. Most of the memory is allocated at the very beginning of the app and stays allocated for the app's whole lifetime. Each instance requires >150MB of GPU RAM, and the card has 512MB. 4x150=600>512 - how is this possible? The only explanation I see is that the Brook runtime swaps memory! Any ideas?


                      Yes, it fails only if you exceed the card's memory by a lot. The simple explanation I came up with for myself is that the driver (not the Brook runtime) handles the swapping. Actually it should be the same as a game using more texture data than there is memory on the card (streams are usually equivalent to textures). The data can be either in GPU memory (fast) or in host memory (slower). The driver tries to manage this to get the highest performance, but quite a lot of data may be transferred over the interface, which of course slows things down.
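The arithmetic behind the question can be written out as a small check. This is a rough sketch in Python using only the figures quoted in the thread; `oversubscription` is a hypothetical helper, and it assumes the whole card is available to applications, which is optimistic given that only 480MB of the 512MB was reported free.

```python
def oversubscription(per_instance_mb, instances, card_mb):
    """Return (total_mb, overflow_mb); overflow is 0 if everything fits."""
    total = per_instance_mb * instances
    return total, max(0, total - card_mb)


if __name__ == "__main__":
    # Four instances of at least 150MB each on a 512MB card.
    print(oversubscription(150, 4, 512))  # (600, 88)
    # Two contexts of about 70MB each fit with room to spare.
    print(oversubscription(70, 2, 512))   # (140, 0)
```

With four instances, at least 88MB cannot be resident on the card at once, which is the portion that would have to live in host memory and move over the bus when needed.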