2 Replies Latest reply on Feb 5, 2010 9:06 AM by frankas

    Avoiding busy wait. Posix signals etc ???

    frankas

      As noted in other threads here, heavy work on a multi-GPU setup seems to require 1 CPU core per GPU. The irony of this is that for a 3 x 5970 setup, one is forced to choose an Intel CPU (e.g. i7).

      The problem seems to be rooted in the fact that there aren't any Posix-type signals (on Linux at least) that the CPU threads can wait on to be notified that a GPU operation has completed.

      I could try to code around this limitation by having more GPU jobs in the pipeline and guessing how long it is safe to usleep() before waking. But this is far from an ideal situation.
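A minimal sketch of that sleep-and-poll workaround, under stated assumptions: `done_fn`, `poll_with_backoff` and `countdown_done` are hypothetical names, and the completion check is abstracted behind a callback. In a CAL program the callback might wrap a non-blocking query such as calCtxIsEventDone(), but that wiring is an assumption and is not shown here.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback: returns true once the GPU operation has finished.
   Assumption: a real version would wrap a non-blocking completion query. */
typedef bool (*done_fn)(void *state);

/* Poll done() and sleep between polls instead of spinning. The sleep starts
   at min_us and doubles after every unsuccessful poll, capped at max_us.
   Returns the number of polls that were needed. */
static unsigned poll_with_backoff(done_fn done, void *state,
                                  long min_us, long max_us)
{
    assert(done != NULL && min_us > 0 && max_us >= min_us);

    unsigned polls = 0;
    long delay_us = min_us;

    for (;;) {
        polls++;
        if (done(state))
            return polls;

        struct timespec ts = {
            .tv_sec  = delay_us / 1000000L,
            .tv_nsec = (delay_us % 1000000L) * 1000L
        };
        nanosleep(&ts, NULL);   /* the CPU core is free while we sleep */

        delay_us *= 2;
        if (delay_us > max_us)
            delay_us = max_us;
    }
}

/* Stand-in for a GPU job: reports "not done" until the counter hits zero. */
static bool countdown_done(void *state)
{
    int *remaining = state;
    if (*remaining == 0)
        return true;
    (*remaining)--;
    return false;
}
```

The cap `max_us` bounds the extra latency added after the job really finishes, which is the cost of guessing the sleep length; a smaller cap trades CPU time for responsiveness.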

      But perhaps such a signal is available, and its documentation has simply eluded me?

       

        • Avoiding busy wait. Posix signals etc ???
          empty_knapsack

          There is an undocumented function, calCtxWaitForEvents(); at least it exists on Windows. It can be acquired with

          if (calExtSupported(CAL_EXT_8009) == CAL_RESULT_OK) {
            if (calExtGetProc((CALextproc*)&calCtxWaitForEvents, CAL_EXT_8009, "calCtxWaitForEvents") == CAL_RESULT_OK) ...

          with prototype:

          typedef CALresult (CALAPIENTRYP PFNCALCTXWAITFOREVENTS)(CALcontext ctx, CALevent *event, CALuint num, CALuint flags);

          (I'm unsure about the flags value; I guess it must always be 0.)

          It is used by ATI's OpenCL but is not documented anywhere. When called, it waits for the GPU kernel to complete (it will probably work with memory copies too), and while it waits, CPU usage drops to zero. However, it normally works only with one GPU per process, as it blocks every context created by the process, not just the one doing the work.

          ... It is really amazing that in 2010 ATI still cannot provide a normal multi-threaded DLL for working with CAL (calResMap and some others also block everything when used); it is so inelegant to use IPC when it isn't needed at all...

           

            • Avoiding busy wait. Posix signals etc ???
              frankas

               

              Originally posted by: empty_knapsack

              ... It is really amazing that in 2010 ATI still cannot provide a normal multi-threaded DLL for working with CAL (calResMap and some others also block everything when used); it is so inelegant to use IPC when it isn't needed at all...

               



              I am in the process of moving code from Brook to CAL for memory management. I am wondering what the nature of this blocking is. I would like to access remote cacheable memory with the CPU while other kernels are running. Can I expect everything else to block between the map() and unmap() calls?

              Update:

              I am running a single thread of code that is very heavy on GPU computation. The CPU thread is in a tight loop that does these 4 things:

              1) Initiate DMA

              2) Initiate kernel execution (23 ms compute time)

              3) Map, CPU processing in cached memory (0.7 ms), Unmap

              4) Wait for event

               

              When I run this on a single GPU in 2 contexts, calCtx...Counter shows idle times in the region of 3-7%. But when I run the same thread with 4 contexts on 2 GPUs (2+2), I have 2 contexts showing 3-7% and the 2 others showing 36-40% idle time.

              But the strange thing is that each GPU has 1 "good" and 1 "bad" context, so these numbers may very well be a bug in the calCounters. Since there are multiple GPU operations between each Begin and End call, one should expect to see the exact same numbers for all contexts on a GPU, not the strange figures of 3% and 37%.

              However, performance has really taken a hit. It should scale almost linearly, but I get:

              1 GPU:  2200 completed jobs / second

              2 GPUs: 3300 completed jobs / second (75% efficiency, i.e. 3300 out of the ideal 2 x 2200 = 4400)

              I will try to detect locking in the kernel, but it is quite obvious that a fork() is needed to bring performance up to expected levels.
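A minimal sketch of the fork()-per-GPU layout, assuming POSIX. `run_gpu_worker` and `run_workers` are hypothetical names; the worker body is a placeholder where a real program would open its device, create its contexts and run the job loop. Doing all GPU runtime initialisation inside the child, after fork(), is an assumption made here to keep driver state from being shared between processes. The sketch only shows the process layout and does not establish that it cures the slowdown.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical per-GPU worker. A real one would initialise the GPU runtime
   for device `gpu` here, in the child, and then run the job loop.
   Returns 0 on success. */
static int run_gpu_worker(int gpu)
{
    (void)gpu;          /* placeholder: no GPU work in this sketch */
    return 0;
}

/* Fork one child process per GPU and wait for all of them.
   Returns the number of children that exited with status 0. */
static int run_workers(int num_gpus, int (*worker)(int))
{
    assert(num_gpus >= 0 && worker != NULL);

    int started = 0;
    for (int gpu = 0; gpu < num_gpus; gpu++) {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            break;      /* still reap the children already started */
        }
        if (pid == 0)   /* child: run the worker, then leave immediately */
            _exit(worker(gpu) == 0 ? 0 : 1);
        started++;
    }

    int ok = 0;
    for (int i = 0; i < started; i++) {
        int status;
        if (wait(&status) > 0 && WIFEXITED(status) && WEXITSTATUS(status) == 0)
            ok++;
    }
    return ok;
}
```

Each child gets its own address space, so a call that blocks every context in one process cannot stall the worker that drives another GPU.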

              More updates:

              I discovered, quite unsurprisingly, that the Flush calls can take a long time to complete. But what was stranger is that if you execute a kernel and then immediately wait for its event, you get a long stall. Waiting for anything else is fine, and helps keep things flowing.

              However, the multi-GPU slowdown (together with the strange idle figures) happens even across multiple processes, after I removed these potential sources of locking. Brook+ somehow manages to avoid the issue that I am having, so I must be doing something "wrong".
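One way to avoid waiting on the kernel that was just launched is to keep several jobs in flight and always wait on the oldest one. The sketch below illustrates only that ordering; `pipeline_ops`, `run_pipeline`, `DEPTH` and the demo hooks are hypothetical, and the submit and wait hooks stand in for whatever calls actually start the work and block on its event.

```c
#include <assert.h>
#include <stddef.h>

#define DEPTH 4   /* jobs kept in flight; an arbitrary choice for the sketch */

/* Hypothetical hooks standing in for the real submit and wait calls. */
typedef struct {
    void (*submit)(int slot, void *user);  /* start DMA + kernel for slot   */
    void (*wait)(int slot, void *user);    /* block until slot has finished */
    void *user;
} pipeline_ops;

/* Run total_jobs jobs with up to DEPTH of them in flight. Each wait targets
   the oldest outstanding slot, so it is never the job launched a moment ago
   unless only one job is outstanding. */
static void run_pipeline(const pipeline_ops *ops, int total_jobs)
{
    assert(ops != NULL && total_jobs >= 0);

    int submitted = 0, completed = 0;
    while (completed < total_jobs) {
        /* Top the pipeline up before blocking on anything. */
        while (submitted < total_jobs && submitted - completed < DEPTH) {
            ops->submit(submitted % DEPTH, ops->user);
            submitted++;
        }
        /* Retire the oldest job; its slot becomes free for reuse. */
        ops->wait(completed % DEPTH, ops->user);
        completed++;
    }
}

/* Demo hooks that record call order: +(slot+1) for submit, -(slot+1) for wait. */
typedef struct { int log[64]; int n; } trace;

static void demo_submit(int slot, void *user)
{
    trace *t = user;
    t->log[t->n++] = slot + 1;
}

static void demo_wait(int slot, void *user)
{
    trace *t = user;
    t->log[t->n++] = -(slot + 1);
}
```

With six jobs and the demo hooks, the recorded order is four submits, then alternating wait and submit while new jobs remain, then the remaining waits.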

              ...

              and that turns out to be having multiple contexts per GPU. I was trying to avoid (re)binding kernel inputs / outputs all the time by keeping a circular buffer of contexts. Suddenly this all seems very fixable.