10 Replies Latest reply on Aug 18, 2009 6:33 PM by Raistmer

    4 instances of same app using brook lead to driver restart

    Raistmer
      while 1 instance runs OK

      AFAIK a long-running kernel can trigger a driver restart under Vista if it executes for more than 2 seconds.
      None of the kernels in my app takes that long. Moreover, I increased the driver restart limit to 15 seconds via the registry.
      Still, running 4 copies of the app leads to a driver restart from time to time.
      Why can't the Brook runtime schedule the work correctly, so that the driver is never unavailable to the OS watchdog timer for too long?
        • 4 instances of same app using brook lead to driver restart
          Gipsel

           

          Originally posted by: Raistmer
          AFAIK a long-running kernel can trigger a driver restart under Vista if it executes for more than 2 seconds. None of the kernels in my app takes that long. Moreover, I increased the driver restart limit to 15 seconds via the registry. Still, running 4 copies of the app leads to a driver restart from time to time. Why can't the Brook runtime schedule the work correctly, so that the driver is never unavailable to the OS watchdog timer for too long?


          Are all four instances using the same GPU? In such cases I use one mutex per GPU, so that only one kernel can be started on a given GPU at a time (the waiting kernels are then served in round-robin fashion).
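A toy model of the scheme described above, not the actual implementation: it only simulates the order in which kernels would run if each GPU admits one kernel at a time and waiting instances are woken in arrival order. The function name `simulate_round_robin` is hypothetical, and strict first-come-first-served wake-up is an assumption; a real OS mutex may not guarantee it.

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Hypothetical toy model: returns the order in which kernels from
// `instances` app instances get the GPU when only one kernel may run at a
// time and waiters are served first-come-first-served (an assumption).
std::vector<int> simulate_round_robin(int instances, int kernels_per_instance)
{
    assert(instances >= 0 && kernels_per_instance >= 0);

    std::vector<int> remaining(instances, kernels_per_instance);
    std::queue<int> waiting;  // instances currently waiting for the GPU
    if (kernels_per_instance > 0) {
        for (int id = 0; id < instances; ++id) {
            waiting.push(id);
        }
    }

    std::vector<int> order;
    while (!waiting.empty()) {
        int id = waiting.front();  // the longest-waiting instance gets the GPU
        waiting.pop();
        order.push_back(id);       // exactly one kernel runs at this step
        if (--remaining[id] > 0) {
            waiting.push(id);      // it queues again for its next kernel
        }
    }
    return order;
}
```

With four instances the model yields the interleaving 0, 1, 2, 3, 0, 1, 2, 3, ..., so no instance launches two kernels back to back while others are waiting.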

            • 4 instances of same app using brook lead to driver restart
              Raistmer
              Yes, all apps use the same GPU.
              Thanks for the suggestion.
              I thought such serialization was done at the driver level... or at least in the Brook/CAL runtime %)
                • 4 instances of same app using brook lead to driver restart
                  Raistmer
                  @Gipsel
                  Do you use a named mutex in the MW opt GPU app?
                  Is it possible to use that mutex to serialize GPU access between several BOINC apps?
                  If so, what is its name?
                    • 4 instances of same app using brook lead to driver restart
                      Gipsel

                       

                      Originally posted by: Raistmer
                      @Gipsel Do you use a named mutex in the MW opt GPU app? Is it possible to use that mutex to serialize GPU access between several BOINC apps? If so, what is its name?


                      The same mutex names are already used at MW@home as well as Collatz@home. And yes, running Collatz and MW on the same GPU works, although the current Collatz app uses much smaller execution domains than the MW one (and doesn't get the multi GPU stuff right), so the GPU time is not evenly split between those two apps. But that will hopefully change with the next version where I have more influence than on the current one.

                      But it may become superfluous with the modified client versions of Crunch3r (the modifications are now already in the official development versions), as it starts only a single instance per GPU and tells the app with a command line parameter ("--device #") which GPU to use (I guess it's the same behaviour as with CUDA).
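A minimal sketch of how an app might read such a "--device #" parameter. The helper name `parse_device_arg` and the fallback behaviour are assumptions for illustration; the thread does not show how the real apps parse their command line.

```cpp
#include <cassert>
#include <climits>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: returns the GPU index given after "--device",
// or `fallback` if the option is missing or its value is not a valid
// non-negative integer.
int parse_device_arg(int argc, const char* const argv[], int fallback)
{
    assert(argv != nullptr);
    for (int i = 1; i + 1 < argc; ++i) {
        if (std::strcmp(argv[i], "--device") == 0) {
            char* end = nullptr;
            long value = std::strtol(argv[i + 1], &end, 10);
            bool parsed = (end != argv[i + 1]) && (*end == '\0');
            if (parsed && value >= 0 && value <= INT_MAX) {
                return static_cast<int>(value);
            }
            return fallback;  // option present, but value unusable
        }
    }
    return fallback;  // option not present
}
```

In `main` this could be used as `int which_device = parse_device_arg(argc, argv, 0);`, with device 0 as the assumed default.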

                      Nevertheless, this is the code fragment which constructs the mutex names (as I mentioned one per GPU exists). The mutex names are "Global\\Milkyway_ATI_GPU_App_Mutex#", with "#" being the device number of the used GPU (the "which_device" variable in the code below).

                      char mutex_name[64];
                      strcpy(mutex_name, "Global\\Milkyway_ATI_GPU_App_Mutex");
                      [..]
                      itoa(which_device, &(mutex_name[strlen(mutex_name)]),10); // construct mutex name for the chosen GPU
                      GPU_mutex = CreateMutex(&GPU_secatt,false,mutex_name); // opens named mutex, open it, if it already exists, but never obtain it directly
                      if (GPU_mutex==NULL) // if it fails
                      {
                          GPU_mutex=OpenMutex(MUTEX_MODIFY_STATE,false,mutex_name); // try again with less rights
                          if (GPU_mutex==NULL)
                          {
                              cerr<<"Couldn't obtain mutex for GPU access!"<<endl<<flush;
                              return(1);
                          }
                      }

                      // kernel calls are enclosed in the following construct
                      WaitForSingleObject(GPU_mutex,INFINITE); // obtain mutex (waiting for the GPU to become available), wait forever, if necessary
                      GPU_time_s = dtime();
                      [.. kernel calls ..]
                      GPU_time += dtime() - GPU_time_s;
                      ReleaseMutex(GPU_mutex);
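Since `itoa` is not part of standard C or C++, the name construction could also be written portably with `snprintf`. This is only a sketch: the helper name `make_gpu_mutex_name` is hypothetical, while the name prefix is the one quoted in the post above.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: builds the per-GPU mutex name
// "Global\Milkyway_ATI_GPU_App_Mutex<N>" for device number N.
std::string make_gpu_mutex_name(int which_device)
{
    assert(which_device >= 0);
    char mutex_name[64];
    // snprintf never writes past the buffer and always null-terminates.
    std::snprintf(mutex_name, sizeof(mutex_name),
                  "Global\\Milkyway_ATI_GPU_App_Mutex%d", which_device);
    return std::string(mutex_name);
}
```

The resulting string can be passed to `CreateMutex` via `.c_str()`. Any other app that wants to share the GPU with these apps would have to build exactly the same name for the same device number.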

                        • 4 instances of same app using brook lead to driver restart
                          Raistmer
                          OK, thank you very much for the info!
                          In the current state of AP (only FFA is ported to the GPU), launching a single app per GPU would waste the GPU, so I'd better use mutexes.
                          But it would be great if AP played nicely with other GPU apps.
                            • 4 instances of same app using brook lead to driver restart
                              Raistmer
                              "&GPU_secatt"
                              Do you use any specific access rights? Won't plain NULL work?
                                • 4 instances of same app using brook lead to driver restart
                                  Gipsel

                                   

                                  Originally posted by: Raistmer
                                  "&GPU_secatt" Do you use any specific access rights? Won't plain NULL work?


                                  I don't use specific rights at the moment, but I have been thinking about it, because the default access rights don't allow another user to access the same mutex. That means one can't test an application standalone while another instance has been launched by the BOINC client. But it doesn't matter on a normal system.

                                  GPU_secatt.lpSecurityDescriptor=NULL;
                                  GPU_secatt.bInheritHandle=false;
                                  GPU_secatt.nLength=sizeof(GPU_secatt);

                                    • 4 instances of same app using brook lead to driver restart
                                      Raistmer
                                      Thanks again!
                                      Currently testing with 3 AP + 1 MW running; will see whether the driver restarts have stopped.

                                      EDIT: much more stable now, only a single driver restart so far
                                        • 4 instances of same app using brook lead to driver restart
                                          Gipsel

                                           

                                          Originally posted by: Raistmer

                                          EDIT: much more stable now, only a single driver restart so far



                                          Are you testing with Vista or WinXP? I would really like to know if the stability problems with newer drivers under XP are gone with the SDK1.4 you use (at least I guess you are using 1.4).

                                            • 4 instances of same app using brook lead to driver restart
                                              Raistmer
                                              Originally posted by: Gipsel

                                              Originally posted by: Raistmer

                                              EDIT: much more stable now, only a single driver restart so far

                                              Are you testing with Vista or WinXP? I would really like to know if the stability problems with newer drivers under XP are gone with the SDK1.4 you use (at least I guess you are using 1.4).



                                              Well, I use Vista x86, Catalyst 9.2 (because newer versions don't support 1D stream sizes >8192 and I need such sizes) and SDK 1.4 beta (Brook+ only, no hand-written IL kernels yet).
                                              With only MW (default settings) it is very stable: only one system freeze in a few months (actually just the GUI; the filesystem was still accessible remotely without problems).
                                              Sure, when I run something like AOE III with MW active, a driver restart is guaranteed, but if no 3D stuff is running everything is OK.
                                              It's a Q9450-based host.

                                              But when I tried to launch MW on another host (AMD Athlon 64 based, WinXP x86), I always get driver recoveries or system BSoDs sooner or later. The record is a few MW tasks done and reported. Usually it hangs before completing the first task.
                                              I tried Catalyst 8.12, 9.1 and 9.2 - no success....
                                              Slowdown options like n1 f100 w2 don't help either: sooner or later the system freezes...