12 Replies Latest reply on Dec 14, 2008 4:55 PM by jean-claude

    Blue screen of death strikes again!   Any other victims??

    jean-claude
      atikmdag.sys breaks windows

      Hi guys,

      I suppose I'm not the only one to experience "blue screens" while

      running stream programs.

      The scenario is always the same:

      (1) after some successful runs, the same call to a sequence of kernel treatments generates a stall in the GPU and after a while the driver

      "automatically recovers from the device"

       

      (2) The system, although it seems to have recovered, works for a few more

      minutes, and after a while ... here we go ==> boom; a blue screen pops up

      and a reference to an ATIKMDAG error appears on the dead screen

       

      Over the past months, I found that:

      - extending the Windows timeout (in the registry) helps to a certain extent... but not always...

      - stopping ATIevent services seems to reduce the frequency of blue screens... but this doesn't fix the problem

       

      Just to be precise, I'm running a series of 10 kernels 20 times per second,

      and typically the system breaks after 150 runs...

       

      HD 2600 - Vista 32 bits.
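For scale, the figures quoted above (20 runs per second, 10 kernels per run, failure after roughly 150 runs) can be turned into time budgets. The sketch below is only arithmetic on those numbers; the struct and function names are made up for illustration and are not part of Brook+ or CAL.

```cpp
#include <cassert>

// Hypothetical description of the workload, filled in from the post.
struct LoadProfile {
    double runs_per_second;  // 20 in the post
    int    kernels_per_run;  // 10 in the post
};

// Time available for one complete run, in milliseconds.
inline double run_budget_ms(const LoadProfile& p) {
    assert(p.runs_per_second > 0.0);
    return 1000.0 / p.runs_per_second;
}

// Average time available per kernel if each run is to finish within its budget.
inline double kernel_budget_ms(const LoadProfile& p) {
    assert(p.kernels_per_run > 0);
    return run_budget_ms(p) / p.kernels_per_run;
}

// Seconds elapsed after `runs` runs, assuming the nominal rate is sustained.
inline double seconds_until(const LoadProfile& p, int runs) {
    assert(p.runs_per_second > 0.0);
    return runs / p.runs_per_second;
}
```

With the posted figures this gives 50 ms per run, 5 ms per kernel on average, and a failure roughly 7.5 seconds into the loop if the nominal rate is actually sustained.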

        • Blue screen of death strikes again!   Any other victims??
          MicahVillmow
          jean-claude, if you have a test case that can consistently reproduce this, please email it to streamdeveloper@amd.com with your system configuration, os version, sdk version, driver version, and steps to reproduce so that we can figure out what is going on and solve the problem.
            • Blue screen of death strikes again!   Any other victims??
              jean-claude

              Hi Micah,

              My code involves a lot of modules and Stream computing is only one part of it.

              Moreover, the application is a real-time one that captures a video stream from a camera. So I think it would be difficult to dig into it.

              Anyway, what I can do is to extract the kernel and brook call sequence code, and mail it to you.

              Would this be acceptable?

              Kind regards

              Jean-Claude

               

                • Blue screen of death strikes again!   Any other victims??
                  jean-claude

                  Hi again Micah,

                  I've done my homework and tried to synthesize a (very simplified) structure

                  of a program that generates the blue-screen crash.

                  I suspect that the problem may come from the synchronisation scheme in Brook

                  (which is not really clear), as this might explain why the blue screen

                  bug shows up only after several successive kernel treatment loops...

                  Thanks for having a look and providing some hints on how to approach the solution.

                  Kind regards

                  Jean-Claude

                   

                  Here is the sketch :


                  in kernels.br file :
                  ============


                  kernel kern_1( out float3 A<>, float3 B<> ) {...}

                  kernel kern_2( out float3 A<>, float3 B<> , float3 C<> ) {...}

                  kernel kern_3( out float3 A<>, float3 B<> ) {...}

                  kernel kern_4( out int3 A<>, int3 B<>, int3 C<> ) {...}

                  kernel kern_5( out float3 A<>, float3 B<>, float3 C<> ) {...}


                  in mainloop.cpp file :
                  ===============

                  // declaration of working streams
                  // ------------------------------
                  unsigned int dims[2] = { 576,720 };           

                  brook::Stream<float3> stream_f1(2,dims);   
                  brook::Stream<float3> stream_f2(2,dims);
                  brook::Stream<float3> stream_f3(2,dims);       
                  brook::Stream<int3>   stream_SS(2,dims);       

                  brook::Stream<float3> stream_FL(2,dims);   
                  brook::Stream<float3> stream_FR(2,dims);               
                  brook::Stream<int3>   stream_FS(2,dims);           

                  brook::Stream<float3> stream_BL(2,dims);   
                  brook::Stream<float3> stream_BR(2,dims);
                  brook::Stream<int3>   stream_BS(2,dims);           


                  // Brook processing
                  // ----------------------
                  void Process( float *In, float *Out) {

                      stream_f1.read(In);

                      // Run kernels
                      kern_1( stream_f2, stream_f1 );
                      kern_2( stream_FR, stream_FS, stream_f2 );
                      kern_3( stream_FL, stream_FR );
                      kern_4( stream_SS, stream_FS, stream_BS);
                      kern_5 (stream_f3, stream_FL, stream_BL);

                      stream_f3.write(Out);
                  }


                  // Main treatment
                  // --------------------
                  void treatment(void) {

                      static float dataIn[576*720*3];    // static: about 5 MB each, too large for the default stack
                      static float dataOut[576*720*3];

                      bool keep_looping=true;

                      ...

                      Get(dataIn);                           // grab one image
                      stream_BL.read(dataIn);        // read it within float3 stream

                      ... // additional processing

                      ... // compute and set permanent streams stream_BR, stream_BL, stream_BS

                      ...

                      while (keep_looping) { // runs at video rate
                          Get(dataIn);                        // grab one image

                          Process (dataIn,dataOut);   // process it  <== AFTER SEVERAL LOOPS, THIS PROCESS IS CALLED BUT DOESN'T RETURN

                          Display(dataOut);               // show resulting image
                      }

                  }


                  .... AFTER A WHILE A MESSAGE SHOWS UP SAYING THAT THE GPU HAS BEEN NON RESPONSIVE AND WAS AUTOMATICALLY RECOVERED


                  .... LATER ON (COULD BE ONE MINUTE OR A DOZEN MINUTES) THE VISTA32 SYSTEM COLLAPSES INTO A BLUE SCREEN, and says ATIKMDAG has misbehaved...
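One detail of the sketch is worth checking independently of the driver: dataIn and dataOut each hold 576*720*3 floats, close to 5 MB, which is more than the 1 MB of stack that the Visual C++ linker reserves by default. If the real code keeps such buffers as plain locals, moving them to the heap avoids that risk. A minimal sketch, in which FrameBuffers and the constant names are hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Frame geometry taken from the sketch: dims = { 576, 720 }, float3 elements.
const std::size_t kRows = 576;
const std::size_t kCols = 720;
const std::size_t kChannels = 3;
const std::size_t kFloatsPerFrame = kRows * kCols * kChannels;  // 1,244,160 floats

// Hypothetical holder for the two host-side frames. std::vector keeps the
// storage on the heap, so the function that owns it uses almost no stack.
struct FrameBuffers {
    std::vector<float> in;
    std::vector<float> out;
    FrameBuffers() : in(kFloatsPerFrame), out(kFloatsPerFrame) {
        assert(in.size() == out.size());
    }
};

// Bytes needed for one frame on the host.
inline std::size_t frame_bytes() { return kFloatsPerFrame * sizeof(float); }
```

In the loop this would be used as `Get(&buf.in[0]); Process(&buf.in[0], &buf.out[0]);` with a single `FrameBuffers buf;` created before the loop starts.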

                   

                    • Blue screen of death strikes again!   Any other victims??
                      gaurav.garg

                      Hi jean-claude,

                       

                      Not sure if you have already done this. You need to disable both VPU Recover and the watchdog timer to run any stream application for a long time.

                        • Blue screen of death strikes again!   Any other victims??
                          jean-claude

                          Hi Gaurav,

                          Thanks for your feedback.

                          Yes, I already changed my watchdog timer, as mentioned in one of

                          my former posts ( http://forums.amd.com/devforum/messageview.cfm?catid=328&threadid=100142&enterthread=y )

                          This helped to a certain extent, but doesn't fix the "sleeping bomb" phenomenon, i.e. crashing after a certain time.

                          I'm even wondering whether having set the watchdog timer to a large value could contribute to this delayed system bombing?

                          Having said that, I'll try to fully disable GPU recovery (i.e. setting the Vista registry value HKLM\System\CurrentControlSet\Control\GraphicsDrivers\TdrLevel to 0, which means no recovery)

                           

                          Anyway, just some basic questions to clarify things: when the GPU is exercised through a processing loop involving several kernel treatments,

                          as for instance in :

                          for (int i = 1; i<=100; i++) {

                                 kern_1(...);

                                 kern_2(...);

                                 kern_3(...);

                          }

                          (1) what is considered to be the total GPU treatment time (when does Vista trigger the watchdog timer?)

                          (2)  assume the kern_3 treatment is fully independent of kern_2: how is the overall synchronization performed? i.e. is there any waiting point to ensure that the 100 "loops" here do not collapse into one mega-stream treatment?

                          if yes, shouldn't we have a wait statement somewhere ? such as

                          for (int i = 1; i<=100; i++) {

                                 kern_1(...);

                                 kern_2(...);

                                 kern_3(...);

                                 wait_completion (kern_2,kern_3);

                          }

                          Thanks for clarifying the matter.

                          Regards

                          Jean-Claude

                           

                            • Blue screen of death strikes again!   Any other victims??
                              gaurav.garg

                              Hi Jean-Claude,

                               

                              GPUs can't run different kernels at the same time. So, there is an implicit synchronization after each kernel. It is possible that, if you don't flush the command queue for a particular kernel, the previous kernel keeps running. But Brook+ does flush all the commands for each kernel call.

                                • Blue screen of death strikes again!   Any other victims??
                                  jean-claude

                                  Thanks Gaurav,

                                  This clarifies a little bit the matter.

                                  Just to let you know, I fully disabled GPU recovery (changed the Vista registry TdrLevel to 0) and as a result:

                                  I'm getting neither the "GPU recovery" message nor a blue screen (for the time being...) BUT the system stalls after some time (4 to 5 minutes) and needs to be rebooted by powering down ....

                                  Back to the possible cause :

                                  It looks like after several successful calls to the GPU kernel processing loop, one call doesn't return...

                                  So is there any way to get a status report from the GPU inside the processing loop???
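Nothing in this thread names a Brook+ call that reports GPU status, so the following is a host-side substitute rather than an answer to the question: the processing loop bumps a counter after every run, and a separate monitor thread notices when the counter stops advancing. It is a sketch under stated assumptions. The Heartbeat class is hypothetical, it uses C++11 library facilities that Visual Studio 2008 does not ship (Win32 equivalents would be needed there), and it only helps while the host itself is still responsive.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical helper, not part of Brook+ or CAL: a counter that the processing
// loop bumps after every completed run, so another thread can notice a stall.
class Heartbeat {
public:
    // Call from the processing loop, right after Process() returns.
    void beat() { count_.fetch_add(1, std::memory_order_relaxed); }

    std::uint64_t count() const { return count_.load(std::memory_order_relaxed); }

    // Polls for up to `limit`. Returns true as soon as the counter differs from
    // `last_seen`, false if no beat arrived in that time.
    bool advanced_within(std::uint64_t last_seen,
                         std::chrono::milliseconds limit) const {
        assert(limit.count() >= 0);
        const auto deadline = std::chrono::steady_clock::now() + limit;
        do {
            if (count() != last_seen) return true;
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        } while (std::chrono::steady_clock::now() < deadline);
        return count() != last_seen;
    }

private:
    std::atomic<std::uint64_t> count_{0};
};
```

A monitor thread would repeatedly read `count()`, call `advanced_within(last, limit)` with a limit of a few run periods, and on a `false` result log the event and tell the capture loop to stop submitting work. Keeping all kernel calls on the original thread avoids assuming anything about how the Brook+ runtime behaves when called from a second thread.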

                                   

                                    • Blue screen of death strikes again!   Any other victims??
                                      jean-claude

                                      BTW, I'm using DirectShow too, so could there be any kind of deadlock interaction here?

                                        • Blue screen of death strikes again!   Any other victims??
                                          gaurav.garg

                                          Do you see this issue with only SDK 1.3 and Catalyst 8.12?

                                           

                                          It will be a good idea if you try experimenting with CAL 1.2.1 and Catalyst 8.10, keeping Brook+ 1.3 version.

                                            • Blue screen of death strikes again!   Any other victims??
                                              jean-claude

                                              Hi again,

                                              The same kind of issues did indeed exist under previous SDK and Catalyst versions (although, from time to time, the application did run a little bit longer...)

                                              I've been waiting for SDK 1.3, hoping these issues would have been ironed out.

                                              To be honest, it has taken me a while to convert the project to the new syntax (proposed under 1.3), among other things, so I'm not very keen to go back to the previous CAL and driver environment.

                                              In particular, my ATI driver experience has always been a real pain, since every time I have to use the "Mobility Modder" utility to get the new drivers installed on the portable PC I have dedicated to the project...

                                              If really needed I'm ready to go back to a mixture of SDK1.3 and previous CAL/Catalyst versions and work blindly.

                                              Having said so, it would certainly be more productive to figure out what can stall the GPU while processing a loop of kernels.

                                              Is there any way to query the GPU within the loop so that we get some form of dynamic status??

                                              Many thanks for your involvement and support.

                                              Jean-Claude
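Short of a real status query, one way to see where a run stalls is to leave a breadcrumb on disk before each kernel call and read the last one back after the reboot. The sketch below assumes nothing about Brook+; Breadcrumbs and last_breadcrumb are hypothetical names. Note that flush() only hands the line to the operating system, so a blue screen can still lose the final entry; on Windows a write-through file or a FlushFileBuffers call after each mark would be needed to narrow that window.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical helper: appends "<run> <stage>" lines and flushes after each one.
class Breadcrumbs {
public:
    explicit Breadcrumbs(const std::string& path)
        : out_(path.c_str(), std::ios::trunc) {}

    // Records that `stage` of run number `run` is about to start.
    void mark(unsigned long run, const char* stage) {
        assert(stage != nullptr);
        out_ << run << ' ' << stage << '\n';
        out_.flush();  // hand the line to the OS before the kernel call is made
    }

    bool ok() const { return out_.good(); }

private:
    std::ofstream out_;
};

// Returns the last non-empty line of the log, i.e. the last stage that started.
inline std::string last_breadcrumb(const std::string& path) {
    std::ifstream in(path.c_str());
    std::string line, last;
    while (std::getline(in, line)) {
        if (!line.empty()) last = line;
    }
    return last;
}
```

Inside Process() this would look like `log.mark(run, "kern_1"); kern_1(...); log.mark(run, "kern_2"); kern_2(...);` and so on, with a final mark after `stream_f3.write(Out)`. If the last entry after a crash is always the same stage, that narrows down which call fails to return.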

                                               

                                               

                                                • Blue screen of death strikes again!   Any other victims??
                                                  jean-claude

                                                  OK, after a long configuration and debug session this is where I'm at:

                                                  (1) The program works almost OK provided it runs in debug mode under Visual Studio 2008, in step-by-step mode (i.e. one breakpoint in the middle of the kernel loop treatment and a finger pushing F5... for another loop execution)

                                                  This again suggests a real-time misbehaviour of the CAL drivers when dealing with a long loop of kernel executions.

                                                  (2) If the debug breakpoint is removed then ... boom! blue screen (same scenario as reported before)

                                                  (3) In order to clarify system parameters, this is the Vista registry set-up

                                                  HKLM\System\CurrentControlSet\Control\GraphicsDrivers

                                                  DxgKrnlVersion: Don't know what it is. My value is 0x1053

                                                  TdrLevel=1 – Bug check on detected timeout

                                                  TdrDelay= 60

                                                  TdrDdiDelay= 30

                                                  TdrDebugMode=0

                                                   

                                                  Any support is appreciated, thanks !
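The values listed above can be read against the meanings Microsoft documents for the TDR registry keys: TdrLevel 0 disables timeout detection, 1 requests a bug check when a timeout is detected, 2 is recovery to VGA (documented as not implemented) and 3 is normal recovery, the default. TdrDelay and TdrDdiDelay are both expressed in seconds. The sketch below only encodes that table; the function and struct names are hypothetical, and it does not read the registry.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: what the OS is configured to do when a GPU timeout is
// detected, keyed by the TdrLevel value under
// HKLM\System\CurrentControlSet\Control\GraphicsDrivers.
inline std::string on_timeout(unsigned tdr_level) {
    switch (tdr_level) {
        case 0:  return "detection disabled";
        case 1:  return "bug check (blue screen)";
        case 2:  return "recover to VGA (documented as not implemented)";
        case 3:  return "recover (default)";
        default: return "unknown";
    }
}

// The three values quoted in the post, all plain DWORDs.
struct TdrSettings {
    unsigned level;              // TdrLevel
    unsigned delay_seconds;      // TdrDelay
    unsigned ddi_delay_seconds;  // TdrDdiDelay
};

// True when a detected timeout ends in a bug check instead of a recovery attempt.
inline bool timeout_ends_in_bugcheck(const TdrSettings& s) {
    assert(s.level <= 3);
    return s.level == 1;
}
```

Read this way, the posted set-up (TdrLevel=1, TdrDelay=60) means that any GPU stall longer than 60 seconds is answered with a blue screen by design, so the blue screen may be the configured response to the stall rather than a second, separate failure. The stall itself remains the thing to track down.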