
jean-claude
Journeyman III

Blue screen of death strikes again! Any other victims??

atikmdag.sys breaks Windows

Hi guys,

I suppose I'm not the only one to experience blue screens while running stream programs.

The scenario is always the same:

(1) After some successful runs, the same call to a sequence of kernels stalls the GPU, and after a while the driver "automatically recovers from the device".

 

(2) The system, although it seems to have recovered, works for a few more minutes, and then... here we go: boom! A blue screen pops up, and the dead screen refers to an ATIKMDAG error.

 

Over the past months, I found that:

- extending the Windows timeout (in the registry) helps to a certain extent... but not always...

- stopping ATIevent services seems to reduce the frequency of blue screens... but this doesn't fix the problem

 

Just to be precise, I'm running a series of 10 kernels 20 times per second, and typically the system breaks after 150 runs...

 

HD 2600 - Vista 32 bits.

0 Likes
12 Replies

jean-claude, if you have a test case that can consistently reproduce this, please email it to streamdeveloper@amd.com with your system configuration, OS version, SDK version, driver version, and steps to reproduce so that we can figure out what is going on and solve the problem.
0 Likes

Hi Micah,

My code involves a lot of modules, and stream computing is only one part of it.

Moreover, the application runs in real time and captures a video stream from a camera, so I think it would be difficult to dig into it.

Anyway, what I can do is to extract the kernel and brook call sequence code, and mail it to you.

Would this be acceptable?

Kind regards

Jean-Claude

 

0 Likes

Hi again Micah,

I've done my homework and tried to synthesize a (very simplified) structure of a program that generates the blue-screen crash.

I suspect that the problem may come from the synchronisation scheme in Brook (which is not really clear), as this might explain why the blue-screen bug shows up only after several successive kernel processing loops...

Thanks for having a look and providing some hints on how to approach a solution.

Kind regards

Jean-Claude

 

Here is the sketch :


in kernels.br file :
============


kernel kern_1( out float3 A<>, float3 B<> ) {...}

kernel kern_2( out float3 A<>, float3 B<> , float3 C<> ) {...}

kernel kern_3( out float3 A<>, float3 B<> ) {...}

kernel kern_4( out int3 A<>, int3 B<>, int3 C<> ) {...}

kernel kern_5( out float3 A<>, float3 B<>, float3 C<> ) {...}


in mainloop.cpp file :
===============

// declaration of working streams
// ------------------------------
unsigned int dims[2] = { 576,720 };           

brook::Stream<float3> stream_f1(2,dims);   
brook::Stream<float3> stream_f2(2,dims);
brook::Stream<float3> stream_f3(2,dims);       
brook::Stream<int3>   stream_SS(2,dims);       

brook::Stream<float3> stream_FL(2,dims);   
brook::Stream<float3> stream_FR(2,dims);               
brook::Stream<int3>   stream_FS(2,dims);           

brook::Stream<float3> stream_BL(2,dims);   
brook::Stream<float3> stream_BR(2,dims);
brook::Stream<int3>   stream_BS(2,dims);           


// Brook processing
// ----------------------
void Process( float *In, float *Out) {

    stream_f1.read(In);

    // Run kernels
    kern_1( stream_f2, stream_f1 );
    kern_2( stream_R, stream_FS, stream_f2 );
    kern_3( stream_FL, stream_FR );
    kern_4( stream_SS, stream_FS, stream_BS);
    kern_5 (stream_f3, stream_FL, stream_BL);

    stream_f3.write(Out);
}


// Main treatment
// --------------------
void treatment(void) {

    float dataIn[576*720*3];
    float dataOut[576*720*3];

    bool keep_looping=true;

    ...

    Get(dataIn);                           // grab one image
    stream_BL.read(dataIn);        // read it within float3 stream

    ... // additional processing

    ... // compute and set permanent streams stream_BR, stream_BL, stream_BS

    ...

    while (keep_looping) { // runs at video rate
        Get(dataIn);                        // grab one image

        Process (dataIn,dataOut);   // process it  <== AFTER SEVERAL LOOPS, THIS PROCESS IS CALLED BUT DOESN'T RETURN

        Display(dataOut);               // show resulting image
    }

}


.... AFTER A WHILE A MESSAGE SHOWS UP SAYING THAT THE GPU HAS BEEN NON RESPONSIVE AND WAS AUTOMATICALLY RECOVERED


.... LATER ON (COULD BE ONE MINUTE OR A DOZEN MINUTES) THE VISTA32 SYSTEM COLLAPSES INTO A BLUE SCREEN, WHICH SAYS THAT ATIKMDAG HAS MISBEHAVED...
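
One way to narrow down which call inside Process() fails to return is to write a marker to a file before each kernel call, so that the last line still on disk after the reboot names the step that was entered last. The following is only a minimal sketch: `Heartbeat` and `last_mark` are hypothetical helpers, not part of Brook+ or CAL, and flushing hands the line to the OS without guaranteeing that it reaches the disk before a blue screen.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper (not a Brook+ or CAL API): appends one line per step
// to a log file and flushes it, so the last line names the most recent step.
class Heartbeat {
public:
    explicit Heartbeat(const std::string& path)
        : out_(path, std::ios::out | std::ios::app) {}

    // Record that `step` of loop iteration `frame` is about to start.
    void mark(unsigned long frame, const std::string& step) {
        out_ << frame << ' ' << step << '\n';
        out_.flush();  // hand the line to the OS before the call is made
    }

private:
    std::ofstream out_;
};

// Returns the last non-empty line of the log, for inspection after a reboot.
std::string last_mark(const std::string& path) {
    std::ifstream in(path);
    std::string line, last;
    while (std::getline(in, line)) {
        if (!line.empty()) last = line;
    }
    return last;
}
```

In the sketch above, Process() would call something like `hb.mark(frame, "kern_1")` just before `kern_1(...)`, and likewise before each of the other kernels and before `stream_f3.write(Out)`.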

 

0 Likes

Hi jean-claude,

 

Not sure if you have already done this: you need to disable both VPU Recover and the watchdog timer to run any stream application for a long time.

0 Likes

Hi Gaurav,

Thanks for your feedback.

Yes, I already changed my watchdog timer, as mentioned in one of my former posts ( http://forums.amd.com/devforum/messageview.cfm?catid=328&threadid=100142&enterthread=y )

This helped to a certain extent, but doesn't fix the "sleeping bomb" phenomenon, i.e. crashing after a certain time.

I'm even wondering whether setting the watchdog timer to a large value could contribute to this delayed system bombing.

Having said that, I'll try to fully disable GPU recovery (i.e. setting the Vista registry value HKLM\System\CurrentControlSet\Control\GraphicsDrivers\TdrLevel to 0, which means no recovery).

 

Anyway, just some basic questions to clarify things: when the GPU is exercised through a processing loop involving several kernels, as for instance in:

for (int i = 1; i<=100; i++) {

       kern_1(...);

       kern_2(...);

       kern_3(...);

}

(1) What is considered to be the total GPU processing time (when does Vista trigger the watchdog timer)?

(2) Assuming kern_3 is fully independent of kern_2, how is the overall synchronization performed? I.e. is there any waiting point to ensure that the 100 "loops" here do not collapse into one mega-stream of work?

If so, shouldn't we have a wait statement somewhere? Such as:

for (int i = 1; i<=100; i++) {

       kern_1(...);

       kern_2(...);

       kern_3(...);

       wait_completion (kern_2,kern_3);

}

Thanks for clarifying the matter.

Regards

Jean-Claude

 

0 Likes

Hi Jean-Claude,

 

GPUs can't run different kernels at the same time, so there is an implicit synchronization after each kernel. It is possible that, if you don't flush the command queue for a particular kernel, the previous kernel keeps running. But Brook+ does flush all the commands for each kernel call.

0 Likes

Thanks Gaurav,

This clarifies a little bit the matter.

Just to let you know, I fully disabled GPU recovery (changed the Vista registry value TdrLevel to 0), and as a result:

I'm getting neither the "GPU recovery" message nor a blue screen (for the time being...), BUT the system stalls after some time (4 to 5 minutes) and needs to be rebooted by powering down...

Back to the possible cause:

It looks like, after several successful calls to the GPU kernel processing loop, one call doesn't return...

So is there any way to get a status report from the GPU inside the processing loop?

 

0 Likes

BTW, I'm using DirectShow too, so could there be some kind of deadlock interaction here?

0 Likes

Do you see this issue with only SDK 1.3 and Catalyst 8.12?

 

It would be a good idea to try experimenting with CAL 1.2.1 and Catalyst 8.10, keeping Brook+ version 1.3.

0 Likes

Hi again,

The same kind of issues did indeed exist under previous SDK and Catalyst versions (although, from time to time, the application did run a little bit longer...).

I've been waiting for SDK 1.3, hoping these issues would have been ironed out.

To be honest, it has taken me a while to convert the project to the new syntax (introduced in 1.3), among other things, so I'm not very keen to go back to a previous CAL and driver environment.

In particular, my ATI driver experience has always been a real pain, since every time I have to use the "Mobility Modder" utility to get the new drivers installed on the portable PC I have dedicated to the project...

If really needed I'm ready to go back to a mixture of SDK1.3 and previous CAL/Catalyst versions and work blindly.

Having said so, it would certainly be more productive to figure out what can stall the GPU while processing a loop of kernels.

Is there any way to query the GPU within the loop so that we get some form of dynamic status??

Many thanks for your involvement and support.

Jean-Claude
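
As a host-side stand-in for a status query, the processing step can be run on a worker thread while the calling thread waits with a deadline, so that the application at least learns that a call has not returned. This is a minimal sketch, assuming the step can safely run on another thread (the thread does not say whether Brook+ calls can): `run_with_deadline` is a hypothetical helper, it does not query the GPU, and it cannot cancel a stalled call.

```cpp
#include <chrono>
#include <functional>
#include <future>
#include <memory>
#include <thread>
#include <utility>

// Hypothetical helper: runs `step` on a worker thread and reports whether it
// finished within `timeout`. It only observes that the call has not returned;
// it neither queries nor resets the GPU.
bool run_with_deadline(std::function<void()> step,
                       std::chrono::milliseconds timeout) {
    auto done = std::make_shared<std::promise<void>>();
    std::future<void> finished = done->get_future();

    // The worker is detached so that a stalled step cannot block the caller.
    // `step` is assumed not to throw.
    std::thread([step = std::move(step), done]() {
        step();
        done->set_value();
    }).detach();

    return finished.wait_for(timeout) == std::future_status::ready;
}
```

When the function returns false, the loop could log the frame number and stop submitting further work instead of waiting for the watchdog to fire.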

 

 

0 Likes

OK, after a long configuration and debugging session, this is where I'm at:

(1) The program works almost OK provided it runs in debug mode under Visual Studio 2008, step by step (i.e. with one breakpoint in the middle of the kernel loop and a finger pushing F5 for each further loop execution).

This again suggests a real-time misbehaviour of the CAL drivers when dealing with a long loop of kernel executions.

(2) If the debug breakpoint is removed then ... boom! Blue screen (same scenario as reported before).

(3) In order to clarify the system parameters, this is the Vista registry setup:

HKLM\System\CurrentControlSet\Control\GraphicsDrivers

DxgKrnlVersion: Don't know what it is. My value is 0x1053

TdrLevel=1 – Bug check on detected timeout

TdrDelay= 60

TdrDdiDelay= 30

TdrDebugMode=0
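
For reference, the TdrLevel values can be spelled out in a small lookup. The sketch below follows the value names given in Microsoft's documentation of the TDR registry keys; `describe_tdr_level` itself is a hypothetical helper, and the value 1 corresponds to the "Bug check on detected timeout" label quoted above.

```cpp
#include <string>

// Hypothetical helper: human-readable meaning of the TdrLevel registry value,
// using the names from Microsoft's TDR registry key documentation.
std::string describe_tdr_level(unsigned level) {
    switch (level) {
        case 0: return "TdrLevelOff: timeout detection disabled";
        case 1: return "TdrLevelBugcheck: bug check on detected timeout (no recovery)";
        case 2: return "TdrLevelRecoverVGA: recover to VGA (not implemented)";
        case 3: return "TdrLevelRecover: recover on timeout (default)";
        default: return "unknown TdrLevel value";
    }
}
```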

 

Any support is appreciated, thanks !

 

 

 

0 Likes

BTW, I just tried experimenting with CAL 1.2.1 and Catalyst 8.10, keeping Brook+ version 1.3.

Same problem, even worse: the blue screen comes sooner!

I really think there is a recurring problem with the CAL/driver implementation on Vista.


0 Likes