Hi guys,
I suppose I'm not the only one to experience "blue screens" while
running stream programs.
The scenario is always the same:
(1) After some successful runs, the same call to a sequence of kernel treatments stalls the GPU, and after a while the driver
"automatically recovers" the device.
(2) The system, although it seems to have recovered, works for a few more
minutes, and then ... here we go ==> boom: a blue screen pops up
with a reference to an ATIKMDAG error on the dead screen.
Over the past months, I found that:
- extending the Windows timeout (in the registry) helps to a certain extent... but not always...
- stopping the ATI event services seems to reduce the frequency of blue screens... but this doesn't fix the problem
Just to be precise, I'm running a series of 10 kernels 20 times per second,
and typically the system breaks after 150 runs...
HD 2600 - Vista 32 bits.
Hi Micah,
My code involves a lot of modules and Stream computing is only one part of it.
Moreover, the application is a real-time one that captures a video stream from a camera, so I think it would be difficult to dig into it.
Anyway, what I can do is extract the kernel and Brook call sequence code, and mail it to you.
Would this be acceptable?
Kind regards
Jean-Claude
Hi again Micah,
I've done my homework and tried to synthesize a (very simplified) structure
of a program that generates the blue-screen crash.
I suspect that the problem may come from the synchronisation scheme in Brook
(which is not really clear to me), as this might explain why the blue screen
bug shows up only after several successive kernel treatment loops...
Thanks for having a look and for any hints on how to approach a solution.
Kind regards
Jean-Claude
Here is the sketch :
in kernels.br file :
============
kernel kern_1( out float3 A<>, float3 B<> ) {...}
kernel kern_2( out float3 A<>, float3 B<> , float3 C<> ) {...}
kernel kern_3( out float3 A<>, float3 B<> ) {...}
kernel kern_4( out int3 A<>, int3 B<>, int3 C<> ) {...}
kernel kern_5( out float3 A<>, float3 B<>, float3 C<> ) {...}
in mainloop.cpp file :
===============
// declaration of working streams
// ------------------------------
unsigned int dims[2] = { 576,720 };
brook::Stream<float3> stream_f1(2,dims);
brook::Stream<float3> stream_f2(2,dims);
brook::Stream<float3> stream_f3(2,dims);
brook::Stream<int3> stream_SS(2,dims);
brook::Stream<float3> stream_FL(2,dims);
brook::Stream<float3> stream_FR(2,dims);
brook::Stream<int3> stream_FS(2,dims);
brook::Stream<float3> stream_BL(2,dims);
brook::Stream<float3> stream_BR(2,dims);
brook::Stream<int3> stream_BS(2,dims);
// Brook processing
// ----------------------
void Process( float *In, float *Out) {
stream_f1.read(In);
// Run kernels
kern_1( stream_f2, stream_f1 );
kern_2( stream_R, stream_FS, stream_f2 );
kern_3( stream_FL, stream_FR );
kern_4( stream_SS, stream_FS, stream_BS);
kern_5 (stream_f3, stream_FL, stream_BL);
stream_f3.write(Out);
}
// Main treatment
// --------------------
void treatment(void) {
float dataIn[576*720*3];
float dataOut[576*720*3];
bool keep_looping=true;
...
Get(dataIn); // grab one image
stream_BL.read(dataIn); // read it within float3 stream
... // additional processing
... // compute and set permanent streams stream_BR, stream_BL, stream_BS
...
while (keep_looping) { // runs at video rate
Get(dataIn); // grab one image
Process (dataIn,dataOut); // process it <== AFTER SEVERAL LOOPS, THIS PROCESS IS CALLED BUT DOESN'T RETURN
Display(dataOut); // show resulting image
}
}
.... AFTER A WHILE A MESSAGE SHOWS UP SAYING THAT THE GPU HAS BEEN NON RESPONSIVE AND WAS AUTOMATICALLY RECOVERED
.... LATER ON (COULD BE ONE MINUTE OR A DOZEN MINUTES) THE VISTA32 SYSTEM COLLAPSES INTO A BLUE SCREEN, and tells ATIKMDAG has misbehaved...
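Note on the sketch: each 576 x 720 float3 frame holds 576*720*3 = 1,244,160 floats, which is 4,976,640 bytes with 4-byte floats. If the real code declares dataIn and dataOut as local arrays the way the sketch does, treatment() needs close to 10 MB of stack, while the default stack reserve for a Visual C++ build is 1 MB. This is probably unrelated to the delayed blue screen, but it is worth ruling out. A minimal sketch, assuming the host buffers can simply live on the heap; FrameBuffers, frameBytes and bytesPerSecond are hypothetical names, not part of Brook+:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Frame geometry taken from the sketch: 576 x 720 elements, 3 floats each.
constexpr std::size_t kRows = 576;
constexpr std::size_t kCols = 720;
constexpr std::size_t kChannels = 3;
constexpr std::size_t kFrameFloats = kRows * kCols * kChannels;  // 1,244,160 floats

// Hypothetical helper (not a Brook+ type): owns both host buffers on the heap,
// so nothing large is placed on the thread stack.
struct FrameBuffers {
    std::vector<float> in;
    std::vector<float> out;
    FrameBuffers() : in(kFrameFloats), out(kFrameFloats) {}
};

// Size in bytes of one host-side frame.
constexpr std::size_t frameBytes() { return kFrameFloats * sizeof(float); }

// Host <-> GPU traffic in one direction for a given frame rate.
std::size_t bytesPerSecond(std::size_t framesPerSecond) {
    assert(framesPerSecond > 0 && "frame rate must be positive");
    return frameBytes() * framesPerSecond;
}
```

With such a struct, Process(buffers.in.data(), buffers.out.data()) would receive the same kind of float* pointers as in the sketch. At 20 frames per second, bytesPerSecond(20) comes to roughly 100 MB/s in each direction with 4-byte floats.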
Hi jean-claude,
Not sure if you have already done this. You need to disable both VPU Recover and the watchdog timer to run any stream application for a long time.
Hi Gaurav,
Thanks for your feedback.
Yes, I already changed my watchdog timer, as mentioned in one of
my former posts ( http://forums.amd.com/devforum/messageview.cfm?catid=328&threadid=100142&enterthread=y )
This helped to a certain extent, but doesn't fix the "sleeping bomb" phenomenon, i.e. crashing after a certain time.
I'm even wondering whether setting the watchdog timer to a large value could contribute to this delayed system bombing?
Having said that, I'll try to fully disable GPU recovery (i.e. setting the Vista registry value HKLM\System\CurrentControlSet\Control\GraphicsDrivers\TdrLevel to 0, which means no recovery)
Anyway, just some basic questions to clarify things: when the GPU is exercised through a processing loop involving several kernel treatments,
as for instance in :
for (int i = 1; i<=100; i++) {
kern_1(...);
kern_2(...);
kern_3(...);
}
(1) What is considered to be the total GPU treatment time (when does Vista trigger the watchdog timer)?
(2) Assume the kern_3 treatment is fully independent of kern_2: how is the overall synchronization performed? I.e. is there any waiting point to ensure that the 100 "loops" here do not collapse into one mega-stream treatment?
If so, shouldn't we have a wait statement somewhere? Such as
for (int i = 1; i<=100; i++) {
kern_1(...);
kern_2(...);
kern_3(...);
wait_completion (kern_2,kern_3);
}
Thanks for clarifying the matter.
Regards
Jean-Claude
Hi Jean-Claude,
GPUs can't run different kernels at the same time, so there is an implicit synchronization after each kernel. If the command queue were not flushed for a particular kernel, the previous kernel could still be running. But Brook+ does flush all the commands for each kernel call.
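One way to look at this from the host side is to time each loop iteration. If the timed section ends with a read-back of the results, the measured time covers the kernels that produced them, since the data cannot be copied out before those kernels have finished. Per-iteration times should then stay roughly constant, and a single very long iteration points to the stall. A minimal sketch in standard C++ only; runKernels is a hypothetical stand-in for the kern_1/kern_2/kern_3 sequence plus read-back, and timeIterations and firstOverBudget are illustrative names, not Brook+ calls:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Calls `runKernels` `iterations` times and returns each call's duration in milliseconds.
std::vector<double> timeIterations(const std::function<void()>& runKernels, int iterations) {
    assert(iterations >= 0 && "iteration count must not be negative");
    using Clock = std::chrono::steady_clock;
    std::vector<double> ms;
    ms.reserve(static_cast<std::size_t>(iterations));
    for (int i = 0; i < iterations; ++i) {
        const auto start = Clock::now();
        runKernels();  // stand-in for the real kernel sequence and read-back
        const auto stop = Clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return ms;
}

// Returns the index of the first iteration slower than `budgetMs`, or -1 if none is.
int firstOverBudget(const std::vector<double>& ms, double budgetMs) {
    for (std::size_t i = 0; i < ms.size(); ++i) {
        if (ms[i] > budgetMs) return static_cast<int>(i);
    }
    return -1;
}
```

At the 20 runs per second mentioned earlier in the thread, the budget per iteration is 50 ms, so firstOverBudget(times, 50.0) would give the first run that fell behind video rate.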
Thanks Gaurav,
This clarifies the matter a little bit.
Just to let you know, I fully disabled GPU recovery (changed the Vista registry TdrLevel to 0) and as a result:
I'm getting neither the "GPU recovery" message nor a blue screen (for the time being...) BUT the system stalls after some time (4 to 5 minutes) and needs to be rebooted by powering down....
Back to the possible cause:
It looks like after several successful calls to the GPU kernel processing loop, one call doesn't return...
So is there any way to get a status report from the GPU inside the processing loop???
BTW, I'm using DirectShow too, so could there be any kind of deadlock interaction here?
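Independently of what Brook+ itself can report, a call that never returns can at least be detected from the host by running it on a worker thread and waiting with a timeout. The following is only a sketch in standard C++; finishesWithin is a hypothetical helper, and `work` stands in for a call such as Process(In, Out). It detects the stall but cannot cancel a call that is stuck inside the driver.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <future>
#include <thread>
#include <utility>

// Runs `work` on a detached worker thread and reports whether it finished
// within `timeout`. If it did not, the worker keeps running in the background,
// so `work` must not reference objects that may be destroyed in the meantime.
bool finishesWithin(std::function<void()> work, std::chrono::milliseconds timeout) {
    assert(work && "work must be callable");
    std::packaged_task<void()> task(std::move(work));
    std::future<void> done = task.get_future();
    // A future obtained from a packaged_task does not block in its destructor,
    // so this function can return even if the worker is still stuck.
    std::thread(std::move(task)).detach();
    return done.wait_for(timeout) == std::future_status::ready;
}
```

In the main loop, a false result could be logged together with the loop counter before the watchdog or the blue screen hits, which would at least show which iteration stalled.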
Do you see this issue with only SDK 1.3 and Catalyst 8.12?
It would be a good idea to try experimenting with CAL 1.2.1 and Catalyst 8.10, keeping Brook+ version 1.3.
Hi again,
The same kind of issues did indeed exist under previous SDK and Catalyst versions (although, from time to time, the application did run a little bit longer...)
I've been waiting for SDK 1.3, hoping these issues would have been ironed out.
To be honest, it has taken me a while to convert the project to the new syntax (introduced in 1.3), among other things, so I'm not very keen to go back to the previous CAL and driver environment.
Especially as my ATI driver experience has always been a real pain, since every time I have to use the "Mobility Modder" utility to get the new drivers installed on the portable PC I have dedicated to the project...
If really needed, I'm ready to go back to a mixture of SDK 1.3 and previous CAL/Catalyst versions and work blindly.
Having said that, it would certainly be more productive to figure out what can stall the GPU while processing a loop of kernels.
Is there any way to query the GPU within the loop so that we get some form of dynamic status??
Many thanks for your involvement and support.
Jean-Claude
Ok, after a long conf and debug session this is where I'm at :
(1) The program works almost OK provided it runs in debug mode under Visual Studio 2008, step by step (i.e. one breakpoint in the middle of the kernel loop treatment and a finger pushing F5... for another loop execution)
This again suggests a real-time misbehaviour of the CAL drivers when dealing with a long loop of kernel executions.
(2) If the debug breakpoint is removed then ... boom! blue screen (same scenario as reported before)
(3) To clarify the system parameters, this is the Vista registry set-up:
HKLM\System\CurrentControlSet\Control\GraphicsDrivers
DxgKrnlVersion: Don't know what it is. My value is 0x1053
TdrLevel=1 – Bug check on detected timeout
TdrDelay= 60
TdrDdiDelay= 30
TdrDebugMode=0
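For reference, Microsoft documents four TdrLevel values for this key: 0 (TdrLevelOff), 1 (TdrLevelBugcheck), 2 (TdrLevelRecoverVGA) and 3 (TdrLevelRecover, the default). With TdrLevel=1 as listed here, a detected timeout would itself lead to a bug check (blue screen) rather than to a recovery. The lookup below is only an illustrative sketch; the enum and function names are hypothetical and nothing in it reads or writes the registry:

```cpp
#include <cassert>
#include <string>

// TdrLevel values as documented by Microsoft for the GraphicsDrivers key.
enum class TdrLevel : unsigned { Off = 0, Bugcheck = 1, RecoverVGA = 2, Recover = 3 };

// Hypothetical helper: human-readable meaning of a TdrLevel value.
std::string describe(TdrLevel level) {
    switch (level) {
        case TdrLevel::Off:        return "timeout detection disabled";
        case TdrLevel::Bugcheck:   return "bug check (blue screen) on detected timeout";
        case TdrLevel::RecoverVGA: return "recover to VGA";
        case TdrLevel::Recover:    return "recover on timeout (Windows default)";
    }
    return "unknown";
}

// Converts a raw registry DWORD into the enum; values above 3 are not documented.
TdrLevel fromRegistryValue(unsigned value) {
    assert(value <= 3 && "undocumented TdrLevel value");
    return static_cast<TdrLevel>(value);
}
```

For the set-up listed above, describe(fromRegistryValue(1)) gives the bug-check entry, which matches the "Bug check on detected timeout" note next to TdrLevel.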
Any support is appreciated, thanks !
BTW. Just tried experimenting with CAL 1.2.1 and Catalyst 8.10, keeping Brook+ 1.3 version.
Same problem, even worse: the blue screen comes sooner!
I really think there is a recurring problem with the CAL/driver implementation on Vista.