Otterz
Journeyman III

Detecting device resets

Hi,

Is it possible to detect whether the display driver has been reset? The scenario is that I am using Windows 7 with TDR (Timeout Detection and Recovery) enabled, and I do not want to rely on that setting being changed. In my app, I would like to know if Windows reset the device.

What I am observing is that if a TDR occurs, the long-running batch of kernels gets killed and the device is reset. event.wait() then returns CL_SUCCESS for the killed batch of kernels, and my app merrily goes on submitting more NDRange kernels (with enqueue returning no error), with the subsequent event.wait() calls all returning no errors.

But when I reach the point where I want to call queue.finish()/flush(), it hangs indefinitely.

Looking at the API I am not seeing the correct way to cause the program to abort in the event that a batch of kernels gets killed.

I have tried registering a callback when creating the context, but that callback does not seem to be called (or perhaps I coded it incorrectly):

void CL_CALLBACK contextCallback(const char *errinfo,      // Pointer to an error string
                                 const void *private_info, // Binary data .. not sure how to use it
                                 size_t cb,                // Size of the above data in bytes
                                 void *user_data)          // User-supplied data???
{
    std::cout << "Context callback called with error message:" << std::endl
              << errinfo << std::endl;
    exit(EXIT_FAILURE);
}

cl::Context context(CL_DEVICE_TYPE_GPU, // Create context for a GPU
                    cprops,
                    &contextCallback,   // Notification callback is the third argument
                    NULL,               // user_data handed to the callback is the fourth
                    &err);
checkErr(err, "Context::Context()");

0 Likes
7 Replies
himanshu_gautam
Grandmaster

otterz,

In my experience, programs do not hang when a TDR timeout happens. Hangs are generally caused by some other error.

Also, you will need to call clFinish after every kernel to find out which kernel exceeds the TDR limit. clEnqueueNDRangeKernel does not guarantee completion of the kernel; it only enqueues it in the command queue.
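
As a rough sketch of that advice, the host can time each blocking wait and flag any kernel whose wall-clock time reaches the TDR limit. The helper names timeBlockingCall and reachedTdrLimit are hypothetical, and the 2000 ms default is an assumption based on the stock Windows TdrDelay setting. In a real app the callable would wrap something like clFinish or event.wait().

```cpp
#include <cassert>
#include <chrono>
#include <utility>

// Hypothetical helper: runs a blocking call and returns its wall-clock duration.
// In an OpenCL app the callable would wrap clFinish / queue.finish() / event.wait().
template <typename Callable>
std::chrono::milliseconds timeBlockingCall(Callable&& call) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Callable>(call)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
}

// Hypothetical check: true if the measured time reached the assumed TDR limit.
// The 2000 ms default mirrors the stock Windows TdrDelay value (an assumption;
// pass the real limit if the registry setting was changed).
inline bool reachedTdrLimit(
    std::chrono::milliseconds elapsed,
    std::chrono::milliseconds tdr_limit = std::chrono::milliseconds(2000)) {
    assert(tdr_limit.count() > 0 && "TDR limit must be positive");
    return elapsed >= tdr_limit;
}
```

A kernel whose measured time reaches the limit is only a suspect: the timing says nothing about whether the driver was actually reset.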

When a TDR happens, you get a system message saying that the driver was restarted. You can set the TDR value to suit your needs so that your kernel does not time out in normal execution.

0 Likes

The program does not hang when the TDR happens; it hangs later, when queue.finish() is called.

 

What bothers me is I can do this:

 

enqueueNDRangeKernel(..., &my_event);

my_event.wait();

// more OpenCL work

...

queue.finish();

And if the above NDRange kernel causes a TDR reset, my_event.wait() still returns CL_COMPLETE, when in fact the kernel did NOT complete: it was killed by the OS.

It is only when I call queue.finish() that the app hangs.
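
One possible workaround, sketched here under assumptions rather than taken from the OpenCL API, is a host-side watchdog: run the blocking call on a separate thread and give up if it has not returned within a timeout. The helper name finishedWithin is hypothetical, and the sketch uses only the C++ standard library, so the callable stands in for queue.finish().

```cpp
#include <cassert>
#include <chrono>
#include <exception>
#include <functional>
#include <future>
#include <memory>
#include <thread>

// Hypothetical helper: runs blocking_call on a detached thread and reports
// whether it returned (or threw) within the timeout.
// The worker is detached on purpose: if the call never returns, the caller
// must not be forced to join it. Anything the callable references must
// therefore stay alive, or the process should exit right after a timeout.
bool finishedWithin(std::function<void()> blocking_call,
                    std::chrono::milliseconds timeout) {
    assert(blocking_call && "callable must not be empty");
    auto done = std::make_shared<std::promise<void>>();
    std::future<void> result = done->get_future();
    std::thread([done, blocking_call]() {
        try {
            blocking_call();
            done->set_value();
        } catch (...) {
            // A throwing call still counts as "returned" for the watchdog.
            done->set_exception(std::current_exception());
        }
    }).detach();
    return result.wait_for(timeout) == std::future_status::ready;
}
```

In the scenario above this might be called as finishedWithin([&] { queue.finish(); }, std::chrono::seconds(30)), with 30 seconds being an arbitrary choice, followed by an immediate exit when it returns false. It only turns the hang into a detectable timeout; it does not tell you whether a reset actually happened.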

 

 

0 Likes

To be more specific,

 

I am doing something like:

while( ... ){

enqueueNDRangeKernel( ..., &my_event);

my_event.wait();

}

stuff...

queue.finish();

 

And if an NDRange kernel gets killed by the OS, event.wait() still returns CL_COMPLETE.

0 Likes

bump, anyone?? I need to know how to cope with this case!

0 Likes

otterz,

Thanks for reporting it. I was able to reproduce it at my side.

0 Likes

Originally posted by: himanshu.gautam

otterz,

Thanks for reporting it. I was able to reproduce it at my side.

This behavior is not new in SDK 2.3; it has been there since the very first SDK releases.
It is strange to see it treated as "a new and only recently reproduced" problem by the support group... There is no proper way for a program to be informed of a driver failure/restart.
And this is really bad, especially when an app has to run unattended.
0 Likes

Thanks for all of the replies!

Thankfully the queue.finish() is hanging, meaning I never write results or pretend the app exited successfully. Hopefully this can be addressed in a later driver.

0 Likes