7 Replies Latest reply on Feb 24, 2011 8:07 PM by Otterz

    Detecting device resets



      Is it possible to detect whether the display driver has been reset? The scenario is that I am using Windows 7 with TDR enabled (and I do not want to rely on that setting being changed). In my app, I would like to know if Windows reset the device.

      What I am observing is that when a TDR occurs, the long-running batch of kernels gets killed and the device is reset. event.wait() then returns CL_SUCCESS for the killed batch, and my app merrily goes on submitting more NDRange kernels (with enqueue returning no error), with all subsequent event.wait() calls also returning no errors.

      But when I reach the point where I want to call queue.finish()/flush(), it hangs indefinitely.

      Looking at the API I am not seeing the correct way to cause the program to abort in the event that a batch of kernels gets killed.

      I have tried registering a callback when creating the context, but that callback does not seem to be called (or perhaps I coded it incorrectly):

      void CL_CALLBACK contextCallback(
          const char *errinfo,       // Pointer to an error string
          const void *private_info,  // Binary data .. not sure how to use it
          size_t cb,                 // amount of above data
          void *user_data)           // User supplied data???
      {
          std::cout << "Context callback called with error message:" << std::endl
                    << errinfo << std::endl;
          exit(EXIT_FAILURE);
      }

      cl::Context context(
          CL_DEVICE_TYPE_GPU,  // Create context for a GPU
          cprops,
          NULL,
          &contextCallback,
          &err);
      checkErr(err, "Context::Context()");

        • Detecting device resets


          In my experience, programs do not hang when a TDR timeout happens. Hangs are generally caused by some other error.

          Also, you will need to call clFinish after every kernel to find out which kernel exceeds the TDR limit. clEnqueueNDRangeKernel does not guarantee that the kernel has completed; it only enqueues it in the command queue.

          When a TDR happens, you get a system message saying that the driver was restarted. You can set the TDR value to suit your needs so that your kernel does not time out in normal execution.

            • Detecting device resets

              The program does not hang when the TDR happens; it hangs later, when queue.finish() is called.


              What bothers me is I can do this:


              enqueueNDRange( ..., &my_event)


              more OpenCL stuff



              And IF the above NDRange kernel causes a TDR reset, my_event.wait() still returns CL_COMPLETE, when in fact the kernel did NOT complete: it was killed by the OS.

              It is only when I call queue.finish() that the program hangs.
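One possible way to at least avoid blocking forever is to poll an event's execution status against a deadline instead of calling queue.finish() directly. The sketch below is a minimal illustration, not a confirmed fix: `wait_with_deadline` and `WaitResult` are hypothetical names, and `get_status` is assumed to wrap something like `event.getInfo<CL_EVENT_COMMAND_EXECUTION_STATUS>()` on the event of a marker enqueued at the end of the queue. Whether that status ever changes after a driver reset is not something this thread establishes, so a timeout here only means "something is wrong", not specifically "a TDR happened".

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Outcome of waiting on a command's execution status.
enum class WaitResult { Completed, Failed, TimedOut };

// Polls `get_status` until it reports completion, an error, or until
// `timeout` elapses. The status values follow the OpenCL convention:
// 0 is CL_COMPLETE, positive values mean queued/submitted/running, and
// negative values mean the command terminated abnormally.
// `get_status` is a hypothetical wrapper supplied by the caller.
WaitResult wait_with_deadline(
    const std::function<int()>& get_status,
    std::chrono::milliseconds timeout,
    std::chrono::milliseconds poll_interval = std::chrono::milliseconds(10)) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (;;) {
        const int status = get_status();
        if (status == 0) {
            return WaitResult::Completed;
        }
        if (status < 0) {
            return WaitResult::Failed;
        }
        if (std::chrono::steady_clock::now() >= deadline) {
            return WaitResult::TimedOut;  // caller can abort instead of hanging
        }
        std::this_thread::sleep_for(poll_interval);
    }
}
```

On `TimedOut` or `Failed` the app could log the condition and exit rather than sit in queue.finish() indefinitely. The timeout would need to be comfortably longer than the longest legitimate batch.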