I have a workstation with three 7970 GHz Edition (Tahiti) GPUs on which self-written OpenCL simulations run. Simulation #1 ran for weeks without problems. Last night, while running simulation #2 (largely identical, but with several extensions), a Blue Screen of Death occurred. I restarted today, and after ~9 hours I noticed that both programs running concurrently on the first Tahiti had hung (at the same time). GPU-Z showed no GPU activity for that Tahiti, not even any memory allocated on the device, so it appears the OpenCL driver had completely lost its connection to the device. The other two Tahitis were running fine. The hang did not occur at the same calculation stage as the overnight BSOD.
The OS is Windows 7 Professional 64-bit. On each Tahiti, two simulations (exactly the same OpenCL code and host binary) were running concurrently. Each simulation uses several command queues to interleave data transfers and kernel invocations (all issued from a single host thread, though). The driver is Catalyst 13.9, deliberately: it worked in all the tests I conducted, so I have never tried upgrading to a newer version. Simulation #2 ran fine for weeks last year with nearly identical OpenCL code; however, the input was completely different (much larger, so that, for example, much more time passed between kernel invocations).
Any ideas what the most likely root cause is, i.e. a hardware defect, a driver problem, or possibly an issue in my program? Any specific hints for further testing?
If the driver might be the cause, is there a particularly stable OpenCL driver version that has been released since then? If I risk a driver upgrade, would it be possible to _completely_ roll back to 13.9, i.e. remove absolutely everything installed by the newer driver version I tried?
I cannot disclose the code here, sorry.