I have a workstation with three HD 7970 GHz Edition (Tahiti) cards on which self-written OpenCL simulations run. After weeks without problems for simulation #1, this night a Blue Screen of Death occurred for simulation #2 (most parts are identical, but #2 has several extensions). I restarted today, and after ~9 hours I noticed that both programs running concurrently on the first Tahiti had hung (at the same time). GPU-Z showed no GPU activity for that Tahiti, not even any memory allocated on the device, so it appears there was a complete OpenCL driver disconnection. The other two Tahitis were running fine. The hang did not occur at the same calculation stage as the overnight BSOD.
The OS is Windows 7 64-bit Professional; on each Tahiti, two simulations (exactly the same OpenCL code and host binary) were run concurrently. Each simulation uses several command queues for interleaved data transfers and kernel invocations (all issued from a single host thread, though). The driver is Catalyst 13.9, deliberately: that version worked for all tests conducted, so no upgrade to a newer driver version has ever been tried. Simulation #2 ran fine for weeks last year with nearly identical OpenCL code; however, the input was completely different, much larger, so e.g. much more time passed between kernel invocations.
Any ideas what is most likely the root of the problem, i.e. a hardware defect, a driver problem, or possibly an issue in my program that could cause this? Any specific hints for further testing?
In case the driver might be the cause: is there any information on a particularly stable OpenCL driver version released since? If I dare a driver upgrade, would it be possible to roll back _completely_ to 13.9, i.e. remove absolutely everything installed by a newer driver version I tried out?
I cannot disclose the code here, sorry.
From your post, it's very difficult to pinpoint the root cause of the above problem. However, from the problem overview, I would suggest checking simulation #2 first, especially the inputs and their consequences, and making sure the applications are doing the correct things. The reasons are: 1) both simulations were running fine earlier; 2) the problem occurred after modifying the inputs for simulation #2; 3) the same setup has been used for a long time without any error.
Another thing: please check whether you always get the problem on the same Tahiti card. If so, you may do some experiments with your setup.
Now, coming to your driver upgrade question: your driver is quite an old one, and there have been several releases since. You may try the latest driver from here: Download Drivers. I don't think there will be any issue reverting to an older version in case you find a problem with the latest driver.
Thanks for your input. Troubles continued today and so I can also report more:
A slightly modified input configuration for simulation #2 (exactly the same host binary and OpenCL code; only slightly different code branches are taken in one or two kernels due to different calculation specifications) now reproducibly causes rubbish results (e.g. negative values that are mathematically impossible and far exceed anything expected from numerical imprecision). I can reproduce exactly the same error sequence on at least two different Tahiti devices (installed in the same workstation), both when the program runs stand-alone on the device and when other programs run concurrently. The same program and input configuration runs perfectly fine on Nvidia hardware (a GeForce 540M) and as a plain multi-threaded C++ program (the C++ build uses the same kernel code: a macro enables a line with get_global_id(...) for OpenCL, while for C++ the ids are function parameters), producing so far (several hours of running) exactly the same output.
-) Compiling the program with "-cl-opt-disable" already causes a hang during program build. There is no explicit response any more; memory consumption just climbs slowly until at some stage all resources are consumed.
I suppose the best next step is to try another driver. Is there a version known to be particularly stable? From other recent posts in this forum (posted by timchist) I read that issues were observed with 14.12 but not with 14.9. Shall I maybe give 14.9 a try?
The code is OpenCL 1.1. Any specific need to tell the compiler so during program building?
As you mentioned, you are able to reproduce the problem quite regularly, would it be possible to share a reproducible test-case with us? It would help us to investigate the problem from our end.
Regarding the driver, I would also suggest checking the problem with other versions. Both drivers you mentioned, Catalyst 14.9 and 14.12, are stable versions. Catalyst 14.12 supports OpenCL 2.0, whereas 14.9 doesn't. At this point, it is difficult to say whether timchist's observations also apply to your case. Anyway, trying the latest driver once is not a bad idea, I guess.
During clBuildProgram, the -cl-std option can be used to control the OpenCL C language version to use. The spec also says:
"If the -cl-std build option is not specified, the highest OpenCL C 1.x language version supported by each device is used when compiling the program for each device. Applications are required to specify the -cl-std=CL2.0 option if they want to compile or build their programs with OpenCL C 2.0."
I am still investigating the issue. With all of 14.9, 14.12 and 15.4 Beta I observe the same behavior: building the program fails, the driver becomes completely unresponsive (no error message) and consumes tremendous amounts of memory (e.g. 8 GB!). I aborted the build process after 15 minutes or so; I may also try keeping it running overnight to see what happens.
I have also tested older drivers, and with 13.1 I had more success: the program builds fine and the run-time calculation results are correct (same as on Nvidia hardware / in C++), but the random hang-ups at runtime may still occur. These hang-ups are non-deterministic: sometimes the program finishes without any problem (the total simulation takes about 14 hours), sometimes they occur after a few hours, sometimes closer to the end (but running the full 14 hours is unfortunately the exception). I am running more test suites with configurations similar to last year's to see whether the hang-ups are then not triggered.
I am quite certain that at least two independent problems are involved: bad code generation / failed compilation for >= 13.9, and the runtime hang-up. With 13.9, building the program takes about 5 minutes (the kernels have several thousand lines of code) - I was used to that, but now I have become quite suspicious, since with 13.1 it takes a few seconds at most. I also noticed that 13.9 likewise temporarily consumes tremendous memory resources (e.g. 6 GB). I'll investigate in more detail which kernel screws things up here and let you know if I can come up with a manageable test case. Current summary:
13.1: program builds fine (within seconds), runtime kernel calculation results are correct
13.9: program builds (after 5 minutes), but runtime kernel calculation results are incorrect (deterministically & reproducibly) for at least two code branches
>= 14.9: program fails to build entirely
Now an important question:
I may use 13.1 on my main workstation (so far I have done all tests on a test machine only, but I need to produce production results soon!). How do I properly remove _everything_ that a Catalyst version has installed, so I can safely roll back to an older driver without mixing up different OpenCL toolchain versions? On my test machine I initially ran the Catalyst removal setup (following the official uninstallation guide: How-To Uninstall AMD Catalyst Drivers From A Windows Based System), but noticed that it leaves traces; e.g. amdocl.dll and opencl.dll remain in Windows\System32. Different Catalyst versions also place different files (DLLs or EXEs) in the system folders, none of which get removed upon uninstallation. A friend recommended the DDU tool (Wagnard Tools) for uninstalling, which I tried out, and frankly it did a much better job, apparently also properly cleaning up the system folders. What procedure does AMD recommend? Using this external DDU tool on a test machine is one thing, but a production workstation is a very different matter.