FirePro V7900 - Driver gets stuck when trying to do OpenCL computation
I'm having trouble getting four FirePro V7900 cards to work under GNU/Linux. My workgroup and I acquired these four cards for a research project, mainly for their OpenCL computing capability. We have tested the cards in several nodes and always get the same result: the cards (or the driver) hang as soon as we try to do anything with them, such as starting the X server or running the "clinfo" command.
The current test bed is a node within a computing cluster (dual Xeon E5-2660 CPUs, 64 GB RAM), using the latest fglrx driver (revision 9.003.3) and AMD APP SDK v2.8. It is a Debian 6.0 box running the stable 3.2.0 kernel on the x86_64 architecture. The cards were previously tested in a different machine (a stand-alone high-end Intel Core i7 PC with a quad PCI-Express motherboard), and the outcome was the same as described here.
Here are several useful logs and outputs:
- Generic uname information:
# uname -a
Linux verode18 3.2.0-0.bpo.3-amd64 #1 SMP Sun Feb 25 22:41:30 UTC 2013 x86_64 GNU/Linux
- lspci sees the cards (currently only 3 of the 4 cards are connected):
Everything looks fine up to this point, but this is where the problems start. For instance, the "clinfo" command (which lists all OpenCL-capable devices found on the machine) hangs the computer, leaving behind a process that consumes 100% CPU and turns out to be impossible to kill. This looks like a clear symptom of kernel-level trouble, such as the driver getting stuck in an I/O deadlock or something similar.
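For anyone who wants to look closer at the stuck process, here is a minimal sketch of how its state could be checked. A process that ignores SIGKILL is usually either in uninterruptible sleep (state D) or running inside the kernel (state R), so the state letter is a useful first hint. The helper name `proc_state` is my own and not part of any AMD tool, and the PID 1234 in the comments is only a placeholder.

```shell
#!/bin/sh
# proc_state is a hypothetical helper name, not part of any AMD tool.
# It prints the one-letter scheduler state of a PID, read from
# /proc/<PID>/status (R = running, S = sleeping, D = uninterruptible
# sleep, Z = zombie).
proc_state() {
    awk '/^State:/ { print $2 }' "/proc/$1/status"
}

# Usage (replace 1234 with the PID of the stuck clinfo process):
#   proc_state 1234
#   cat /proc/1234/wchan; echo     # kernel function the task is waiting in, if any
#   cat /proc/1234/stack           # kernel stack trace (root only, if the kernel exposes it)
#   dmesg | grep -i fglrx | tail   # most recent kernel messages mentioning the driver
```

If the state is D or the stack trace ends inside the fglrx module, that would support the idea that the hang is on the driver side rather than in clinfo itself.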
I have attached the output of this command to this post (it is too long to paste here):
As stated before, "kill -9 <PID>" has no effect. By the way, I've observed that the driver creates the device nodes under /dev/ati dynamically as needed, but I have also tried creating them manually after a reboot by running the following command:
# mkdir /dev/ati; for i in `lspci | grep VGA | grep ATI | wc -l`; do mknod -m 666 /dev/ati/card$i c 250 $i; done
It makes no difference, though.
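Looking at that one-liner again, the backticks expand to a single number (the card count), so the loop body runs only once instead of once per card. Below is a sketch of a variant that walks over every index. It assumes the nodes are numbered starting from 0 and reuses the major number 250 from the command above; I have not verified either against the driver. The function name `gen_mknod_cmds` is just for illustration, and it only prints the commands so that they can be reviewed before being run as root.

```shell
#!/bin/sh
# gen_mknod_cmds is a hypothetical helper: it prints one mknod command per
# card index instead of running them, so the list can be checked first.
# Assumptions: nodes are numbered from 0, and the character-device major
# number is 250, as in the one-liner above.
gen_mknod_cmds() {
    count=$1
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "mknod -m 666 /dev/ati/card$i c 250 $i"
        i=$((i + 1))
    done
}

# Usage: count the cards the same way as before, then review the output.
#   gen_mknod_cmds "$(lspci | grep VGA | grep -c ATI)"
```

With three cards connected this prints commands for card0, card1 and card2, which can then be pasted into a root shell after creating /dev/ati.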
Any idea what could be causing this odd behaviour? Everything points to a low-level problem, either in the driver or in the hardware (although it is hard to believe that all 4 cards are faulty in the same way). As a matter of fact, we have 2 V9800 cards in another node of the same computing cluster, and we got them to work very easily by following the same steps we are taking with these ones. I can provide logs from both machines for comparison if you find it worthwhile. Suggestions are greatly appreciated.