My institute has invested in a 'small' GPU cluster based on Radeon GPUs, in hopes of doing serious OpenCL computation. I have installed both Ubuntu 13.04 and Ubuntu 12.10, and both show the same faulty behavior: the default adapter gets lost almost instantly.
In my first encounter with the issue, I ran watch -n 0.5 'aticonfig --adapter=ALL --odgc' and it worked fine; however, when I launched Firefox, it failed with "Maximum number of clients reached". When I closed it, everything worked fine again. After much fiddling with the configuration (both HW and SW), I got to a point where all 4 cards were visible after boot, both in aticonfig and in lspci. Then I ran clinfo, which already complained about the maximum number of clients being reached, and after that I practically lost my default adapter. The terminal output is attached as a file.
It would be nice if someone could shed some light on what might be going on, because we invested quite a lot in these machines and right now I have no idea what the cause could be. 2 cards worked fine, 4 do not, and I highly doubt it is power related: there are two 1600 W PSUs inside, aggregated to work practically as an auto-redundant pair, so even one of the PSUs alone must be able to power the machine, let alone two.
The computer is an ASUS ESC4000 G2/FDR with 4x HD 7970; the OS is Ubuntu 12.10 64-bit and the driver is Catalyst 13.4. Any suggestions?
Thank you, nou, for the quick reply. I have posted in one of the topics you linked, but sadly no answer has arrived yet.
I really don't like bumping my own topics, but the issue is rather urgent, and I find it quite funny (to say the least) that AMD's flagship card doesn't work properly under Linux — not out of the box, and not with tweaking either. Using my cards for OpenCL alone is simply not what I am looking for.
I looked into whether I could set ServerFlags in xorg.conf, but it seems the -noreset option has no corresponding server flag (which is rather odd: why create a config file for an application and still leave options that can only be given on the command line?). I also looked at creating custom X sessions to see whether I could tweak something there. First I tried creating an .xinitrc together with an .xsession and having lightdm invoke it, but /etc/X11/Xsession does not accept parameters. Then I tried tweaking lightdm, which does the X init, by adding a few options in /etc/lightdm/lightdm.conf. xserver-command should do the job, but nothing actually happens if I add -noreset there.
Could someone help me get a graphical interface working with 4 cards, both when logging in at the machine itself and when working remotely? I'm starting to lose faith that this will work even if I can pass these options to X. I have read about what -noreset does, and other projects have problems with it too (although it is a feature), but it's really a joke that one deliberately steers clear of the new dual-GPU cards for Linux compute (sorry, but we had to wait roughly 8 months before the HD 5970 drivers were mature enough), picks the HD 7970, and it cannot even run a desktop. The desktop flickers, Compiz crashes all the time, and cards disappear from OpenCL (and from the driver itself). These are all related issues, but a high-end product shouldn't be released like this. Seriously, is this what we should expect in the future?
p.s.: I found no reference explaining the -ac flag's purpose. What does it do? I did all my investigations both with and without -ac, not just -noreset.
See the man page of X: http://www.x.org/archive/X11R7.5/doc/man/man1/Xserver.1.html — -ac disables host-based access control; -noreset prevents the server from resetting when the last client connection closes.
Look at /usr/share/doc/lightdm/lightdm.conf.gz; there is an xserver-command option.
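For reference, the file at that path is gzipped, so it can be searched without unpacking it. A minimal sketch, assuming the path above exists on your install:

```shell
# Search the packaged lightdm example config for the xserver-command option.
# The path is taken from the post above (Ubuntu 12.10/13.04 layout).
DOC=/usr/share/doc/lightdm/lightdm.conf.gz
if [ -f "$DOC" ]; then
    zcat "$DOC" | grep -n 'xserver-command'
    found=yes
else
    echo "lightdm docs not found at $DOC"
    found=no
fi
```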
Thank you, nou, for the help. My /etc/lightdm/lightdm.conf file looks like this:
xserver-command=X -ac -noreset
The only line I added is the xserver-command one; the rest was there by default. The same things happen as before: just staring at the desktop after login and waiting 10-20 seconds is enough to see black lines flicker across the background, indicating an X state reset. Running clinfo a few times confirms that nothing has changed. Any more tips on where the actual X invocation occurs? This setting certainly had no effect.
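One way to make the "clinfo a few times" check less manual is to poll the device count in a loop and log only when it changes. A minimal sketch, assuming `clinfo` is installed; in this thread a drop in the count coincided with the X server resetting:

```shell
# Poll the number of OpenCL devices clinfo reports and log whenever it
# changes; with a missing or reset adapter the count drops. If clinfo is
# not installed, the count simply reads as 0.
prev=-1
for i in 1 2 3; do
    count=$(clinfo 2>/dev/null | grep -c 'Device Name' || true)
    if [ "$count" != "$prev" ]; then
        echo "devices visible: $count"
        prev="$count"
    fi
    sleep 1
done
```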
I added xserver-command as you did, and ps aux | grep X reveals that the X server is indeed running with the -noreset option. So it is possible that the server runs with the parameter but it doesn't have the desired effect.
Thanks again for your time. Indeed it is running with -noreset. In that case, the warning claiming this is no solution for use cases outside GPGPU did not refer to the way X is started (without tunneling or a GUI login) but to the fact that OpenGL (or some other component) still messes things up.
I would highly appreciate it if someone with deeper knowledge of all the driver's bleeding wounds could tell me how to get an interop-capable environment running, locally at first. (Remoting graphics can introduce whole new sets of problems, especially with interop.) Or, worst case, issue a hotfix beta driver in the near future.
p.s.: I really do appreciate that the drivers are finally documented not just in terms of graphics improvements in games but also in terms of OpenCL runtime changes; this is cool stuff and we have been waiting for it. But there is that last section called "Known Issues"... I would be most glad if problems like this one (which were definitely well known when the HD 7000 series was released) were documented there, because naive customers like me expect behavior similar to prior products. It is no shame to document something that doesn't work yet; it is much worse to leave a detected issue undocumented. (I certainly hope bugs like this, where a whole series of products cannot do graphics and compute together, do not go unnoticed during testing.)
p.s.2: forgive my manners, but this week got the better of me, as can be seen in the Qt5.1 topic as well.
" I would be most glad if stuff like this (that were definitely well known when releasing the HD7000 series) were documented there,"
I agree; you are not the only one upset. The box for these cards also advertises support for XP, but there is no OpenCL for XP, so one could sue them for false advertising. Or just return all your cards and get some Titans? I think they have better Linux support.
NV does have better SW support (funny how the years cannot change the "AMD's got the HW, NV's got the SW" adage), but NV barely supports OpenCL at all. We know from unofficial inside information that two months ago NV had no intention of supporting OpenCL 1.2. If there is not even ongoing work on 1.2, SPIR and all the other goodness, I simply cannot vote for buying NV (not to mention its compute inferiority relative to price) when most people at our institute are shifting toward OpenCL, even those who used CUDA before.
So yes, it's a shame there is no good choice here: either buy HW with absolutely no guarantee that the SW side will evolve, or buy HW that is claimed to support things but in fact takes weeks (or months) before real support arrives. That is our choice.
The reason I chose AMD is that I see bigger potential in the HW and bigger momentum on the SW side (Bolt, OpenCL, the video decode and encode APIs...). However, it comes at the price of fighting for things that really should work out of the box.
Let me follow up on this topic. The 13.6 beta Catalyst, which claims to support Ubuntu 13.04, does indeed render the desktop correctly, and apps, including interop apps, work correctly. However, I cannot get multi-GPU (and multi-interop) working. After installing the driver I issued:
aticonfig --initial --adapter=ALL -f
and I added
xserver-command=/usr/bin/X -ac -noreset
to the lightdm config. When I ssh into the machine without any X forwarding, "ps aux | grep X" reports:
root 1971 0.0 0.3 657516 106924 tty7 Ss+ Jun10 0:42 /usr/bin/X -ac -noreset -core -auth /var/run/lightdm/root/:0 -nolisten tcp vt7 -novtswitch -background none
aticonfig --list-adapters returns all my GPUs, and so does clinfo. However, when I try to stress test all the GPUs, namely run the SDK samples with -d 0 through -d 3, only the default adapter heats up when I check with "aticonfig --adapter=ALL --odgt", as if only the default adapter were processing kernels. Naturally I have to export DISPLAY=:0 in the console where I "watch -n 0.5" the temperatures; in the other terminal, where I run the app, it does not matter whether 1) I leave my environment at its default, 2) I export COMPUTE=:0, or 3) I export DISPLAY=:0. Every time only the default device heats up (builds GPU load and throttles clocks), so I strongly suspect it is the only device doing work.
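To make the remote-run setup above concrete, here is a minimal sketch of the environment in the ssh session where the samples are run. DISPLAY is the standard X variable; COMPUTE is the AMD-specific variable mentioned above that the fglrx OpenCL runtime checks when running headless. The sample name and the echo stand-in are hypothetical:

```shell
# Attach the ssh session to the X server that lightdm started locally.
export DISPLAY=:0
export COMPUTE=:0

# Then run an SDK sample against each device index in turn; the binary
# name here is a placeholder for whichever sample you stress with.
for d in 0 1 2 3; do
    echo "would run: ./BinarySearch -d $d"
done
```

In a working setup, each `-d` index should show up as load and rising temperatures under `aticonfig --adapter=ALL --odgt`.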
Am I doing something wrong or is this a limitation of the 13.6 Beta1 driver?
Edit: one minor correction: multi-GPU and multi-interop do work as long as I use them from the desktop while logged in. The default adapter taking all the work only happens when running apps remotely.
Thank you for the reply. Even answers like "yes, we read it, but nothing is happening" are welcome.
I certainly hope there is a chance that either some setup-related fix is provided or, if the driver is at fault, that the final 13.6 fixes the issue, because if it doesn't we'll have to wait at least another month before the first beta with a probable fix is issued.
Since 13.6 should be out by now, I suspect both the SDK and Catalyst are delayed by the imminent OpenCL spec update. My guess is that the SDK will be released shortly after the spec, accompanied by a driver that supports it, and that the next WHQL release will be 13.8, which is quite far away.
Some hints from AMD would be appreciated, because there's not much I can do beyond knowing all the legacy pains of the driver. This seems like a relatively new issue, originating either from the OS or from the driver-HW relation.
I take it there is no fix. My Ubuntu PC is next to useless; the beta version of the driver causes a kernel panic.
It seems that any app with 3D support causes loads of these messages to appear in dmesg:
<3>[fglrx:firegl_apl_loadDatabase] *ERROR* APL: apl initialize fail.
And then I can't open any more apps:
Maximum number of clients reached
I have found a workaround for now: I open lots of copies of the 'kate' text editor (or any other basic app, I guess) as soon as X starts. When an app later fails to start with "Maximum number of clients reached", I close one of the copies of kate, which frees up a client slot in X.
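The workaround above can be scripted at session start. A hypothetical sketch: pre-open N throwaway X clients so that closing one of them later frees a slot for a real app; any lightweight X client would do in place of kate, and the `--new` flag is kate's option for forcing a fresh instance:

```shell
# Pre-open placeholder X clients right after X starts; close one of them
# whenever a real app fails with "Maximum number of clients reached".
N=10
opened=0
for i in $(seq 1 "$N"); do
    kate --new >/dev/null 2>&1 &   # errors ignored if kate is absent
    opened=$((opened + 1))
done
echo "launched $opened placeholder clients"
```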