
chaosed0
Adept II

Detecting GPU over SSH

I would like to be able to detect a GPU on another computer over SSH. I know this question has been asked before, but I have tried everything suggested and I still can't get it to work.

What I do is: I remote in using the command "ssh -X COMPUTERNAME". Running gedit pops up a text editor window, so X forwarding is working fine. I do "sudo chmod uog+rw /dev/ati/card*" and "export COMPUTE=:0". Then I try to run clinfo, but only the CPU is shown. However, when I do fglrxinfo, it does detect the GPU. If I run clinfo sitting at the computer, it does show the GPU. Here's what I'm using:
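Roughly, the whole sequence from the client side looks like this (COMPUTERNAME is a placeholder, and the device path assumes the usual /dev/ati/card* location):

# What I run from the client machine; COMPUTERNAME is a placeholder.
ssh -X COMPUTERNAME
gedit &                             # window appears locally, so X forwarding works
sudo chmod uog+rw /dev/ati/card*    # open up the GPU device nodes
export COMPUTE=:0
clinfo                              # only lists the CPU
fglrxinfo                           # this one does report the FirePro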

OS: Ubuntu 12.04 (both host and client)

GPU: ATI FirePro V9800

SDK: AMD APP SDK v2.7

By the way, the end goal is to get these computers running as a cluster if anyone has info on that as well.

Thanks.

mfried
Adept II

Don't forget you also need to do "xhost +" from your "COMPUTERNAME" machine. Otherwise your ssh session won't be able to connect to the X server as specified with COMPUTE=(localhost):0. Alternatively, you can set "xhost +@localhost" or "xhost +user@localhost" to limit who is able to send popup X windows and use GPU resources on the "local" X session.
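For reference, the main access-control variants look something like this (a sketch; check xhost(1) on your distribution for the exact family syntax, and USER is a placeholder):

# Run these in the local X session on COMPUTERNAME, not from the SSH session.
xhost +                      # disable access control entirely (any host may connect)
xhost +local:                # allow any local, non-network client
xhost +si:localuser:USER     # allow only the named local user
xhost                        # no arguments: print the current access-control state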

You might want to take a look here -- it recommends several ways of integrating the xhost + into your X initialization depending on distribution.

http://developer.amd.com/sdks/AMDAPPSDK/assets/App_Note-Running_AMD_APP_Apps_Remotely.pdf

I've read through that document a couple of times. Unfortunately, it seems like it hasn't been updated in a while, because it refers to GDM in the Ubuntu section - Ubuntu now uses lightdm. Either way, I have tried running xhost + before I export COMPUTE, and it seems to have no effect.

Did OpenCL work from a hardware-accelerated X session on the box? If so, try doing "xhost +" there.

Remember that you need to run "xhost +" from _that_ session on your local X server. You can't simply run it from your SSH session into the box. Whatever display manager you have (you say it's lightdm), it probably has some initialization script that you can push an "xhost +" into, so that after a reboot the main session has access control disabled.

You need to export COMPUTE from your SSH session to tell the AMD OpenCL implementation in that session where to connect, and I've tried all the sane values - my OpenCL test cluster has nodes named master, node2, node3, node4, etc. (all on a local switch behind a firewall). COMPUTE=:0 or COMPUTE=:0.0 are the only values that work, regardless of what the X environment variable DISPLAY accepts. Setting COMPUTE=node2:0 or node2:0.0 from node2, for example, causes AMD's OpenCL implementation to hang. All the other things I've tried, such as node3:0 from node2, simply fail to connect / find any GPUs (even when DISPLAY=node3:0 will open an X window on that node due to "xhost +"). If you have unset DISPLAY, then COMPUTE doesn't need to be set to find the local GPU (not sure about multi-AMD-GPU systems); otherwise you need to set it.

I tried this from ssh as well as ssh -X (to tunnel X through master), and the behaviors are consistent. If you use DISPLAY to have an X window on your remote terminal, you need to set COMPUTE=:0 to access the GPU on the system you ssh into (for AMD -- for NVIDIA you can either have NVIDIA running in an X server _or_ use "nvidia-smi -pm 1" in a startup script as root to initialize and create a "persistent mode" GPU session).
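To summarize the combinations above, here is a rough sketch of a working remote session on one of these nodes (user and node2 are placeholders, and behavior may differ with other driver versions):

# Sketch of a remote AMD OpenCL session; assumes "xhost +" was already run
# in node2's console X session and the fglrx driver plus APP SDK are installed.
ssh user@node2            # plain ssh; add -X only if you also want X forwarding
export COMPUTE=:0         # :0 or :0.0 are the only values that work for me
clinfo                    # should now list the GPU alongside the CPU
# COMPUTE=node2:0 hangs and COMPUTE=node3:0 finds no GPUs; with DISPLAY unset,
# COMPUTE can often be left unset and the local GPU is still found.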

Let us know if this helps.

Thanks for your help so far.

Well here's something interesting. For the sake of this discussion, I'll call this computer A and the other computer B.

After following the steps you listed (doing xhost + on computer B, exporting COMPUTE), A was still not able to detect the GPU on B. However, I got on B, ssh'd to A, did "export COMPUTE=:0", and clinfo was able to see the GPU! I didn't execute any commands other than the export. Before I did that, clinfo would segfault.

I don't really know what to think now. I guess it's obviously something to do with either computer B's graphics settings or computer A's SSH/X settings. Is there anything you can come up with to diagnose this issue?


It sounds like your remote computer -- Computer B -- has a bad setup. Make sure that it works locally first:

Reboot B (this sometimes fixes problems)

Login on B's console X session

Run clinfo

If this fails, reboot, re-install the drivers, etc, reboot again, etc, until clinfo runs from a local X session.

Then make sure that your display manager has the appropriate bits to enable remote access to the GPU (the chmod stuff in your first post in this thread as well as an "xhost +"). You can confirm that access control is disabled by logging off of your X session on B, going to A and doing "export DISPLAY=B:0" followed by xhost to query the status of the session on B:0.
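A quick sketch of that check from A (this assumes B resolves on your network and that B's X server accepts TCP connections, i.e. it is not running with -nolisten tcp):

# Run on A while nobody is logged in to B's console X session.
export DISPLAY=B:0
xhost     # should print "access control disabled, clients can connect from any host"
          # if the display-manager script is doing its job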

As long as OpenCL still detects your GPU when you log in to B locally, try the following from A while logged into B and then again after logging out from your X session on B:

ssh username@B (no other parameters -- no -X for example)

export COMPUTE=:0 (not really necessary if logged into B from the same account)

Run clinfo

exit / logout

If that works, you can try adding -X and firing up an xterm on A from B through X11 SSH forwarding... If it segfaults there, you might want to try "unset DISPLAY" from within that xterm and then try again -- perhaps you have found some X11 OpenGL / OpenCL issue if that is the case.
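Putting those steps together, a minimal sketch of the whole test from A ("user" and "B" are placeholders):

# First pass: plain ssh, no X forwarding.
ssh user@B
export COMPUTE=:0        # redundant if the same account is logged in on B's console
clinfo                   # should list the FirePro as well as the CPU
exit

# Second pass: with X11 forwarding.
ssh -X user@B
xterm &                  # the xterm window appears on A through the SSH tunnel
# inside that xterm, if clinfo segfaults:
unset DISPLAY
export COMPUTE=:0
clinfo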

If this is still segfaulting, try "ls -la /dev/ati/card*". For comparison, I see something like this on 3 different nodes, each of which has one ATI card or AMD APU:

node2:~ # ls -la /dev/ati/card*

crw-rw-rw- 1 root root 251, 0 Jun 21 14:46 /dev/ati/card0


It should work without xhost + if you log in as the same user to the X session and over SSH.

Right. However, to operate headless, as you would want to do in a server rack in a datacenter, you can't require someone to physically log in on the console from a monitor whenever the system reboots. That's why you need to add "xhost +" to the Display Manager script -- to enable remote sessions to interact. I don't think you need to do a full "xhost +". In theory, you can get away with "xhost +@localhost" if you're worried about unauthenticated users from other X hosts attempting to sabotage your login screen, but the local login prompt shouldn't lose keyboard focus or get hidden under a window even if someone runs a rogue X app (because of defense in depth measures).
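One way to do that with lightdm is a display-setup hook; this is a sketch for the Ubuntu 12.04-era configuration, so the [SeatDefaults] section name and the script path are assumptions to verify on your system:

# Sketch: run "xhost +" (or a narrower variant) when lightdm starts the X server.
sudo tee /etc/lightdm/xhost.sh >/dev/null <<'EOF'
#!/bin/sh
xhost +local:    # or "xhost +" to disable access control entirely
EOF
sudo chmod +x /etc/lightdm/xhost.sh
# then, in /etc/lightdm/lightdm.conf under [SeatDefaults], add:
#   display-setup-script=/etc/lightdm/xhost.sh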

Generally, xhost + is a bad default for shared multi-user computers that people will log in on simultaneously. Imagine a clueless remote user doing "xterm &" with DISPLAY=:0 when they intended ssh -X, and having no idea where the window went... A malicious user on localhost might take advantage of something like that, or at least leave a little warning for them in their startup scripts.

NVIDIA's approach with their Linux stack for remote GPGPU consumption makes a bit more sense -- if you don't need an X server, you can initialize the CUDA / OpenCL stacks and create a GPU session by doing nvidia-smi -pm 1 (this works on GeForce cards as well as Tesla and Quadro). I do this on my OpenCL cluster nodes which have multiple devices so that only the AMD GPUs need to have the X server resources. One of my nodes, for example, pairs two Opteron 6274 CPUs with a Radeon HD 6970, GeForce GTX 560 Ti, and GeForce GTX 580 Ti (one free Gen 2.0 x16 slot is there for another Gen 2.0 x16 device like the Xeon Phi -- formerly called Knights Corner). Another node pairs two Xeon E5-2643 CPUs with a Radeon HD 7970 and a GeForce GTX 680 (and 2 more PCIe Gen 3.0 x16 slots are waiting for cards like Tesla K10 or K20 or FirePro W9000). Other nodes in this cluster have other devices (CPUs, APUs, GPUs).
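For the NVIDIA side, the persistence-mode setup mentioned above is a single command; running it at boot from rc.local is one option (the rc.local path is an assumption for this era of distributions):

# Sketch: initialize NVIDIA GPUs for headless CUDA/OpenCL use, no X server needed.
sudo nvidia-smi -pm 1        # enable persistence mode on all NVIDIA GPUs
# to make it stick across reboots, add before "exit 0" in /etc/rc.local:
#   nvidia-smi -pm 1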

One interesting side effect of having both NVIDIA and AMD OpenCL implementations in the same Linux system is that either one can break all OpenCL applications. In the NVIDIA case, if you remove all the NVIDIA GPU cards without uninstalling the driver, or at least removing/renaming the nvidia.icd from the OpenCL ICD path, the whole OpenCL stack fails initialization; in the AMD case, an improper COMPUTE= value can hang initialization.
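If you hit the NVIDIA case, a sketch of the rename workaround described above (the vendor ICD directory is commonly /etc/OpenCL/vendors, but treat that path as an assumption for your distribution):

# Sketch: park the NVIDIA ICD so the OpenCL loader skips it when no NVIDIA card is present.
ls /etc/OpenCL/vendors/                  # typically shows amdocl64.icd, nvidia.icd, ...
sudo mv /etc/OpenCL/vendors/nvidia.icd /etc/OpenCL/vendors/nvidia.icd.disabled
clinfo                                   # should now initialize with the AMD platform only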

pqxi68
Journeyman III

Hi,

I think it may have happened, but how should it be explained?

Thank you.


It should all turn out fine, because eventually the objective is to make the computers function properly as a cluster, provided the required information is available.

yurtesen
Miniboss

You should set COMPUTE=:0. Also, xhost +local: is probably enough and more secure.

For lightdm (from my own post in the Ubuntu forums: http://ubuntuforums.org/showthread.php?t=1946770 ):

You can solve the issue by giving a shell to the lightdm user, setting DISPLAY=:0 in its shell rc file, and running su - lightdm -c "xhost +local:" from rc.local. This is not quite elegant...

The reason is that the login prompt runs as the lightdm user. But it is not possible to run xhost as the lightdm user at boot without it having a shell, and it cannot find the display if the DISPLAY variable is not set...

You can test this without rebooting: give a shell to the lightdm user, do su - lightdm, then run (assuming bash) export DISPLAY=:0 followed by xhost +local: ... After that, log in as a normal user and check clinfo... (of course the local user should have COMPUTE or DISPLAY set to :0 as well...)
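A rough sketch of that test sequence (assuming bash for the lightdm user; adjust if your setup differs):

# Sketch of the test described above.
sudo chsh -s /bin/bash lightdm      # give the lightdm user a shell
sudo su - lightdm
export DISPLAY=:0                   # point xhost at the console X server
xhost +local:                       # allow local (non-network) clients
exit
# now, as a normal user (locally or over ssh):
export COMPUTE=:0
clinfo                              # the GPU should show up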

For setting COMPUTE or DISPLAY at login, create a file in /etc/profile.d, let's say named compute.sh, and put the following in... (it might be that COMPUTE alone is enough)

#!/bin/sh

export COMPUTE=:0

export DISPLAY=:0

export GPU_MAX_ALLOC_PERCENT=100

If you use DISPLAY=:0 you can't run remote X programs. Normally COMPUTE should be enough, but I think I had some problems without DISPLAY which I can't remember now.

Thanks,

Evren

PS. I tested this on Ubuntu 11.x, but it should work on 12.x also.

chaosed0
Adept II

I'm going to put this issue on hold for now, since nothing I've tried fixes it and I have other things I need to work on before this becomes relevant. If I come back to this and still have issues, I'll start up another thread. Thanks for all your help, though; I'm grateful.
