Is there a way to disable (hide) one GPU on a dual GPU card in OpenCL?
I'm using AMD 295X2 (dual 290X) GPU cards for OpenCL under Ubuntu Linux 14.04 LTS. One GPU on one card is starting to fail and often hangs OpenCL (if I bang the card just right it works for a short while, so I suspect a BGA problem). I want to fully disable that GPU so I don't have to remove the card and lose two GPUs.
What I know so far:
1. I can't simply ignore (not use) that GPU, because it hangs clinfo or clGetPlatformIDs(...) when OpenCL starts; it must be disabled or somehow skipped.
2. The two GPUs on a 295X2 are completely separate devices appearing as 2 cards and 2 PCI devices with 2 headers.
3. I can disable the bad GPU in Linux at boot by creating a file in /etc/udev/rules.d that tells Linux to remove that PCI device. lspci then shows nothing for it; the device is gone.
4. Alas, even if I do disable it in Linux, clGetPlatformIDs() still sees it and hangs, so I assume OpenCL is rescanning the PCI bus?
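For point 3, a udev rule of roughly the following shape is one common way to remove a single PCI device as soon as the kernel adds it. This is only a sketch: the helper name pci_remove_rule and the 0000 domain prefix are assumptions for illustration, and the bus address should be checked with lspci -D before using it.

```shell
#!/bin/sh
# Hypothetical helper: print a udev rule that removes one PCI device when the
# kernel adds it. The address uses the domain:bus:device.function form found
# under /sys/bus/pci/devices.
pci_remove_rule() {
    addr=$1
    printf '%s\n' "ACTION==\"add\", SUBSYSTEM==\"pci\", KERNEL==\"$addr\", RUN+=\"/bin/sh -c 'echo 1 > /sys/bus/pci/devices/$addr/remove'\""
}

# Assumed address of the failing GPU; verify yours with: lspci -D
pci_remove_rule 0000:0b:00.0
```

The printed line would go into a file under /etc/udev/rules.d (the file name is up to you). As point 4 notes, removing the device this way does not by itself stop the OpenCL runtime from hanging.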
Searching for help I found some ways to disable AMD GPUs in OpenCL but they disabled all GPUs of the same type or all on a single card. Nothing could reference a single device.
Why there is hope:
1. After calling clGetPlatformIDs() one can easily reference or skip individual PCI devices.
2. The OverDrive (ODX) driver software (source code) does scan by PCI device address, could skip a device, and does not hang even when reading the GPU temperatures.
Any feedback is much appreciated.
The GPU_DEVICE_ORDINAL environment variable can be used to mask the visibility of the GPUs seen by the OpenCL application in a multi-GPU setup.
The AMD OpenCL Programming User Guide says:
Masking Visible Devices:
By default, OpenCL applications are exposed to all GPUs installed in the system; this allows applications to use multiple GPUs to run the compute task.
In some cases, the user might want to mask the visibility of the GPUs seen by the OpenCL application. One example is to dedicate one GPU for regular graphics operations and the other three (in a four-GPU system) for Compute. To do that, set the GPU_DEVICE_ORDINAL environment parameter, which is a comma-separated list variable:
• Under Windows: set GPU_DEVICE_ORDINAL=1,2,3
• Under Linux: export GPU_DEVICE_ORDINAL=1,2,3
Another example is a system with eight GPUs, where two distinct OpenCL applications are running at the same time. The administrator might want to set GPU_DEVICE_ORDINAL to 0,1,2,3 for the first application, and 4,5,6,7 for the second application; thus, partitioning the available GPUs so that both applications can run at the same time.
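To hide just one GPU, as the question asks, the list simply leaves out that GPU's ordinal. A minimal sketch, assuming the ordinals run from 0 to N-1; ordinals_except is a made-up helper (not part of any AMD tooling), and the GPU count and ordinal in the example are illustrative:

```shell
#!/bin/sh
# Hypothetical helper: print the ordinals 0..total-1 as a comma-separated
# list, leaving out the single ordinal given as the second argument.
ordinals_except() {
    total=$1
    skip=$2
    list=""
    i=0
    while [ "$i" -lt "$total" ]; do
        if [ "$i" -ne "$skip" ]; then
            # Append with a comma only when the list is already non-empty.
            list="${list:+$list,}$i"
        fi
        i=$((i + 1))
    done
    printf '%s\n' "$list"
}

# Example: six GPUs, hide ordinal 5 (illustrative values).
GPU_DEVICE_ORDINAL=$(ordinals_except 6 5)
export GPU_DEVICE_ORDINAL
echo "GPU_DEVICE_ORDINAL=$GPU_DEVICE_ORDINAL"
```

The variable has to be exported in the environment of the OpenCL application before it starts. Which ordinal corresponds to which physical GPU is something to confirm with clinfo on a healthy system.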
This works fine when all GPUs are healthy. OpenCL / clinfo see only the enumerated GPUs.
In my case, with a failing GPU, it does not help:
Case 1. With nothing else changed, the display still hangs just as X starts. The last line in Xorg.0.log shows fglrx enumerating the faulty GPU (no. 5), but the computer otherwise runs fine and I can log in from another machine.
Case 2. If I tell the kernel to 'remove' (ignore) GPU5's fglrx PCI device (PCI:0b:00.0), fglrx still enumerates GPU5 but X runs fine! However, in this case clinfo segfaults.
It may work if fglrx / xorg.conf also had a way to completely skip a PCI display device without killing it in Linux. I tried a few options including Option "Ignore" = "true" in the device section without success.
Since GPU_DEVICE_ORDINAL answers the original question, I marked this as the correct answer. If I find a workaround for the bad GPU I'll come back and post it here.
Many thanks for the answer; it's a useful option to know about.