
fglrx: NBody with GPU0 and high -x (num of particles) causes X to die/GPU@100%fan/power off required

Question asked by arwnz on Jan 7, 2015
Latest reply on Feb 3, 2015 by jtrudeau

Hello all,

 

Hardware is an HD 5970. Output of aticonfig --list-adapters:

* 0. 06:00.0 AMD Radeon HD 5900 Series
  1. 07:00.0 AMD Radeon HD 5900 Series

* - Default adapter

 

Three monitors are connected to adapter 0 (VGA and DVI 1920x1200 monitors and a 1920x1080 monitor via an active DisplayPort adapter). Tearfree desktop is enabled.

 

The Debian fglrx-driver package is version 1:14.12-1, which packages upstream release 14.12 (2014-12-09, build 14.501.1003).

 

I can run NBody on adapter 1 with a large number of particles:

cd ~/AMDAPPSDK-2.9-1/samples/opencl/bin/x86_64/

./NBody -d 1 -x 100000

[N-body simulation - 99968 Particles, 4.63 FPS]

 

On adapter 0, even 20000 particles cause X to crash and the GPU fan to roar at 100%, and a full power off/power on is required to restore system stability. A hard reset alone is insufficient (the fglrx module can't be inserted after rebooting). The card only returns to normal when the computer is fully powered off and powered on again.

 

This suggests a buffer overflow in the OpenCL virtual machine, where an OpenCL program can write to (GPU) memory that does not belong to it. That makes it a security issue wherever a user can execute OpenCL code: at a minimum, it is a denial-of-service bug.

 

If you want to test whether you are affected, first close all open applications and sync the hard disk. Be prepared to lose your desktop and to have to reboot your machine. Be prepared that X may not come up after a reboot. Don't panic: as noted above, I had to power the computer off and turn it on again before the GPU was back in a normal state.
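
For anyone who does decide to test, the steps can be sketched as a small shell script. This is only a sketch under assumptions: the SDK path is the one from my machine, `-d 0` is assumed to select adapter 0 in the same way `-d 1` selected adapter 1 above, and the `DRY_RUN` switch and the function names are hypothetical helpers made up for this example. By default it only prints the command instead of running it.

```shell
#!/bin/sh
# Sketch of the reproduction steps. DRY_RUN, repro_cmd and run_repro are
# hypothetical names for this example; they are not part of the SDK.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the command (default), 0 = really run it
SAMPLE_DIR="$HOME/AMDAPPSDK-2.9-1/samples/opencl/bin/x86_64"   # path from this post

# Build the NBody command line: $1 = adapter index, $2 = number of particles.
repro_cmd() {
    printf './NBody -d %s -x %s' "$1" "$2"
}

# Flush disk buffers, then run NBody from the sample directory.
run_repro() {
    cmd=$(repro_cmd "$1" "$2")
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run in $SAMPLE_DIR: $cmd"
        return 0
    fi
    sync                        # write pending data to disk before risking a crash
    cd "$SAMPLE_DIR" && $cmd    # $cmd is left unquoted on purpose so it splits into words
}

# Adapter 0 with 20000 particles is the case that crashed X for me.
run_repro 0 20000
```

Running it with `DRY_RUN=0` performs the real test, with all the risks described in this post.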

 

I would like to know whether the issue affects others, but I do NOT take responsibility for potential hardware damage. There may be a reason the fan locks at 100%. Higher-than-normal power consumption may overstress borderline hardware (e.g. cause a weak power supply to fail). Ideally AMD will acknowledge the issue so others don't need to test for it.
