arwnz
Adept I

fglrx: NBody with GPU0 and high -x (num of particles) causes X to die/GPU@100%fan/power off required

Hello all,

Hardware is HD 5970. aticonfig --list-adapters:

* 0. 06:00.0 AMD Radeon HD 5900 Series
  1. 07:00.0 AMD Radeon HD 5900 Series

* - Default adapter

Three monitors are connected to adapter 0 (VGA and DVI 1920x1200 monitors and a 1920x1080 monitor via an active DisplayPort adapter). Tearfree desktop is enabled.

The Debian fglrx-driver package is version 1:14.12-1, i.e. upstream release 14.12 (2014-12-09) (14.501.1003).

I can run NBody on adapter 1 with a large number of particles:

cd ~/AMDAPPSDK-2.9-1/samples/opencl/bin/x86_64/

./NBody -d 1 -x 100000

[N-body simulation - 99968 Particles, 4.63 FPS]

On adapter 0 even 20000 particles cause X to crash and the GPU fan to roar at 100%, and a full power off/power on is required to restore system stability. A hard reset alone is insufficient (the fglrx module can't be inserted after rebooting). The card only returns to normal when the computer is fully powered off and powered on again.

This suggests there is a buffer overflow in the OpenCL virtual machine where an OpenCL program can write to (GPU) memory that does not belong to it. Whatever the cause, this is a security issue wherever a user can execute OpenCL code, since at a minimum it is a denial-of-service bug.

If you want to test whether you are affected, first close all open applications and sync the hard disk. Be prepared to lose your desktop and have to reboot your machine, and be prepared that X may not come up after a reboot. Don't panic: as I said, I had to power off the computer and turn it on again before the GPU returned to a normal state.

I would like to know whether the issue affects others but I do NOT take responsibility for potential hardware damage. There may be a reason the fan locks at 100%. Higher than normal power consumption may overstress borderline hardware (e.g. cause a weak power supply to fail). Ideally AMD will acknowledge the issue so others don't need to test for it.

0 Likes
6 Replies
arwnz
Adept I

The problem appears cumulative, so there may be a memory leak in the driver. At first I can run ./NBody -d 0 -x 17400 without issue [17280 particles, 59.95 FPS]. If I stop and start NBody a few times, the NBody window goes black. Eventually X dies and all screens go blank. The keyboard doesn't respond, so I can't perform a soft reset. In this case the GPU fan does not roar at 100%.
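The particle counts NBody reports are slightly lower than the -x values requested (99968 for -x 100000 in the first post, 17280 for -x 17400 here). A minimal sketch of one possible explanation, assuming the sample rounds the request down to a whole number of work-groups. The helper name and the group size of 128 are assumptions, chosen because 128 reproduces both reported counts; they are not taken from the SDK source.

```shell
# Hypothetical helper: round a requested particle count down to a
# multiple of the work-group size (assumed here to be 128).
round_down_to_group() {
    requested=$1
    group_size=$2
    echo $(( requested / group_size * group_size ))
}

round_down_to_group 100000 128   # 99968
round_down_to_group 17400 128    # 17280
```

If this assumption holds, the particle count actually simulated is always a little below the -x value.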

./NBody -d 0 -x 18050 causes the extreme symptoms: the GPU fan roars at 100% and does not reset itself after I push the hard reset button. After Debian boots and I run startx, I get this message:

modprobe: ERROR: could not insert 'fglrx': No such device.

The GPU is only recognised (and fan speed returns to normal) after powering off and powering on the system.

The three monitors with triple buffering could be consuming 105MB or more of GPU memory, depending on the monitor layout [e.g. 3840 x 2280 x 4 bytes-per-pixel x 3 buffers = 105,062,400 bytes]. GPU0 has 1024MiB of RAM, so about 10% of GPU0's RAM is reserved by the system. This isn't sufficient to explain why GPU0 cannot handle even one fifth as many particles as GPU1.
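The estimate above can be reproduced with shell arithmetic. This is only a sketch: the function names are made up for illustration, and the 3840 x 2280 layout, 4 bytes per pixel and three buffers are the figures from the post, not values read from the driver.

```shell
# Bytes needed for the desktop framebuffers:
# width x height x bytes per pixel x number of buffers.
framebuffer_bytes() {
    echo $(( $1 * $2 * $3 * $4 ))
}

# Share of video memory (given in MiB) used, as a percentage rounded
# to one decimal place. Integer arithmetic only; needs 64-bit shell math.
percent_of_vram() {
    total=$(( $2 * 1024 * 1024 ))
    tenths=$(( ($1 * 1000 + total / 2) / total ))
    echo "$(( tenths / 10 )).$(( tenths % 10 ))"
}

bytes=$(framebuffer_bytes 3840 2280 4 3)
echo "$bytes bytes"                       # 105062400 bytes
echo "$(percent_of_vram "$bytes" 1024)%"  # 9.8%
```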

0 Likes

I recently noticed errors of this type in the kernel log:

Xorg: page allocation failure: order:5, mode:0x2040d0
CPU: 0 PID: 1072 Comm: Xorg Tainted: P           O  3.16.0-4-amd64 #1 Debian 3.16.7-ckt2-1
Hardware name:  EVGA  122-CK-NF68/122-CK-NF68, BIOS 6.00 PG 09/28/2007
...
Call Trace:
[<ffffffff81507263>] ? dump_stack+0x41/0x51
[<ffffffff811401df>] ? warn_alloc_failed+0xdf/0x130
[<ffffffff81156d22>] ? next_online_pgdat+0x22/0x50
[<ffffffff81142c28>] ? drain_pages+0x28/0xa0
[<ffffffff811444ba>] ? __alloc_pages_nodemask+0x8ca/0xb30
[<ffffffff8118a36b>] ? kmem_getpages+0x5b/0x110
[<ffffffff8118b92e>] ? fallback_alloc+0x15e/0x210
[<ffffffffa0323b53>] ? drm_alloc+0xc3/0x1a0 [fglrx]
[<ffffffff8118d492>] ? __kmalloc+0x1f2/0x4c0
[<ffffffffa0323b53>] ? drm_alloc+0xc3/0x1a0 [fglrx]
[<ffffffffa0323b53>] ? drm_alloc+0xc3/0x1a0 [fglrx]
[<ffffffffa03341cb>] ? firegl_adl_escape+0x8b/0x190 [fglrx]
[<ffffffff811632ff>] ? tlb_finish_mmu+0xf/0x40
[<ffffffffa0334140>] ? _r6x_init_hw_ctx+0xd0/0xd0 [fglrx]
[<ffffffffa032bd98>] ? firegl_ioctl+0x1f8/0x260 [fglrx]
[<ffffffffa031a16a>] ? ip_firegl_unlocked_ioctl+0xa/0x10 [fglrx]
[<ffffffff811b7d2f>] ? do_vfs_ioctl+0x2cf/0x4b0
[<ffffffff8116cce9>] ? do_munmap+0x299/0x3a0
[<ffffffff811b7f91>] ? SyS_ioctl+0x81/0xa0
[<ffffffff8150d32d>] ? system_call_fast_compare_end+0x10/0x15
Mem-Info:
Node 0 DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
CPU    1: hi:    0, btch:   1 usd:   0
CPU    2: hi:    0, btch:   1 usd:   0
CPU    3: hi:    0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU    0: hi:  186, btch:  31 usd:   0
CPU    1: hi:  186, btch:  31 usd:   0
CPU    2: hi:  186, btch:  31 usd:   0
CPU    3: hi:  186, btch:  31 usd:   0
Node 0 Normal per-cpu:
CPU    0: hi:  186, btch:  31 usd:   0
CPU    1: hi:  186, btch:  31 usd:   0
CPU    2: hi:  186, btch:  31 usd:   0
CPU    3: hi:  186, btch:  31 usd:   0
active_anon:308859 inactive_anon:60279 isolated_anon:0
active_file:425071 inactive_file:756671 isolated_file:0
unevictable:12 dirty:0 writeback:0 unstable:0
free:187160 slab_reclaimable:184627 slab_unreclaimable:8895
mapped:138460 shmem:14967 pagetables:6251 bounce:0
free_cma:0

That is, X is unable to allocate memory pages and fglrx appears to be involved.
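To put the numbers in the log in context: "order:5" is a request for 2^5 physically contiguous pages, and the "free:" counter in the Mem-Info dump is a page count. The sketch below converts both, assuming the usual 4 KiB page size on x86-64; the function names are illustrative only.

```shell
PAGE_SIZE=4096  # assumption: 4 KiB pages, the usual size on x86-64

# Size in bytes of one allocation of the given order (2^order contiguous pages).
order_to_bytes() {
    echo $(( (1 << $1) * PAGE_SIZE ))
}

# Convert a page count from the Mem-Info dump to whole MiB.
pages_to_mib() {
    echo $(( $1 * PAGE_SIZE / 1024 / 1024 ))
}

echo "order:5 request: $(order_to_bytes 5) bytes"   # 131072 bytes (128 KiB)
echo "free pages: $(pages_to_mib 187160) MiB"       # 731 MiB
```

If these figures are read correctly, a 128 KiB contiguous request failed while roughly 731 MiB was free in total, which would be consistent with memory fragmentation rather than outright exhaustion. That reading is a guess from the log, not a confirmed diagnosis.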

However I have discovered an extraordinary workaround.

It's early days, but I have yet to see a page allocation failure since applying the workaround. Furthermore, I can now allocate 100,000 particles on GPU0:

~/AMDAPPSDK-2.9-1/samples/opencl/bin/x86_64$ ./NBody -d 0 -x 100000

X becomes much less responsive as 99,968 particles move at 2.55 FPS. But it works!

GPU1 runs the same program at 4.64 FPS. This 82% improvement in performance is an example of the benefit of not needing to run X and OpenCL on the same GPU.
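The 82% figure follows directly from the two frame rates. A small sketch of the calculation (the function name is illustrative, and awk is assumed to be available):

```shell
# Relative improvement of the faster frame rate over the slower one,
# rounded to a whole percent.
fps_improvement_percent() {
    LC_ALL=C awk -v fast="$1" -v slow="$2" \
        'BEGIN { printf "%.0f\n", (fast / slow - 1) * 100 }'
}

fps_improvement_percent 4.64 2.55   # 82
```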

Now for the workaround. My original monitor configuration was: "Three monitors are connected to adapter 0 (VGA and DVI 1920x1200 monitors and a 1920x1080 monitor via an active DisplayPort adapter)."

After the workaround the three monitors are connected to adapter 0 in this configuration:

1920x1200 via VGA (no change)

1920x1080 via DVI->HDMI (instead of DisplayPort->HDMI)

1920x1200 via an active DisplayPort adapter->DVI (instead of DVI->DVI)

In technical terms, I swapped the monitors. This is an utterly bizarre workaround but I can think of two potential reasons swapping the monitors avoided the bug:

(a) The 1920x1080 HDMI monitor connected via DisplayPort supported 8, 10 and 12 bpc output and defaulted to 10 bpc. Now that the DisplayPort output drives a DVI monitor, the connection only supports 8 bpc. There may be an error or memory leak in the 30-bit colour support (even though I had tried selecting 8 bpc).

(b) Audio is now output via HDMI. This is a well exercised code path. Audio via DisplayPort on HD 5000 series hardware under Linux will be rare and is only supported by fglrx (not the radeon free software driver).

The radeon driver may support OpenCL in the future. At this time OpenCL support is a work in progress (http://www.x.org/wiki/RadeonFeature/, refer to Compute (OpenCL)).

My desktop has been unstable with the fglrx driver. There is an intermittent display freeze where the GPU will lock up (the non-graphics components of the computer continue to work). It can take days before X freezes. I doubt this will ever be resolved in a later release of the proprietary driver.

I think this blog post indicates how little has changed in three years:

AMD’s OpenCL heaven and hell | Wonderings of a SAT geek

Reluctantly I have replaced the GPU with a competing product.

0 Likes

You mentioned in your previous post that you found a workaround for the n-body issue. Is this display issue related to the same n-body application? If not, could you please be more explicit?

Regards,

0 Likes

Dipak, the graphics card is not stable as a primary desktop GPU.

After boot I executed this script:

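#!/bin/bash
# Listing the devices with clinfo initialises OpenCL on both adapters.
clinfo | grep 'Device Type:' &&
# Restrict the device nodes to root and the video group (read/write for both, nothing for others).
chown root:video /dev/ati/card[0-1] &&
chmod 660 /dev/ati/card[0-1] &&
# Show the resulting ownership and permissions.
ls -lah /dev/ati/*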

This initialised OpenCL on both devices and set the permissions to something more reasonable than world writable.
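To check whether a device node is still world writable before or after running such a script, something like the following could be used. It is a sketch only: the helper name is made up, and the example deliberately uses a throwaway temporary file rather than a real /dev/ati node.

```shell
# Succeeds if "others" have write permission on the given path.
is_world_writable() {
    [ -n "$(find "$1" -prune -perm -0002 2>/dev/null)" ]
}

# Example with a throwaway file instead of a real device node.
tmp=$(mktemp)
chmod 666 "$tmp"; is_world_writable "$tmp" && echo "world writable"
chmod 660 "$tmp"; is_world_writable "$tmp" || echo "restricted"
rm -f "$tmp"
```

On an affected system the same function could be pointed at /dev/ati/card0 and /dev/ati/card1.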

But the fglrx driver was not stable. X would eventually lock up even if I did not run an OpenCL program (to explicitly answer your question).

I do not believe AMD's proprietary (fglrx) driver is of sufficient quality to maintain a stable desktop with an HD 5970 and my particular hardware. I do not believe the driver will ever be of sufficient quality. Longer term, the open (radeon) driver is likely to surpass the proprietary driver and gain features such as OpenCL.

I didn't write my last reply seeking help. I was noting for future readers that I did not find a solution to system stability with the HD 5970 as my primary GPU. I worked around the immediate problems but then started to encounter 2D desktop performance issues and intermittent lockups.

0 Likes

Adam,

Understand you're ready to say "Done." I know you’re not asking for help. But it is in our DNA to want to fix problems. I specifically asked Dipak to take a peek at this again, because as you note, there was no solution. I don't like that!


If we see someone having trouble, we want to get to root cause, and if it’s something we can fix, we drive to get it fixed. Doesn't always happen for any number of reasons, but it's the way we are. I appreciate the information you've already provided. We won't bug you.

0 Likes