6 Replies Latest reply on Feb 3, 2015 7:37 AM by jtrudeau

    fglrx: NBody with GPU0 and high -x (num of particles) causes X to die/GPU@100%fan/power off required

    arwnz

      Hello all,

       

      Hardware is HD 5970. aticonfig --list-adapters:

       

      * 0. 06:00.0 AMD Radeon HD 5900 Series

        1. 07:00.0 AMD Radeon HD 5900 Series

       

      * - Default adapter

       

      Three monitors are connected to adapter 0 (VGA and DVI 1920x1200 monitors and a 1920x1080 monitor via an active DisplayPort adapter). Tearfree desktop is enabled.

       

      The Debian fglrx-driver package is version 1:14.12-1, which packages upstream release 14.12 (2014-12-09) (14.501.1003).

       

      I can run NBody on adapter 1 with a large number of particles:

      cd ~/AMDAPPSDK-2.9-1/samples/opencl/bin/x86_64/

      ./NBody -d 1 -x 100000

      [N-body simulation - 99968 Particles, 4.63 FPS]

       

      On adapter 0, even 20000 particles cause X to crash and the GPU fan to roar at 100%, and a full power off/power on is required to restore system stability. A hard reset alone is insufficient (the fglrx module can't be inserted after rebooting). The card only returns to normal when the computer is fully powered off and powered on again.

       

      This suggests there is a buffer overflow in the OpenCL virtual machine, where an OpenCL program can write to (GPU) memory that does not belong to it. This is a security issue wherever a user can execute OpenCL code, since at a minimum it is a denial-of-service bug.

       

      If you want to test whether you are affected, remember to first close all open applications and sync the hard disk. Be prepared to lose your desktop and to have to reboot your machine. X may not come up after a reboot. Don't panic, and remember that I had to power off the computer and turn it on again before the GPU was in a normal state.

       

      I would like to know whether the issue affects others but I do NOT take responsibility for potential hardware damage. There may be a reason the fan locks at 100%. Higher than normal power consumption may overstress borderline hardware (e.g. cause a weak power supply to fail). Ideally AMD will acknowledge the issue so others don't need to test for it.

        • Re: fglrx: NBody with GPU0 and high -x (num of particles) causes X to die/GPU@100%fan/power off required
          arwnz

          The problem appears to be cumulative, so there may be a memory leak in the driver. At first I can run ./NBody -d 0 -x 17400 without issue [17280 particles - 59.95 FPS]. If I stop and start NBody a few times, the NBody window goes black. Eventually X dies and all screens go blank. The keyboard does not respond, so a soft reset is not possible. In this case the GPU fan does not roar at 100%.

           

          ./NBody -d 0 -x 18050 causes the extreme symptoms: a roaring GPU fan at 100% that does not reset itself after pushing the hard reset button. After Debian boots and I run startx, I get this message:

           

          modprobe: ERROR: could not insert 'fglrx': No such device.

           

          The GPU is only recognised (and fan speed returns to normal) after powering off and powering on the system.

           

          The three monitors with triple buffering could be consuming 105MB or more of GPU memory, depending on the monitor layout [e.g. 3840 x 2280 x 4 bytes-per-pixel x 3 buffers = 105,062,400 bytes]. GPU0 has 1024MiB of RAM, so about 10% of GPU0's RAM is reserved by the system. This isn't sufficient to explain why GPU0 cannot even display one fifth as many particles as GPU1.
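
          The framebuffer estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the post: the 3840 x 2280 surface (presumably two 1920-wide monitors side by side with the third below), 4 bytes per pixel and three buffers are the post's own assumptions, and the variable names are made up for illustration.

```python
# Rough estimate of scan-out buffer memory for the three-monitor desktop.
# Assumptions taken from the post: one 3840 x 2280 surface,
# 4 bytes per pixel, triple buffering, 1024 MiB of VRAM on GPU0.
WIDTH, HEIGHT = 3840, 2280
BYTES_PER_PIXEL = 4
BUFFERS = 3
VRAM_BYTES = 1024 * 1024 * 1024

framebuffer_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL * BUFFERS
fraction = framebuffer_bytes / VRAM_BYTES

print(f"{framebuffer_bytes:,} bytes")   # total for all three buffers
print(f"{fraction:.1%} of VRAM")        # share of GPU0's memory
```

          The result is a little under 10% of VRAM, which matches the figure quoted above.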

            • Re: fglrx: NBody with GPU0 and high -x (num of particles) causes X to die/GPU@100%fan/power off required
              arwnz

              I recently noticed errors of this type in the kernel log:

               

              Xorg: page allocation failure: order:5, mode:0x2040d0

              CPU: 0 PID: 1072 Comm: Xorg Tainted: P           O  3.16.0-4-amd64 #1 Debian 3.16.7-ckt2-1

              Hardware name:  EVGA  122-CK-NF68/122-CK-NF68, BIOS 6.00 PG 09/28/2007

              ...

              Call Trace:

              [<ffffffff81507263>] ? dump_stack+0x41/0x51

              [<ffffffff811401df>] ? warn_alloc_failed+0xdf/0x130

              [<ffffffff81156d22>] ? next_online_pgdat+0x22/0x50

              [<ffffffff81142c28>] ? drain_pages+0x28/0xa0

              [<ffffffff811444ba>] ? __alloc_pages_nodemask+0x8ca/0xb30

              [<ffffffff8118a36b>] ? kmem_getpages+0x5b/0x110

              [<ffffffff8118b92e>] ? fallback_alloc+0x15e/0x210

              [<ffffffffa0323b53>] ? drm_alloc+0xc3/0x1a0 [fglrx]

              [<ffffffff8118d492>] ? __kmalloc+0x1f2/0x4c0

              [<ffffffffa0323b53>] ? drm_alloc+0xc3/0x1a0 [fglrx]

              [<ffffffffa0323b53>] ? drm_alloc+0xc3/0x1a0 [fglrx]

              [<ffffffffa03341cb>] ? firegl_adl_escape+0x8b/0x190 [fglrx]

              [<ffffffff811632ff>] ? tlb_finish_mmu+0xf/0x40

              [<ffffffffa0334140>] ? _r6x_init_hw_ctx+0xd0/0xd0 [fglrx]

              [<ffffffffa032bd98>] ? firegl_ioctl+0x1f8/0x260 [fglrx]

              [<ffffffffa031a16a>] ? ip_firegl_unlocked_ioctl+0xa/0x10 [fglrx]

              [<ffffffff811b7d2f>] ? do_vfs_ioctl+0x2cf/0x4b0

              [<ffffffff8116cce9>] ? do_munmap+0x299/0x3a0

              [<ffffffff811b7f91>] ? SyS_ioctl+0x81/0xa0

              [<ffffffff8150d32d>] ? system_call_fast_compare_end+0x10/0x15

              Mem-Info:

              Node 0 DMA per-cpu:

              CPU    0: hi:    0, btch:   1 usd:   0

              CPU    1: hi:    0, btch:   1 usd:   0

              CPU    2: hi:    0, btch:   1 usd:   0

              CPU    3: hi:    0, btch:   1 usd:   0

              Node 0 DMA32 per-cpu:

              CPU    0: hi:  186, btch:  31 usd:   0

              CPU    1: hi:  186, btch:  31 usd:   0

              CPU    2: hi:  186, btch:  31 usd:   0

              CPU    3: hi:  186, btch:  31 usd:   0

              Node 0 Normal per-cpu:

              CPU    0: hi:  186, btch:  31 usd:   0

              CPU    1: hi:  186, btch:  31 usd:   0

              CPU    2: hi:  186, btch:  31 usd:   0

              CPU    3: hi:  186, btch:  31 usd:   0

              active_anon:308859 inactive_anon:60279 isolated_anon:0

              active_file:425071 inactive_file:756671 isolated_file:0

              unevictable:12 dirty:0 writeback:0 unstable:0

              free:187160 slab_reclaimable:184627 slab_unreclaimable:8895

              mapped:138460 shmem:14967 pagetables:6251 bounce:0

              free_cma:0

               

              That is, X is unable to allocate memory pages and fglrx appears to be involved.
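
              To put the numbers in the log into perspective, here is a small Python sketch. It assumes 4 KiB base pages (the usual size on x86-64 Linux) and that the Mem-Info counters are page counts; the helper names are hypothetical. An order:5 request asks for 2^5 physically contiguous pages, while the free counter reports the total number of free pages.

```python
# Interpret "order:5" and "free:187160" from the kernel log excerpt.
# Assumption: 4 KiB base pages, the usual size on x86-64 Linux.
PAGE_SIZE = 4096


def order_to_bytes(order: int, page_size: int = PAGE_SIZE) -> int:
    """Size of a request for 2**order physically contiguous pages."""
    return (1 << order) * page_size


def pages_to_mib(pages: int, page_size: int = PAGE_SIZE) -> float:
    """Convert a page count from Mem-Info into MiB."""
    return pages * page_size / (1024 * 1024)


request_bytes = order_to_bytes(5)   # the failed allocation
free_mib = pages_to_mib(187160)     # "free:" counter in Mem-Info

print(request_bytes // 1024, "KiB requested as one contiguous block")
print(round(free_mib), "MiB free in total")
```

              Under these assumptions the failed request was for 128 KiB of contiguous memory while several hundred MiB were free overall. That would be consistent with fragmentation rather than outright memory exhaustion, although the log excerpt alone does not prove it.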

               

              However I have discovered an extraordinary workaround.

               

              It's early days, but I have yet to see a page allocation failure after the workaround. Furthermore, I can now allocate 100,000 particles on GPU0:

              ~/AMDAPPSDK-2.9-1/samples/opencl/bin/x86_64$ ./NBody -d 0 -x 100000

               

              X becomes much less responsive as 99,968 particles move at 2.55 FPS. But it works!

               

              GPU1 runs the same program at 4.64 FPS. This 82% improvement in performance is an example of the benefit of not needing to run X and OpenCL on the same GPU.
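
              For reference, the 82% figure is the relative difference between the two frame rates quoted in this thread. A minimal Python check, with illustrative variable names:

```python
# Frame rates reported in this thread for ~100,000 particles.
fps_gpu0 = 2.55  # GPU0, which also drives X and three monitors
fps_gpu1 = 4.64  # GPU1, which runs only the OpenCL workload

# Relative improvement of GPU1 over GPU0.
improvement = (fps_gpu1 - fps_gpu0) / fps_gpu0
print(f"{improvement:.0%}")
```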

               

              Now for the workaround. My original monitor configuration was: "Three monitors are connected to adapter 0 (VGA and DVI 1920x1200 monitors and a 1920x1080 monitor via an active DisplayPort adapter)."

               

              After the workaround the three monitors are connected to adapter 0 in this configuration:

              1920x1200 via VGA (no change)

              1920x1080 via DVI->HDMI (instead of DisplayPort->HDMI)

              1920x1200 via an active DisplayPort adapter->DVI (instead of DVI->DVI)

               

              In technical terms, I swapped the monitors. This is an utterly bizarre workaround but I can think of two potential reasons swapping the monitors avoided the bug:

               

              (a) The HDMI 1920x1080 monitor connected via DisplayPort supported 8, 10 and 12 bpc output and defaulted to 10 bpc. Now that a DVI monitor is on the DisplayPort adapter, the connection only supports 8 bpc. There may be an error or memory leak in 30-bit colour support (even though I had tried selecting 8 bpc).

               

              (b) Audio is now output via HDMI. This is a well exercised code path. Audio via DisplayPort on HD 5000 series hardware under Linux will be rare and is only supported by fglrx (not the radeon free software driver).

               

              The radeon driver may support OpenCL in the future. At this time its OpenCL support is a work in progress (see http://www.x.org/wiki/RadeonFeature/, under Compute (OpenCL)).
