0 Replies Latest reply on Feb 24, 2013 8:48 PM by asamarin

    FirePro V7900 - Driver gets stuck when trying to do OpenCL computation

    asamarin

      Hi everyone,

       

      I'm having trouble getting four FirePro V7900 cards to work under GNU/Linux. As part of a research project, my workgroup and I acquired these four cards for their OpenCL computing capability. We have tested the cards in several nodes and always get the same result: the cards (or the driver) hang as soon as we try to use them, for example when starting the X server or running the "clinfo" command.

       

      The current test bed is a node within a computing cluster (dual Xeon E5-2660 CPUs, 64 GB RAM), using the latest fglrx driver (rev number 9.003.3) and AMD APP SDK v2.8. It's a Debian 6.0 box running the stable 3.2.0 kernel on the x86_64 architecture. The cards were previously tested in a different machine (a stand-alone high-end Intel Core i7 PC whose motherboard supports quad PCI-Express) and the outcome was the same as presented here.

       

      Here are several useful logs and outputs:

       

      - Generic uname information:

       

      # uname -a
      Linux verode18 3.2.0-0.bpo.3-amd64 #1 SMP Sun Feb 25 22:41:30 UTC 2013 x86_64 GNU/Linux
      

       

      - lspci sees the cards (currently only 3 of the 4 cards are connected):

       

      # lspci -v | grep -Pzo "(?s)^[[:xdigit:]:.]+ VGA.*?\n\n"
      04:00.0 VGA compatible controller: ATI Technologies Inc Device 6704 (prog-if 00 [VGA controller])
              Subsystem: ATI Technologies Inc Device 0b00
              Flags: bus master, fast devsel, latency 0, IRQ 40
              Memory at b0000000 (64-bit, prefetchable) [size=256M]
              Memory at d9fc0000 (64-bit, non-prefetchable) [size=128K]
              I/O ports at ec00 [size=256]
              Expansion ROM at d9000000 [disabled] [size=128K]
              Capabilities: [50] Power Management version 3
              Capabilities: [58] Express Legacy Endpoint, MSI 00
              Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
              Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
              Capabilities: [150] Advanced Error Reporting
              Kernel driver in use: fglrx_pci
      
      05:00.0 VGA compatible controller: ATI Technologies Inc Device 6704 (prog-if 00 [VGA controller])
              Subsystem: ATI Technologies Inc Device 0b00
              Flags: bus master, fast devsel, latency 0, IRQ 48
              Memory at c0000000 (64-bit, prefetchable) [size=256M]
              Memory at dbfc0000 (64-bit, non-prefetchable) [size=128K]
              I/O ports at dc00 [size=256]
              Expansion ROM at db000000 [disabled] [size=128K]
              Capabilities: [50] Power Management version 3
              Capabilities: [58] Express Legacy Endpoint, MSI 00
              Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
              Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
              Capabilities: [150] Advanced Error Reporting
              Kernel driver in use: fglrx_pci
      
      42:00.0 VGA compatible controller: ATI Technologies Inc Device 6704 (prog-if 00 [VGA controller])
              Subsystem: ATI Technologies Inc Device 0b00
              Flags: bus master, fast devsel, latency 0, IRQ 72
              Memory at 90000000 (64-bit, prefetchable) [size=256M]
              Memory at acfc0000 (64-bit, non-prefetchable) [size=128K]
              I/O ports at 7c00 [size=256]
              Expansion ROM at ac000000 [disabled] [size=128K]
              Capabilities: [50] Power Management version 3
              Capabilities: [58] Express Legacy Endpoint, MSI 00
              Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
              Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
              Capabilities: [150] Advanced Error Reporting
              Kernel driver in use: fglrx_pci
      
      

       

      - fglrx module is correctly loaded into the kernel:

       

      # lsmod | grep fglrx
      fglrx                4608069  0 
      button                 12895  1 fglrx
      

       

      - Kernel output regarding fglrx also seems ok:

       

      # dmesg | grep fglrx
      [    5.270174] fglrx: module license 'Proprietary. (C) 2002 - ATI Technologies, Starnberg, GERMANY' taints kernel.
      [    5.270185] fglrx: module license 'Proprietary. (C) 2002 - ATI Technologies, Starnberg, GERMANY' taints kernel.
      [    5.270195] fglrx: module license 'Proprietary. (C) 2002 - ATI Technologies, Starnberg, GERMANY' taints kernel.
      [    5.411599] [fglrx] Maximum main memory to use for locked dma buffers: 63290 MBytes.
      [    5.413845] [fglrx]   vendor: 1002 device: 6704 count: 1
      [    5.413851] [fglrx]   vendor: 1002 device: 6704 count: 2
      [    5.413860] [fglrx]   vendor: 1002 device: 6704 count: 3
      [    5.416246] [fglrx] ioport: bar 4, base 0xec00, size: 0x100
      [    5.416379] [fglrx] ioport: bar 4, base 0xdc00, size: 0x100
      [    5.416438] [fglrx] ioport: bar 4, base 0x7c00, size: 0x100
      [    5.416694] [fglrx] Kernel PAT support is enabled
      [    5.416750] [fglrx] module loaded - fglrx 9.0.2 [Nov 20 2012] with 3 minors
      

       

      - aticonfig agrees, too:

       

      # aticonfig --lsa
      * 0. 04:00.0 AMD FirePro V7900 (FireGL V)
        1. 05:00.0 AMD FirePro V7900 (FireGL V)
        2. 42:00.0 AMD FirePro V7900 (FireGL V)
      
      
      * - Default adapter
      

       

      After executing aticonfig --initial --adapter=all, this is what /etc/X11/xorg.conf looks like:

       

      Section "ServerLayout"
              Identifier     "aticonfig Layout"
              Screen      0  "aticonfig-Screen[0]-0" 0 0
              Screen         "aticonfig-Screen[1]-0" RightOf "aticonfig-Screen[0]-0"
              Screen         "aticonfig-Screen[2]-0" RightOf "aticonfig-Screen[1]-0"
      EndSection
      
      
      Section "Module"
      EndSection
      
      
      Section "Monitor"
              Identifier   "aticonfig-Monitor[0]-0"
              Option      "VendorName" "ATI Proprietary Driver"
              Option      "ModelName" "Generic Autodetecting Monitor"
              Option      "DPMS" "true"
      EndSection
      
      
      Section "Monitor"
              Identifier   "aticonfig-Monitor[1]-0"
              Option      "VendorName" "ATI Proprietary Driver"
              Option      "ModelName" "Generic Autodetecting Monitor"
              Option      "DPMS" "true"
      EndSection
      
      
      Section "Monitor"
              Identifier   "aticonfig-Monitor[2]-0"
              Option      "VendorName" "ATI Proprietary Driver"
              Option      "ModelName" "Generic Autodetecting Monitor"
              Option      "DPMS" "true"
      EndSection
      
      
      Section "Device"
              Identifier  "aticonfig-Device[0]-0"
              Driver      "fglrx"
              BusID       "PCI:4:0:0"
      EndSection
      
      
      Section "Device"
              Identifier  "aticonfig-Device[1]-0"
              Driver      "fglrx"
              BusID       "PCI:5:0:0"
      EndSection
      
      
      Section "Device"
              Identifier  "aticonfig-Device[2]-0"
              Driver      "fglrx"
              BusID       "PCI:66:0:0"
      EndSection
      
      
      Section "Screen"
              Identifier "aticonfig-Screen[0]-0"
              Device     "aticonfig-Device[0]-0"
              Monitor    "aticonfig-Monitor[0]-0"
              DefaultDepth     24
              SubSection "Display"
                      Viewport   0 0
                      Depth     24
              EndSubSection
      EndSection
      
      
      Section "Screen"
              Identifier "aticonfig-Screen[1]-0"
              Device     "aticonfig-Device[1]-0"
              Monitor    "aticonfig-Monitor[1]-0"
              DefaultDepth     24
              SubSection "Display"
                      Viewport   0 0
                      Depth     24
              EndSubSection
      EndSection
      
      
      Section "Screen"
              Identifier "aticonfig-Screen[2]-0"
              Device     "aticonfig-Device[2]-0"
              Monitor    "aticonfig-Monitor[2]-0"
              DefaultDepth     24
              SubSection "Display"
                      Viewport   0 0
                      Depth     24
              EndSubSection
      EndSection
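
      Note that lspci prints PCI addresses in hexadecimal while the BusID entries in xorg.conf are decimal, which is why the card at 42:00.0 appears as "PCI:66:0:0" in the file. A minimal sketch of that conversion, in case anyone wants to double-check the generated file; "lspci_to_busid" is a hypothetical helper name, not part of aticonfig or any driver tool:

```shell
# Convert an lspci-style address (hex "bus:device.function") into the
# decimal "PCI:bus:device:function" form used by BusID in xorg.conf.
# lspci_to_busid is a hypothetical helper name used for illustration only.
lspci_to_busid() {
    addr=$1
    bus=${addr%%:*}     # text before the first ":"   e.g. "42"
    rest=${addr#*:}     # text after the first ":"    e.g. "00.0"
    dev=${rest%%.*}     # text before the "."         e.g. "00"
    fn=${rest#*.}       # text after the "."          e.g. "0"
    # printf parses the "0x" prefix as a hexadecimal constant
    printf 'PCI:%d:%d:%d\n' "0x$bus" "0x$dev" "0x$fn"
}

lspci_to_busid 42:00.0   # prints PCI:66:0:0
```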
      

       

      Everything looks fine up to this point. Now the problems start. For instance, the "clinfo" command (which lists all OpenCL-capable devices found on the machine) hangs, leaving a process that consumes 100% CPU and turns out to be impossible to kill. This looks like a clear symptom of a kernel-level problem, such as the driver getting stuck in an I/O deadlock or something similar.

       

      I have attached the output of this command to this post (it is too long to paste here):

       

      # strace clinfo 2>&1

       

      --EDITED-- Somehow I can't attach files, see it here: http://pastebin.com/pqHJFkty

       

      As you can see, the log stops at the precise moment of opening the card at /dev/ati/card0; that's when the process gets stuck indefinitely. Here is an example of what "top" says about this:

       

        PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
      16510 root      20   0 54636 9028 6732 R  100  0.0  3:03.45 clinfo
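
      For anyone trying to reproduce this, here is a minimal sketch of how a stuck process could be inspected further through /proc, assuming a Linux system with /proc mounted. "inspect_stuck" is a hypothetical helper name; /proc/<pid>/stack is normally readable only by root and only exists on kernels built with stack tracing support, so that part is an assumption about the machine:

```shell
# Print the scheduler state and, where available, the kernel-side location
# of a process. inspect_stuck is a hypothetical helper name.
inspect_stuck() {
    pid=$1
    if [ ! -r "/proc/$pid/status" ]; then
        echo "no readable /proc entry for PID $pid" >&2
        return 1
    fi
    # State codes include R (running), S (interruptible sleep),
    # D (uninterruptible sleep) and Z (zombie).
    grep '^State:' "/proc/$pid/status"
    # Kernel function the task is blocked in, if the kernel exposes it.
    if [ -r "/proc/$pid/wchan" ]; then
        printf 'wchan: %s\n' "$(cat "/proc/$pid/wchan" 2>/dev/null)"
    fi
    # Kernel stack trace (assumption: run as root, stack tracing enabled).
    cat "/proc/$pid/stack" 2>/dev/null || echo "(kernel stack not readable)"
}

# Example: inspect the current shell. For the hang described here, the
# argument would be the PID of clinfo instead.
if [ -d "/proc/$$" ]; then
    inspect_stuck "$$"
fi
```

      A process that sits in state R at 100% CPU and ignores SIGKILL is most likely spinning inside kernel code, and the kernel stack, when it is readable, would show where.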
      


      As stated before, "kill -9 <PID>" has no effect. By the way, I've observed that those device nodes are created dynamically by the driver as needed, but I've also tried to create them manually after a reboot by running the following command:

       

      # mkdir -p /dev/ati; n=$(lspci | grep VGA | grep -c ATI); for i in $(seq 0 $((n - 1))); do mknod -m 666 /dev/ati/card$i c 250 $i; done

       

      It makes no difference later, though.
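
      The major number 250 in that command is hard-coded. A minimal sketch of looking it up from /proc/devices instead; "major_for" is a hypothetical helper name, and the device name "fglrx" is an assumption about how the driver registers its character device, so it should be checked against /proc/devices on the actual machine:

```shell
# Print the major number registered under a given name in /proc/devices.
# major_for is a hypothetical helper name. The optional second argument is
# another file in the same "<major> <name>" format, which makes it testable.
major_for() {
    name=$1
    file=${2:-/proc/devices}
    # Print the major number of the first matching entry, then stop.
    awk -v n="$name" '$2 == n { print $1; exit }' "$file"
}

# Example with "mem", a standard character device name on Linux.
if [ -r /proc/devices ]; then
    major_for mem
fi
# For the cards discussed here (assumed device name): major_for fglrx
```

      If the lookup prints nothing for the driver's device name, the driver has not registered a character device under that name, and nodes created by hand with mknod would point at nothing.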

       

      Any idea what could be causing this odd behaviour? Everything points to a low-level problem, be it a driver or a hardware issue (although it's hard to believe that 4 cards are faulty in the same way). As a matter of fact, we have 2 V9800 cards in another node of the same computing cluster, and we got them to work very easily following the same steps we are taking with these ones. I can provide logs from both machines for comparison's sake, if you find it worthwhile. Suggestions are greatly appreciated.