9 Replies Latest reply on Oct 18, 2013 5:55 AM by himanshu.gautam

    High interrupt rate on multigpu multi cpu system but not on multigpu single cpu system

    fredriko

      Hello!

      A program I developed in JavaCL runs well on my development system, a single-CPU 3.5 GHz i7 machine with 2 HD7990 cards in 2 x16 PCIe slots that clock down to x8 when both are populated. On the intended production system, a dual-CPU 2.4 GHz Xeon machine with 3 x16 PCIe slots and the same 2 HD7990 cards, interrupt handling takes about 30% of each core. The interrupts start when I start accessing stuff on the GPU cards, but they persist even when there is no load on the GPU cards. The GPUs are also noticeably slower on the Xeon machine.

      Has anyone else seen this? Do you have any ideas on how to debug it?

        • Re: High interrupt rate on multigpu multi cpu system but not on multigpu single cpu system
          himanshu.gautam

          This could be a BIOS or motherboard problem.

          Can you try with a single GPU?

          Is your board supplying adequate power?

          What is the interrupt rate on the i7 machine?

          How do you check the interrupt rate?

          Which OS are you running? Is it 32-bit or 64-bit?

          When you said "The interrupts start when I start accessing stuff...", are you talking about running OpenCL programs?

          So when you start running your programs, the interrupt rate rises and then stays at that level even when you are not running OpenCL programs.

          Is that what you mean?

           

          - Bruhaspati

            • Re: High interrupt rate on multigpu multi cpu system but not on multigpu single cpu system
              fredriko

              I tried with one card just now. The interrupt rate is halved, but it is still about 15% on all 8 cores.

              Power should be sufficient: we ran with as many as 3 cards and that worked. We have a 1200 W PSU, and the card itself is a workstation card that should be OK with GPUs.

              On the i7 machine the interrupt rate is negligible. I check it with top on 64-bit Linux (Ubuntu 13.04) by looking at how much %CPU the ksoftirqd processes use.

              The interrupts start when I make the first OpenCL calls, before I start running calculations on the GPUs, and they continue until I shut down the program, even after the calculations are done. It feels as if OpenCL is using some kind of callback from the GPUs via interrupts, at a rate so high that it consumes a lot of CPU time. But why do I not see it on the i7 machine?
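
              The per-core measurement described above can also be scripted. The following is a minimal sketch, not the method used in this post: instead of reading ksoftirqd usage from top, it samples the per-core counters in /proc/stat twice and reports the share of time spent in hard-IRQ and softirq context. The function names and the one-second interval are illustrative assumptions.

```python
import os
import time

STAT_PATH = "/proc/stat"  # standard Linux location of the CPU time counters


def parse_cpu_times(text):
    """Return {cpu_name: [counter values]} for per-core lines (cpu0, cpu1, ...)."""
    times = {}
    for line in text.splitlines():
        fields = line.split()
        # Per-core lines look like "cpu0 ..."; skip the aggregate "cpu" line.
        if fields and fields[0].startswith("cpu") and fields[0] != "cpu":
            times[fields[0]] = [int(v) for v in fields[1:]]
    return times


def interrupt_share(before, after):
    """Percent of elapsed time per core spent in hard-IRQ and softirq context."""
    shares = {}
    for cpu, new in after.items():
        old = before.get(cpu)
        if old is None:
            continue
        delta = [n - o for n, o in zip(new, old)]
        # Field order: user nice system idle iowait irq softirq steal guest ...
        # Guest time is already counted in user/nice, so sum only the first 8.
        total = sum(delta[:8])
        if total <= 0:
            continue
        shares[cpu] = (100.0 * delta[5] / total, 100.0 * delta[6] / total)
    return shares


if __name__ == "__main__":
    if os.path.exists(STAT_PATH):
        with open(STAT_PATH) as f:
            first = parse_cpu_times(f.read())
        time.sleep(1.0)  # sampling interval (an arbitrary choice)
        with open(STAT_PATH) as f:
            second = parse_cpu_times(f.read())
        result = interrupt_share(first, second)
        for cpu in sorted(result, key=lambda name: int(name[3:])):
            irq, softirq = result[cpu]
            print(f"{cpu}: irq {irq:.1f}%  softirq {softirq:.1f}%")
    else:
        print(f"{STAT_PATH} not found; this sketch only works on Linux")
```

              Running it once while the program is idle and once while it is busy, on both machines, gives numbers that are easier to compare than a top screenshot.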

                • Re: High interrupt rate on multigpu multi cpu system but not on multigpu single cpu system
                  himanshu.gautam

                  Can you quickly check "dmesg" for any abnormal messages from the "fglrx" driver?

                  Also, what motherboard are you using?
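
                  A minimal sketch of such a check, under two assumptions: that `dmesg` is readable by the current user, and that abnormal driver messages mention both "fglrx" and an `*ERROR*` marker. The function name `fglrx_errors` is hypothetical.

```python
import re
import subprocess

# Matches kernel log lines of the form "[  123.456789] message".
LINE_RE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s*(?P<msg>.*)$")


def fglrx_errors(log_text):
    """Return (timestamp_seconds, message) for fglrx lines that contain *ERROR*."""
    hits = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        msg = m.group("msg")
        # Assumption: driver errors mention "fglrx" and are tagged "*ERROR*".
        if "fglrx" in msg and "*ERROR*" in msg:
            hits.append((float(m.group("ts")), msg))
    return hits


if __name__ == "__main__":
    try:
        log = subprocess.run(
            ["dmesg"], capture_output=True, text=True, check=True
        ).stdout
    except (OSError, subprocess.CalledProcessError) as exc:
        # dmesg may be missing or restricted to root on some systems.
        print(f"could not read the kernel log: {exc}")
        log = ""
    for ts, msg in fglrx_errors(log):
        print(f"{ts:12.6f}  {msg}")
```

                  The timestamps are seconds since boot, so they can be lined up against the times the OpenCL program was started and stopped.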

                    • Re: High interrupt rate on multigpu multi cpu system but not on multigpu single cpu system
                      fredriko

                      I have done a bit more debugging. Watching /proc/interrupts shows that all IRQs go to core 0 on the Xeon machine, whereas on the i7 machine they are spread evenly across all cores.
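
                      The per-core distribution described above can be summarized with a short script. This is a minimal sketch, assuming the usual /proc/interrupts layout (a header row of CPU names, then one row per IRQ) and assuming that the driver's rows contain the string "fglrx"; both function names are hypothetical.

```python
import os

INTERRUPTS_PATH = "/proc/interrupts"  # standard Linux location


def parse_interrupts(text):
    """Return (cpu_names, {irq_label: (per_cpu_counts, description)})."""
    lines = text.splitlines()
    if not lines:
        return [], {}
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        tokens = rest.split()
        counts = []
        # Rows such as ERR or MIS carry fewer counters than there are CPUs.
        for tok in tokens[: len(cpus)]:
            if not tok.isdigit():
                break
            counts.append(int(tok))
        description = " ".join(tokens[len(counts):])
        table[label.strip()] = (counts, description)
    return cpus, table


def cpu_distribution(table, ncpu, name_filter=""):
    """Percent of interrupts handled by each CPU for rows matching name_filter."""
    totals = [0] * ncpu
    for counts, description in table.values():
        if name_filter not in description:
            continue
        for i, count in enumerate(counts[:ncpu]):
            totals[i] += count
    grand = sum(totals)
    return [100.0 * t / grand if grand else 0.0 for t in totals]


if __name__ == "__main__":
    if os.path.exists(INTERRUPTS_PATH):
        with open(INTERRUPTS_PATH) as f:
            cpus, table = parse_interrupts(f.read())
        # Assumption: the GPU driver's IRQ rows contain "fglrx".
        shares = cpu_distribution(table, len(cpus), "fglrx")
        for name, pct in zip(cpus, shares):
            print(f"{name}: {pct:.1f}% of matching interrupts")
    else:
        print(f"{INTERRUPTS_PATH} not found; this sketch only works on Linux")
```

                      The counters are cumulative since boot, so comparing two runs taken a few seconds apart shows where interrupts are landing right now rather than historically.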

                      The motherboard is a Supermicro X9DR3-F.

                       

                      Here are all fglrx messages since startup:

                       

                      [    6.601378] fglrx: module license 'Proprietary. (C) 2002 - ATI Technologies, Starnberg, GERMANY' taints kernel.
                      [    6.621848] <6>[fglrx] Maximum main memory to use for locked dma buffers: 63176 MBytes.
                      [    6.622941] <6>[fglrx]   vendor: 1002 device: 679b count: 1
                      [    6.622943] <6>[fglrx]   vendor: 1002 device: 679b count: 2
                      [    6.622948] <6>[fglrx]   vendor: 1002 device: 679b count: 3
                      [    6.622949] <6>[fglrx]   vendor: 1002 device: 679b count: 4
                      [    6.624711] <6>[fglrx] ioport: bar 4, base 0x6000, size: 0x100
                      [    6.624758] <6>[fglrx] ioport: bar 4, base 0x5000, size: 0x100
                      [    6.624775] <6>[fglrx] ioport: bar 4, base 0xf000, size: 0x100
                      [    6.624796] <6>[fglrx] ioport: bar 4, base 0xe000, size: 0x100
                      [    6.625521] <6>[fglrx] Kernel PAT support is enabled
                      [    6.625555] <6>[fglrx] module loaded - fglrx 13.20.5 [Sep 21 2013] with 4 minors
                      [   10.646986] fglrx_pci 0000:06:00.0: irq 144 for MSI/MSI-X
                      [   10.647598] <6>[fglrx] Firegl kernel thread PID: 1693
                      [   10.647768] <6>[fglrx] Firegl kernel thread PID: 1694
                      [   10.647936] <6>[fglrx] Firegl kernel thread PID: 1695
                      [   10.648052] <6>[fglrx] IRQ 144 Enabled
                      [   10.656594] <6>[fglrx] Reserved FB block: Shared offset:0, size:1000000
                      [   10.656596] <6>[fglrx] Reserved FB block: Unshared offset:f878000, size:4000
                      [   10.656598] <6>[fglrx] Reserved FB block: Unshared offset:f87c000, size:484000
                      [   10.656599] <6>[fglrx] Reserved FB block: Unshared offset:fff8000, size:8000
                      [   10.656601] <6>[fglrx] Reserved FB block: Unshared offset:bfff4000, size:c000
                      [   10.968939] fglrx_pci 0000:07:00.0: irq 145 for MSI/MSI-X
                      [   10.969482] <6>[fglrx] Firegl kernel thread PID: 1696
                      [   10.969565] <6>[fglrx] Firegl kernel thread PID: 1697
                      [   10.969638] <6>[fglrx] Firegl kernel thread PID: 1698
                      [   10.969766] <6>[fglrx] IRQ 145 Enabled
                      [   10.979921] <6>[fglrx] Reserved FB block: Shared offset:0, size:1000000
                      [   10.979924] <6>[fglrx] Reserved FB block: Unshared offset:f878000, size:4000
                      [   10.979925] <6>[fglrx] Reserved FB block: Unshared offset:f87c000, size:484000
                      [   10.979927] <6>[fglrx] Reserved FB block: Unshared offset:bfff4000, size:c000
                      [   11.817988] fglrx_pci 0000:84:00.0: irq 146 for MSI/MSI-X
                      [   11.818583] <6>[fglrx] Firegl kernel thread PID: 1700
                      [   11.818703] <6>[fglrx] Firegl kernel thread PID: 1701
                      [   11.818808] <6>[fglrx] Firegl kernel thread PID: 1702
                      [   11.818918] <6>[fglrx] IRQ 146 Enabled
                      [   11.831381] <6>[fglrx] Reserved FB block: Shared offset:0, size:1000000
                      [   11.831384] <6>[fglrx] Reserved FB block: Unshared offset:f878000, size:4000
                      [   11.831385] <6>[fglrx] Reserved FB block: Unshared offset:f87c000, size:484000
                      [   11.831387] <6>[fglrx] Reserved FB block: Unshared offset:bfff4000, size:c000
                      [   12.671659] fglrx_pci 0000:85:00.0: irq 147 for MSI/MSI-X
                      [   12.672299] <6>[fglrx] Firegl kernel thread PID: 1703
                      [   12.672439] <6>[fglrx] Firegl kernel thread PID: 1704
                      [   12.672589] <6>[fglrx] Firegl kernel thread PID: 1705
                      [   12.672716] <6>[fglrx] IRQ 147 Enabled
                      [   12.685148] <6>[fglrx] Reserved FB block: Shared offset:0, size:1000000
                      [   12.685150] <6>[fglrx] Reserved FB block: Unshared offset:f878000, size:4000
                      [   12.685152] <6>[fglrx] Reserved FB block: Unshared offset:f87c000, size:484000
                      [   12.685153] <6>[fglrx] Reserved FB block: Unshared offset:bfff4000, size:c000
                      [  915.024102] <3>[fglrx:KAS_Mutex_Release] *ERROR* Mutex released without holding it.
                      [ 1036.374665] <3>[fglrx:KAS_Mutex_Release] *ERROR* Mutex released without holding it.
                      [ 1290.222935] <3>[fglrx:KAS_Mutex_Release] *ERROR* Mutex released without holding it.
                      [ 1306.826927] <3>[fglrx:KAS_Mutex_Release] *ERROR* Mutex released without holding it.
                      [ 1344.754973] <3>[fglrx:KAS_Mutex_Release] *ERROR* Mutex released without holding it.
                      [ 1421.088117] <3>[fglrx:KAS_Mutex_Release] *ERROR* Mutex released without holding it.
                      [ 1612.406398] <3>[fglrx:KAS_Mutex_Release] *ERROR* Mutex released without holding it.