Archives Discussions

fredriko · 10-09-2013

Hello!

When I run a program I developed in JavaCL, it runs well on the development system, a single-CPU 3.5 GHz i7 machine with 2 HD7990 cards and 2 x16 PCIe slots that clock down to x8 when both cards are installed. But on the intended production system, a dual 2.4 GHz Xeon system with 3 x16 PCIe slots and also 2 HD7990 cards, the interrupts take about 30% on each of the cores. The interrupts start when I start accessing things on the GPU cards, but they are there even when there is no load on the GPU cards. The GPUs are also quite a bit slower on the Xeon machine.

Is this something anyone else has seen? Do you have any ideas how to debug this?

himanshu_gautam · 10-09-2013

This could be either a BIOS or a board problem.

Can you try with a single GPU?

Is your board supplying adequate power?

What is the interrupt rate on the i7 machine?

How do you check this interrupt rate?

Which OS are you running? 32-bit or 64-bit?

When you said "The interrupts start when I start accessing stuff..." -- are you talking about running OpenCL programs?

So when you start running your programs, the interrupt rate rises and then stays at that rate even when you don't run OpenCL programs.

Is that what you mean?

- Bruhaspati

fredriko · 10-09-2013

I tried with one card just now and the interrupt rate is halved, but it is still about 15% on all 8 cores.

The power should be enough: we ran with 3 cards at most and that worked. We have a 1200 W PSU, and the card itself is a workstation card that should be OK with GPUs.

On the i7 machine the interrupt rate is negligible. I check it with top on 64-bit Linux (Ubuntu 13.04) by looking at how much %CPU the ksoftirqd processes use.

The interrupts start when I make the first accesses to OpenCL, before I start running calculations on the GPUs, and they continue until I shut down the program, even if the calculations are done. It feels like OpenCL is using some kind of callbacks from the GPUs via interrupts, and their rate is so high that it takes a lot of CPU time. But why do I not see it on the i7 machine?
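For anyone who wants a number per core rather than eyeballing top, here is a minimal sketch that computes the softirq share of each core from two snapshots of /proc/stat. It relies on the documented column order of the per-CPU lines (user, nice, system, idle, iowait, irq, softirq, steal, ...). The function names are made up for this example, and this is a different measurement from the ksoftirqd %CPU that top reports, so the two numbers need not match exactly.

```python
import time

# Column order of a "cpuN" line in /proc/stat, after the name:
# user nice system idle iowait irq softirq steal guest guest_nice
SOFTIRQ = 6
COUNTED = 8  # guest time is already included in user/nice, so stop at steal


def parse_cpu_times(stat_text):
    """Return {"cpu0": [counters...], ...} for the per-core lines only."""
    times = {}
    for line in stat_text.splitlines():
        fields = line.split()
        # Skip the aggregate "cpu" line and everything that is not a CPU line.
        if fields and fields[0].startswith("cpu") and fields[0] != "cpu":
            times[fields[0]] = [int(v) for v in fields[1:]]
    return times


def softirq_percent(before, after):
    """Percentage of elapsed CPU time spent in softirq, per core."""
    result = {}
    for cpu, new in after.items():
        old = before.get(cpu)
        if old is None:
            continue
        total = sum(new[:COUNTED]) - sum(old[:COUNTED])
        delta = new[SOFTIRQ] - old[SOFTIRQ]
        result[cpu] = 100.0 * delta / total if total > 0 else 0.0
    return result


def sample(path="/proc/stat", interval=1.0):
    """Take two snapshots `interval` seconds apart (Linux only)."""
    with open(path) as f:
        before = parse_cpu_times(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_times(f.read())
    return softirq_percent(before, after)
```

Calling `sample()` once before the first OpenCL call and once after it should make the jump described above visible per core, if the softirq column reflects it on that kernel.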

himanshu_gautam · 10-09-2013

Can you quickly check "dmesg" for any abnormal messages from "fglrx" driver?

+ What motherboard are you using?

fredriko · 10-09-2013

I have done a bit more debugging. Watching /proc/interrupts shows that all IRQs go to core 0 on the Xeon machine, while on the i7 machine they are divided evenly across all cores.

The motherboard is a Supermicro X9DR3-F.
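The same check can be scripted. Below is a rough sketch that parses the text of /proc/interrupts and reports how each matching IRQ is spread across the cores. The function names are invented for this example, and the assumption that the GPU rows carry "fglrx" in their description column is a guess that should be checked against the real file.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts text into (cpu_names, {irq: (counts, description)})."""
    lines = text.splitlines()
    if not lines:
        return [], {}
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    rows = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        counts = []
        # Some rows (for example ERR, MIS) have fewer counters than CPUs.
        for field in fields[:len(cpus)]:
            if not field.isdigit():
                break
            counts.append(int(field))
        description = " ".join(fields[len(counts):])
        rows[label.strip()] = (counts, description)
    return cpus, rows


def cpu_shares(counts):
    """Fraction of an IRQ's total that each CPU handled."""
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


def matching_irqs(rows, needle="fglrx"):
    """Rows whose description mentions `needle` (assumed driver name)."""
    return {irq: row for irq, row in rows.items() if needle in row[1]}


def report(path="/proc/interrupts", needle="fglrx"):
    """Print the per-CPU share of every matching IRQ (Linux only)."""
    with open(path) as f:
        cpus, rows = parse_interrupts(f.read())
    for irq, (counts, description) in matching_irqs(rows, needle).items():
        shares = ", ".join(
            f"{cpu}={share:.0%}" for cpu, share in zip(cpus, cpu_shares(counts))
        )
        print(f"IRQ {irq} ({description}): {shares}")
```

A row where one CPU holds close to 100% would correspond to the "all IRQs go to core 0" picture, and roughly equal shares to what the i7 machine shows.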

Here are all fglrx messages since startup:

[    6.601378] fglrx: module license 'Proprietary. (C) 2002 - ATI Technologies, Starnberg, GERMANY' taints kernel.
[    6.621848] <6>[fglrx] Maximum main memory to use for locked dma buffers: 63176 MBytes.
[    6.622941] <6>[fglrx]   vendor: 1002 device: 679b count: 1
[    6.622943] <6>[fglrx]   vendor: 1002 device: 679b count: 2
[    6.622948] <6>[fglrx]   vendor: 1002 device: 679b count: 3
[    6.622949] <6>[fglrx]   vendor: 1002 device: 679b count: 4
[    6.624711] <6>[fglrx] ioport: bar 4, base 0x6000, size: 0x100
[    6.624758] <6>[fglrx] ioport: bar 4, base 0x5000, size: 0x100
[    6.624775] <6>[fglrx] ioport: bar 4, base 0xf000, size: 0x100
[    6.624796] <6>[fglrx] ioport: bar 4, base 0xe000, size: 0x100
[    6.625521] <6>[fglrx] Kernel PAT support is enabled
[    6.625555] <6>[fglrx] module loaded - fglrx 13.20.5 [Sep 21 2013] with 4 minors
[   10.646986] fglrx_pci 0000:06:00.0: irq 144 for MSI/MSI-X
[   10.647598] <6>[fglrx] Firegl kernel thread PID: 1693
[   10.647768] <6>[fglrx] Firegl kernel thread PID: 1694
[   10.647936] <6>[fglrx] Firegl kernel thread PID: 1695
[   10.648052] <6>[fglrx] IRQ 144 Enabled
[   10.656594] <6>[fglrx] Reserved FB block: Shared offset:0, size:1000000
[   10.656596] <6>[fglrx] Reserved FB block: Unshared offset:f878000, size:4000
[   10.656598] <6>[fglrx] Reserved FB block: Unshared offset:f87c000, size:484000
[   10.656599] <6>[fglrx] Reserved FB block: Unshared offset:fff8000, size:8000
[   10.656601] <6>[fglrx] Reserved FB block: Unshared offset:bfff4000, size:c000
[   10.968939] fglrx_pci 0000:07:00.0: irq 145 for MSI/MSI-X
[   10.969482] <6>[fglrx] Firegl kernel thread PID: 1696
[   10.969565] <6>[fglrx] Firegl kernel thread PID: 1697
[   10.969638] <6>[fglrx] Firegl kernel thread PID: 1698
[   10.969766] <6>[fglrx] IRQ 145 Enabled
[   10.979921] <6>[fglrx] Reserved FB block: Shared offset:0, size:1000000
[   10.979924] <6>[fglrx] Reserved FB block: Unshared offset:f878000, size:4000
[   10.979925] <6>[fglrx] Reserved FB block: Unshared offset:f87c000, size:484000
[   10.979927] <6>[fglrx] Reserved FB block: Unshared offset:bfff4000, size:c000
[   11.817988] fglrx_pci 0000:84:00.0: irq 146 for MSI/MSI-X
[   11.818583] <6>[fglrx] Firegl kernel thread PID: 1700
[   11.818703] <6>[fglrx] Firegl kernel thread PID: 1701
[   11.818808] <6>[fglrx] Firegl kernel thread PID: 1702
[   11.818918] <6>[fglrx] IRQ 146 Enabled
[   11.831381] <6>[fglrx] Reserved FB block: Shared offset:0, size:1000000
[   11.831384] <6>[fglrx] Reserved FB block: Unshared offset:f878000, size:4000
[   11.831385] <6>[fglrx] Reserved FB block: Unshared offset:f87c000, size:484000
[   11.831387] <6>[fglrx] Reserved FB block: Unshared offset:bfff4000, size:c000
[   12.671659] fglrx_pci 0000:85:00.0: irq 147 for MSI/MSI-X
[   12.672299] <6>[fglrx] Firegl kernel thread PID: 1703
[   12.672439] <6>[fglrx] Firegl kernel thread PID: 1704
[   12.672589] <6>[fglrx] Firegl kernel thread PID: 1705
[   12.672716] <6>[fglrx] IRQ 147 Enabled
[   12.685148] <6>[fglrx] Reserved FB block: Shared offset:0, size:1000000
[   12.685150] <6>[fglrx] Reserved FB block: Unshared offset:f878000, size:4000
[   12.685152] <6>[fglrx] Reserved FB block: Unshared offset:f87c000, size:484000
[   12.685153] <6>[fglrx] Reserved FB block: Unshared offset:bfff4000, size:c000
[ 915.024102] <3>[fglrx:KAS_Mutex_Release] *ERROR* Mutex released without holding it.
[ 1036.374665] <3>[fglrx:KAS_Mutex_Release] *ERROR* Mutex released without holding it.
[ 1290.222935] <3>[fglrx:KAS_Mutex_Release] *ERROR* Mutex released without holding it.
[ 1306.826927] <3>[fglrx:KAS_Mutex_Release] *ERROR* Mutex released without holding it.
[ 1344.754973] <3>[fglrx:KAS_Mutex_Release] *ERROR* Mutex released without holding it.
[ 1421.088117] <3>[fglrx:KAS_Mutex_Release] *ERROR* Mutex released without holding it.
[ 1612.406398] <3>[fglrx:KAS_Mutex_Release] *ERROR* Mutex released without holding it.
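To pull the abnormal lines out of a log like this one, a small filter helps. The sketch below assumes the line layout seen above: a bracketed timestamp, then an optional `<N>` prefix, where N is the kernel log level (3 is the error level, 6 is informational). The function names are invented for illustration.

```python
import re
from collections import Counter

# "[  915.024102] <3>[fglrx:...] message"; the <N> prefix may be absent.
LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(?:<(\d)>)?\s*(.*)$")


def parse_dmesg(text):
    """Return a list of (timestamp_seconds, level_or_None, message)."""
    entries = []
    for line in text.splitlines():
        match = LINE.match(line)
        if not match:
            continue
        timestamp, level, message = match.groups()
        entries.append(
            (float(timestamp), int(level) if level else None, message)
        )
    return entries


def error_summary(entries, max_level=3):
    """Count messages at level `max_level` or more severe (lower number)."""
    return Counter(
        message
        for _, level, message in entries
        if level is not None and level <= max_level
    )
```

Feeding it the output of `dmesg | grep fglrx` from above should leave a single entry, the KAS_Mutex_Release message, counted 7 times.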

himanshu_gautam · 10-10-2013

I think that makes sense... Great debugging, good find!

Interrupt routing is determined by how the LAPIC/IOAPIC are programmed, and that programming is done by the OS.

Is the OS different on the two machines?
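On Linux, the routing the OS has chosen for a given IRQ can be inspected through /proc/irq/<n>/smp_affinity_list, which holds a CPU list such as "0-3,8". A minimal sketch for reading it follows; the function names are invented for this example, and the IRQ numbers in the usage note are the ones from the dmesg output earlier in this thread.

```python
def parse_cpu_list(text):
    """Expand a kernel CPU list such as "0-3,8" into [0, 1, 2, 3, 8]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def irq_affinity(irq, proc_root="/proc"):
    """CPUs that the given IRQ is allowed to be delivered to (Linux only)."""
    with open(f"{proc_root}/irq/{irq}/smp_affinity_list") as f:
        return parse_cpu_list(f.read())
```

For example, `{irq: irq_affinity(irq) for irq in (144, 145, 146, 147)}` would show whether the four GPU interrupts are restricted to one core or allowed on all of them. Comparing that output between the Xeon and the i7 machine would be one way to tell a configuration difference from a driver difference.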

(or) Probably your BIOS is screwing it up... Just check the interrupt settings in BIOS

- Bruhaspati

fredriko · 10-11-2013

I reinstalled the machine with Ubuntu 12.04 and did a clean install of the AMD drivers (the new ones that came out yesterday). After that it looked a lot better: now there is one interrupt for one card and another for the other card. It is still only using 2 cores instead of all 8, but it is a 100% improvement.

I tried running with noapic before that, and it made no difference.

himanshu_gautam · 10-11-2013

Good! Can I consider this issue as done?

fredriko · 10-14-2013

Yes, but I still think there might be an issue with NUMA machines, though I am not certain.

It works well enough for me now.
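One way to look into the NUMA suspicion is to check which node each GPU is attached to. Linux exposes this as /sys/bus/pci/devices/<address>/numa_node, where -1 means the kernel does not know. The sketch below is only a starting point: the function names are invented, and the PCI addresses in the usage note are taken from the dmesg output earlier in the thread, so they may differ after a reinstall.

```python
import os


def pci_numa_node(address, sysfs_root="/sys"):
    """NUMA node of a PCI device, or None if the file is missing or unreadable."""
    path = os.path.join(
        sysfs_root, "bus", "pci", "devices", address, "numa_node"
    )
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def group_by_node(addresses, sysfs_root="/sys"):
    """Map each NUMA node (or None) to the devices reported on it."""
    groups = {}
    for address in addresses:
        node = pci_numa_node(address, sysfs_root)
        groups.setdefault(node, []).append(address)
    return groups
```

For example, `group_by_node(["0000:06:00.0", "0000:07:00.0", "0000:84:00.0", "0000:85:00.0"])` would show whether the two cards sit behind different sockets. If they do, pinning each card's interrupt and the host thread that feeds it to cores on the same node is one thing that could be tried, though nothing in this thread confirms that it helps.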

himanshu_gautam · 10-18-2013

Hi Fredrick,

Another user who had this problem solved it by upgrading to Catalyst 13.11.

To quote the user:

"

I've just upgraded to Catalyst 13.11. I created a deb package for it and installed it, and the ksoftirqd issues gone...

"

For more details, check:

Radeon HD 7950 + Linux + OpenCL = ksoftirqd spam

Best,

Bruhaspati

High interrupt rate on multigpu multi cpu system but not on multigpu single cpu system