jhoffmann

PCIe Performance Problem with HD5870

Discussion created by jhoffmann on Mar 31, 2010
Latest reply on Feb 17, 2011 by darkradeon
Catalyst 10.3, Linux 2.6.31.12 (x86_64)

Hi guys,

we have measured a PCIe performance impact executing CPU->GPU and, even harder, GPU->CPU transfers. The impacts was found with ATI's PCIeSpeedTest PowerToy (cal), with NVidias OCLBandwith test (opencl) and with our own benchmark (opencl). See below.

We think that this is a driver bug, because the hardware link is set up properly to PCIe 16x, 5GT/s (checked with lspci -vv).
Maybe someone has an idea how we can fix this?

Regards
Joern Hoffmann
University of Leipzig
Computer Engineering Group


Hardware: 20 PCs each with a HD5870, Core i7 950, 12GB DDR running on a Asus P6T SE board.
Software: OpenSuse 11.2, Linux 2.6.31.12, glibc-2.1, Xorg 7.4-35.3, Xserver 1.6.5
Driver  : fglrx 8.712(10.3), also testet: 8.712.3.1 (10.3 OGL4 preview)


Measure (1): PCIe SpeedTest v0.2 on HD5870
------------
Peak CPU->GPU Bandwidth =   4.324 GB/sec [data size = 4194304 bytes]
Peak GPU->CPU Bandwidth = 655.360 MB/sec [data size = 32768 bytes]

-> Arghhh, peak at 650 MB/sec!


Measure (2a): oclBandWidthTest on HD5870
-------------
 Host to Device Bandwidth, 1 Device(s), Paged memory, direct access
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432            1503.7

 Device to Host Bandwidth, 1 Device(s), Paged memory, direct access
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432            1042.7

 Device to Device Bandwidth, 1 Device(s)
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432            106887.6


Measure (2b): oclBandWidthTest on NVidia 9800GT
-------------
Host to Device Bandwidth, 1 Device(s), Paged memory, direct access
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432            2280.9

 Device to Host Bandwidth, 1 Device(s), Paged memory, direct access
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432            1723.5

 Device to Device Bandwidth, 1 Device(s)
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432            49929.0


Measure (3a): transfer of 8192 float numbers (32kb) on HD5870
-------------
OpenCL buffer transfer time
  submission-to-start  : 440529 ns
  execution time       :  29420 ns

Measure (3b): transfer of 8192 float numbers (32kb) on NVidia 9800GT
-------------
OpenCL buffer transfer time
  submission-to-start  :  44608 ns
  execution time       :  15712 ns


lspci -vv:
----------
00:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 3 (rev 13) (prog-if 00 [Normal decode])
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR+ <PERR- INTx-
    Latency: 0, Cache Line Size: 256 bytes
    Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
    I/O behind bridge: 0000b000-0000bfff
    Memory behind bridge: fbb00000-fbbfffff
    Prefetchable memory behind bridge: 00000000d0000000-00000000dfffffff
    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
    BridgeCtl: Parity- SERR+ NoISA- VGA+ MAbort- >Reset- FastB2B-
        PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
    Capabilities: [40] Subsystem: ASUSTeK Computer Inc. Device 836b
    Capabilities: [60] MSI: Enable+ Count=1/2 Maskable+ 64bit-
        Address: fee002b8  Data: 0000
        Masking: 00000003  Pending: 00000000
    Capabilities: [90] Express (v2) Root Port (Slot+), MSI 00
        DevCap:    MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
            ExtTag+ RBE+ FLReset-
        DevCtl:    Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
            MaxPayload 128 bytes, MaxReadReq 128 bytes
        DevSta:    CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap:    Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Latency L0 <512ns, L1 <64us
            ClockPM- Surprise+ LLActRep+ BwNot+
        LnkCtl:    ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta:    Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt+
        SltCap:    AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surpise-
            Slot #  2, PowerLimit 75.000000; Interlock- NoCompl-
        SltCtl:    Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
            Control: AttnInd Off, PwrInd Off, Power- Interlock-
        SltSta:    Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
            Changed: MRL- PresDet+ LinkState+
        RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
        RootCap: CRSVisible-
        RootSta: PME ReqID 0000, PMEStatus- PMEPending-
        DevCap2: Completion Timeout: Range BCD, TimeoutDis+ ARIFwd+
        DevCtl2: Completion Timeout: 260ms to 900ms, TimeoutDis- ARIFwd-
        LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB
    Capabilities: [e0] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [100] Advanced Error Reporting
        UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt:    DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
        CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        AERCap:    First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
    Capabilities: [150] Access Control Services
        ACSCap:    SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
        ACSCtl:    SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
    Capabilities: [160] Vendor Specific Information <?>
    Kernel driver in use: pcieport-driver


02:00.0 VGA compatible controller: ATI Technologies Inc Device 6898 (prog-if 00 [VGA controller])
    Subsystem: ATI Technologies Inc Device 0b00
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR+ <PERR- INTx-
    Latency: 0, Cache Line Size: 256 bytes
    Interrupt: pin A routed to IRQ 59
    Region 0: Memory at d0000000 (64-bit, prefetchable) [size=256M]
    Region 2: Memory at fbbc0000 (64-bit, non-prefetchable) [size=128K]
    Region 4: I/O ports at b000 [size=256]
    Expansion ROM at fbba0000 [disabled] [size=128K]
    Capabilities: [50] Power Management version 3
        Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
        DevCap:    MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
        DevCtl:    Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 128 bytes, MaxReadReq 512 bytes
        DevSta:    CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap:    Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Latency L0 <64ns, L1 <1us
            ClockPM- Surprise- LLActRep- BwNot-
        LnkCtl:    ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta:    Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Not Supported, TimeoutDis-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
        LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB
    Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Address: 00000000fee00498  Data: 0000
    Capabilities: [100] Vendor Specific Information <?>
    Capabilities: [150] Advanced Error Reporting
        UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
        UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt:    DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        AERCap:    First Error Pointer: 14, GenCap+ CGenEn- ChkCap+ ChkEn-
    Kernel driver in use: fglrx_pci

===> Testing device 0 <=== Device type: Unknown Max resource 2D width/height: 16384/16384 Total GPU memory size: 1024 MB Total CPU cached space size: 508 MB Total CPU uncached space size: 1279 MB GPU engine clock: 0 MHz GPU memory clock: 0 MHz Number of timing loops: 100 [ 16 bytes] CPU->GPU= 1.600 MB/sec, GPU->CPU= 800.000 KB/sec [ 32 bytes] CPU->GPU= 1.600 MB/sec, GPU->CPU= 533.333 KB/sec [ 64 bytes] CPU->GPU= 2.133 MB/sec, GPU->CPU= 2.133 MB/sec [ 128 bytes] CPU->GPU= 4.267 MB/sec, GPU->CPU= 4.267 MB/sec [ 256 bytes] CPU->GPU= 8.533 MB/sec, GPU->CPU= 8.533 MB/sec [ 512 bytes] CPU->GPU= 17.067 MB/sec, GPU->CPU= 25.600 MB/sec [ 1024 bytes] CPU->GPU= 17.067 MB/sec, GPU->CPU= 34.133 MB/sec [ 2048 bytes] CPU->GPU= 68.267 MB/sec, GPU->CPU= 68.267 MB/sec [ 4096 bytes] CPU->GPU= 136.533 MB/sec, GPU->CPU= 204.800 MB/sec [ 8192 bytes] CPU->GPU= 409.600 MB/sec, GPU->CPU= 273.067 MB/sec [ 16384 bytes] CPU->GPU= 546.133 MB/sec, GPU->CPU= 546.133 MB/sec [ 32768 bytes] CPU->GPU= 1.092 GB/sec, GPU->CPU= 655.360 MB/sec [ 65536 bytes] CPU->GPU= 2.185 GB/sec, GPU->CPU= 595.782 MB/sec [ 131072 bytes] CPU->GPU= 3.277 GB/sec, GPU->CPU= 524.288 MB/sec [ 262144 bytes] CPU->GPU= 3.277 GB/sec, GPU->CPU= 485.452 MB/sec [ 524288 bytes] CPU->GPU= 3.745 GB/sec, GPU->CPU= 472.332 MB/sec [ 1048576 bytes] CPU->GPU= 4.194 GB/sec, GPU->CPU= 459.902 MB/sec [ 2097152 bytes] CPU->GPU= 4.280 GB/sec, GPU->CPU= 449.069 MB/sec [ 4194304 bytes] CPU->GPU= 4.324 GB/sec, GPU->CPU= 442.904 MB/sec [ 8388608 bytes] CPU->GPU= 4.280 GB/sec, GPU->CPU= 438.964 MB/sec [ 16777216 bytes] CPU->GPU= 4.258 GB/sec, GPU->CPU= 437.476 MB/sec [ 33554432 bytes] CPU->GPU= 4.052 GB/sec, GPU->CPU= 443.607 MB/sec [ 67108864 bytes] CPU->GPU= 4.090 GB/sec, GPU->CPU= 452.826 MB/sec [ 134217728 bytes] CPU->GPU= 4.108 GB/sec, GPU->CPU= 468.212 MB/sec [ 268435456 bytes] CPU->GPU= 4.136 GB/sec, GPU->CPU= 492.307 MB/sec [ 536870912 bytes] CPU->GPU= 4.211 GB/sec, GPU->CPU= 496.065 MB/sec calResAllocLocal2D() returned an error when trying to allocate 1073741824 bytes! Peak CPU->GPU Bandwidth = 4.324 GB/sec [data size = 4194304 bytes] Peak GPU->CPU Bandwidth = 655.360 MB/sec [data size = 32768 bytes]

Outcomes