

jhoffmann
Journeyman III

PCIe Performance Problem with HD5870

Catalyst 10.3, Linux 2.6.31.12 (x86_64)

Hi guys,

we have measured a PCIe performance problem with CPU->GPU and, even worse, GPU->CPU transfers. The problem shows up with ATI's PCIeSpeedTest PowerToy (CAL), with NVIDIA's oclBandwidthTest (OpenCL) and with our own benchmark (OpenCL). See below.

We think that this is a driver bug, because the hardware link is set up properly as PCIe x16, 5 GT/s (checked with lspci -vv).
Maybe someone has an idea how we can fix this?

Regards
Joern Hoffmann
University of Leipzig
Computer Engineering Group


Hardware: 20 PCs, each with an HD5870, a Core i7 950 and 12 GB DDR, running on an Asus P6T SE board.
Software: OpenSuse 11.2, Linux 2.6.31.12, glibc-2.1, Xorg 7.4-35.3, Xserver 1.6.5
Driver  : fglrx 8.712 (10.3), also tested: 8.712.3.1 (10.3 OGL4 preview)


Measure (1): PCIe SpeedTest v0.2 on HD5870
------------
Peak CPU->GPU Bandwidth =   4.324 GB/sec [data size = 4194304 bytes]
Peak GPU->CPU Bandwidth = 655.360 MB/sec [data size = 32768 bytes]

-> Arghhh, peak at 650 MB/sec!


Measure (2a): oclBandWidthTest on HD5870
-------------
 Host to Device Bandwidth, 1 Device(s), Paged memory, direct access
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432            1503.7

 Device to Host Bandwidth, 1 Device(s), Paged memory, direct access
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432            1042.7

 Device to Device Bandwidth, 1 Device(s)
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432            106887.6


Measure (2b): oclBandWidthTest on NVidia 9800GT
-------------
Host to Device Bandwidth, 1 Device(s), Paged memory, direct access
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432            2280.9

 Device to Host Bandwidth, 1 Device(s), Paged memory, direct access
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432            1723.5

 Device to Device Bandwidth, 1 Device(s)
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432            49929.0


Measure (3a): transfer of 8192 float numbers (32 KB) on HD5870
-------------
OpenCL buffer transfer time
  submission-to-start  : 440529 ns
  execution time       :  29420 ns

Measure (3b): transfer of 8192 float numbers (32 KB) on NVidia 9800GT
-------------
OpenCL buffer transfer time
  submission-to-start  :  44608 ns
  execution time       :  15712 ns
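
For comparison with the other measurements, the timings in (3a) and (3b) can be converted into an effective bandwidth. The following is only a rough sketch: it assumes 4-byte floats and 1 MB = 10^6 bytes, and the function name is made up for illustration. It computes the bandwidth once from the execution time alone and once including the submission-to-start delay.

```python
def effective_bandwidth_mb_s(num_bytes, time_ns):
    """Bandwidth in MB/s (1 MB = 10**6 bytes) for num_bytes moved in time_ns."""
    return num_bytes / (time_ns * 1e-9) / 1e6

# 8192 floats, assuming 4 bytes per float (32768 bytes in total)
TRANSFER_BYTES = 8192 * 4

# Timings copied from Measure (3a) and (3b)
measurements = {
    "HD5870": {"submit_to_start_ns": 440529, "execution_ns": 29420},
    "9800GT": {"submit_to_start_ns": 44608, "execution_ns": 15712},
}

for name, m in measurements.items():
    exec_only = effective_bandwidth_mb_s(TRANSFER_BYTES, m["execution_ns"])
    total_ns = m["submit_to_start_ns"] + m["execution_ns"]
    end_to_end = effective_bandwidth_mb_s(TRANSFER_BYTES, total_ns)
    print(f"{name}: execution only {exec_only:.1f} MB/s, "
          f"including submission delay {end_to_end:.1f} MB/s")
```

For a transfer this small, the submission-to-start delay dominates the end-to-end figure on both cards, so the two numbers should be read separately.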


lspci -vv:
----------
00:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 3 (rev 13) (prog-if 00 [Normal decode])
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR+ <PERR- INTx-
    Latency: 0, Cache Line Size: 256 bytes
    Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
    I/O behind bridge: 0000b000-0000bfff
    Memory behind bridge: fbb00000-fbbfffff
    Prefetchable memory behind bridge: 00000000d0000000-00000000dfffffff
    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
    BridgeCtl: Parity- SERR+ NoISA- VGA+ MAbort- >Reset- FastB2B-
        PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
    Capabilities: [40] Subsystem: ASUSTeK Computer Inc. Device 836b
    Capabilities: [60] MSI: Enable+ Count=1/2 Maskable+ 64bit-
        Address: fee002b8  Data: 0000
        Masking: 00000003  Pending: 00000000
    Capabilities: [90] Express (v2) Root Port (Slot+), MSI 00
        DevCap:    MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
            ExtTag+ RBE+ FLReset-
        DevCtl:    Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
            MaxPayload 128 bytes, MaxReadReq 128 bytes
        DevSta:    CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap:    Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Latency L0 <512ns, L1 <64us
            ClockPM- Surprise+ LLActRep+ BwNot+
        LnkCtl:    ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta:    Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt+
        SltCap:    AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surpise-
            Slot #  2, PowerLimit 75.000000; Interlock- NoCompl-
        SltCtl:    Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
            Control: AttnInd Off, PwrInd Off, Power- Interlock-
        SltSta:    Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
            Changed: MRL- PresDet+ LinkState+
        RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
        RootCap: CRSVisible-
        RootSta: PME ReqID 0000, PMEStatus- PMEPending-
        DevCap2: Completion Timeout: Range BCD, TimeoutDis+ ARIFwd+
        DevCtl2: Completion Timeout: 260ms to 900ms, TimeoutDis- ARIFwd-
        LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB
    Capabilities: [e0] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [100] Advanced Error Reporting
        UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt:    DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
        CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        AERCap:    First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
    Capabilities: [150] Access Control Services
        ACSCap:    SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
        ACSCtl:    SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
    Capabilities: [160] Vendor Specific Information <?>
    Kernel driver in use: pcieport-driver


02:00.0 VGA compatible controller: ATI Technologies Inc Device 6898 (prog-if 00 [VGA controller])
    Subsystem: ATI Technologies Inc Device 0b00
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR+ <PERR- INTx-
    Latency: 0, Cache Line Size: 256 bytes
    Interrupt: pin A routed to IRQ 59
    Region 0: Memory at d0000000 (64-bit, prefetchable) [size=256M]
    Region 2: Memory at fbbc0000 (64-bit, non-prefetchable) [size=128K]
    Region 4: I/O ports at b000 [size=256]
    Expansion ROM at fbba0000 [disabled] [size=128K]
    Capabilities: [50] Power Management version 3
        Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
        DevCap:    MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
        DevCtl:    Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 128 bytes, MaxReadReq 512 bytes
        DevSta:    CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap:    Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Latency L0 <64ns, L1 <1us
            ClockPM- Surprise- LLActRep- BwNot-
        LnkCtl:    ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta:    Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Not Supported, TimeoutDis-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
        LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB
    Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Address: 00000000fee00498  Data: 0000
    Capabilities: [100] Vendor Specific Information <?>
    Capabilities: [150] Advanced Error Reporting
        UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
        UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt:    DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        AERCap:    First Error Pointer: 14, GenCap+ CGenEn- ChkCap+ ChkEn-
    Kernel driver in use: fglrx_pci
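
With 20 machines, checking the negotiated link by eye gets tedious. Below is a minimal sketch of how the LnkCap and LnkSta lines of `lspci -vv` output could be compared automatically. The function names and the regular expression are my own assumptions, written against the output format shown above; other lspci versions may format these lines differently.

```python
import re

# Matches e.g. "LnkCap: Port #0, Speed 5GT/s, Width x16" and
# "LnkSta: Speed 5GT/s, Width x16" (but not LnkSta2/LnkCtl2 lines).
LINK_RE = re.compile(
    r"(LnkCap|LnkSta):\s+(?:Port #\d+, )?Speed ([\d.]+)GT/s, Width x(\d+)"
)

def parse_links(lspci_text):
    """Return {device_address: {"LnkCap": (GT/s, lanes), "LnkSta": (GT/s, lanes)}}."""
    links = {}
    device = None
    for line in lspci_text.splitlines():
        # Unindented, non-empty lines start a new device section.
        if line and not line[0].isspace():
            device = line.split()[0]
            links[device] = {}
        match = LINK_RE.search(line)
        if match and device is not None:
            links[device][match.group(1)] = (float(match.group(2)), int(match.group(3)))
    return links

def is_downgraded(entry):
    """True if the negotiated link (LnkSta) is slower or narrower than LnkCap."""
    cap, sta = entry.get("LnkCap"), entry.get("LnkSta")
    if cap is None or sta is None:
        return False
    return sta[0] < cap[0] or sta[1] < cap[1]

# Excerpt of the GPU section from the output above
SAMPLE = (
    "02:00.0 VGA compatible controller: ATI Technologies Inc Device 6898\n"
    "        LnkCap:    Port #0, Speed 5GT/s, Width x16, ASPM L0s L1\n"
    "        LnkSta:    Speed 5GT/s, Width x16, TrErr- Train- SlotClk+\n"
)

for address, entry in parse_links(SAMPLE).items():
    print(address, entry, "downgraded" if is_downgraded(entry) else "ok")
```

On a real machine, the text could come from running `lspci -vv` as root and passing its output to `parse_links`.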

===> Testing device 0 <===
Device type: Unknown
Max resource 2D width/height: 16384/16384
Total GPU memory size: 1024 MB
Total CPU cached space size: 508 MB
Total CPU uncached space size: 1279 MB
GPU engine clock: 0 MHz
GPU memory clock: 0 MHz
Number of timing loops: 100
[        16 bytes] CPU->GPU=   1.600 MB/sec, GPU->CPU= 800.000 KB/sec
[        32 bytes] CPU->GPU=   1.600 MB/sec, GPU->CPU= 533.333 KB/sec
[        64 bytes] CPU->GPU=   2.133 MB/sec, GPU->CPU=   2.133 MB/sec
[       128 bytes] CPU->GPU=   4.267 MB/sec, GPU->CPU=   4.267 MB/sec
[       256 bytes] CPU->GPU=   8.533 MB/sec, GPU->CPU=   8.533 MB/sec
[       512 bytes] CPU->GPU=  17.067 MB/sec, GPU->CPU=  25.600 MB/sec
[      1024 bytes] CPU->GPU=  17.067 MB/sec, GPU->CPU=  34.133 MB/sec
[      2048 bytes] CPU->GPU=  68.267 MB/sec, GPU->CPU=  68.267 MB/sec
[      4096 bytes] CPU->GPU= 136.533 MB/sec, GPU->CPU= 204.800 MB/sec
[      8192 bytes] CPU->GPU= 409.600 MB/sec, GPU->CPU= 273.067 MB/sec
[     16384 bytes] CPU->GPU= 546.133 MB/sec, GPU->CPU= 546.133 MB/sec
[     32768 bytes] CPU->GPU=   1.092 GB/sec, GPU->CPU= 655.360 MB/sec
[     65536 bytes] CPU->GPU=   2.185 GB/sec, GPU->CPU= 595.782 MB/sec
[    131072 bytes] CPU->GPU=   3.277 GB/sec, GPU->CPU= 524.288 MB/sec
[    262144 bytes] CPU->GPU=   3.277 GB/sec, GPU->CPU= 485.452 MB/sec
[    524288 bytes] CPU->GPU=   3.745 GB/sec, GPU->CPU= 472.332 MB/sec
[   1048576 bytes] CPU->GPU=   4.194 GB/sec, GPU->CPU= 459.902 MB/sec
[   2097152 bytes] CPU->GPU=   4.280 GB/sec, GPU->CPU= 449.069 MB/sec
[   4194304 bytes] CPU->GPU=   4.324 GB/sec, GPU->CPU= 442.904 MB/sec
[   8388608 bytes] CPU->GPU=   4.280 GB/sec, GPU->CPU= 438.964 MB/sec
[  16777216 bytes] CPU->GPU=   4.258 GB/sec, GPU->CPU= 437.476 MB/sec
[  33554432 bytes] CPU->GPU=   4.052 GB/sec, GPU->CPU= 443.607 MB/sec
[  67108864 bytes] CPU->GPU=   4.090 GB/sec, GPU->CPU= 452.826 MB/sec
[ 134217728 bytes] CPU->GPU=   4.108 GB/sec, GPU->CPU= 468.212 MB/sec
[ 268435456 bytes] CPU->GPU=   4.136 GB/sec, GPU->CPU= 492.307 MB/sec
[ 536870912 bytes] CPU->GPU=   4.211 GB/sec, GPU->CPU= 496.065 MB/sec
calResAllocLocal2D() returned an error when trying to allocate 1073741824 bytes!
Peak CPU->GPU Bandwidth = 4.324 GB/sec [data size = 4194304 bytes]
Peak GPU->CPU Bandwidth = 655.360 MB/sec [data size = 32768 bytes]

xero
Journeyman III

Hi Joern,

I got similar PCIe speed test results. The GPU->CPU transfer is very slow.

(CPU: Intel i5 750, MB: Intel P55, GPU: HD5870, OS: Linux 2.6.18 i386, Driver: 10.2)

Have you made any progress on this?

 

 

jhoffmann
Journeyman III

xero,

yes, we made some (negative) progress. We have now tested the PCIe transfer rates under Windows 7 x86_64 with PCIe SpeedTest v0.2 and also with SiSoftware Sandra. Unfortunately, the results are the same.

CPU-> GPU is about 4GB/s
GPU-> CPU is about 450 MB/s

We also checked the machine with several other benchmarks (3DMark, SiSoftware Sandra, Unigine Heaven). Measurements such as host memory bandwidth, GPU performance and CPU speed look quite good. Only the PCIe transfer rate is off: GPU->CPU in particular reaches only around 6% of its theoretical maximum.
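
The 6% figure can be reproduced with a short calculation. This is a back-of-the-envelope sketch, assuming a PCIe 2.0 x16 link (5 GT/s per lane with 8b/10b line coding) and ignoring packet and protocol overhead, so the real achievable peak is somewhat lower. The function name is only illustrative.

```python
def pcie_gen2_peak_bytes_per_s(lanes, gt_per_s=5.0):
    """Raw per-direction payload rate of a PCIe 2.0 link in bytes/s."""
    # 8b/10b line coding: 8 payload bits for every 10 bits on the wire.
    payload_bits_per_s = gt_per_s * 1e9 * lanes * 8 / 10
    return payload_bits_per_s / 8  # bits -> bytes

peak = pcie_gen2_peak_bytes_per_s(lanes=16)  # 8 GB/s per direction
measured_gpu_to_cpu = 450e6                  # ~450 MB/s, as measured above

fraction = measured_gpu_to_cpu / peak
print(f"theoretical peak: {peak / 1e9:.1f} GB/s")
print(f"GPU->CPU reaches {fraction * 100:.1f}% of the peak")
```

By the same calculation, the ~4 GB/s measured for CPU->GPU is about half of the raw link rate.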

 

To be explicit: we use the most recent drivers from ATI (10.3) and Intel (chipset autoinst 911). We have also flashed the BIOS to the latest version (v808).

Furthermore, I played around with the BIOS settings, e.g. disabled the sleep states, manually configured the memory, manually set the QPI interface, etc. No change at all...

Thus, we now suspect a chipset bug in the Intel X58 northbridge rather than an ATI driver bug. I've read about similar transfer-related issues with the X58 chipset on a German site: http://www.planet3dnow.de/vbulletin/showthread.php?t=364174

We will clarify this issue on Monday, when we replace the HD5870 with a GTX 270... Stay tuned...

Joern

 

xero
Journeyman III

Hi Joern,

Thanks for the information.

I tried installing an HD4870 on the P55 mainboard. It is as slow as the 5870.

I also installed the 5870 on a P45 mainboard. There, both CPU->GPU and GPU->CPU speeds reach ~5 GB/s.

 

jhoffmann
Journeyman III

Hi xero,

I can confirm good results with an HD5850 on a P43 board / Core2Quad Q6700. The transfer bandwidth peaks at:

CPU -> GPU : ~5 GB/s
GPU -> CPU : ~6 GB/s

So two explanations remain: either there is an ATI driver issue related to the X58, or the problem is inside the X58 chip itself.

I'm curious about the test with the NVIDIA card on Monday, 04-12-2010...

Thank you...

 

 

 

Tzupy
Journeyman III

Hi,

I am also interested in high GPU -> CPU bandwidth, for off-screen rendering of large images.

There seems to be an issue with the X58 chipset and ATI cards, especially the new 5850 / 5870 ones, severely limiting the readback bandwidth.

For comparison, my 4850 with 1GB on i7-920 with 6 GB DDR3-1066C7 and Vista64 HP gets about 1.2 GB/s maximum readback and for larger blocks only 950 MB/s, of course tested with PCIe Speed Test (with glReadPixels I get lower values).

Even with the latest v4.0 beta drivers the problem hasn't been solved for Radeons, but it *may* have been solved for FirePro cards. It's about using hardware DMA; there seems to be a problem getting it to work on X58 systems. Of course, if the problem was solved for FirePro cards, there shouldn't be any *technical* reason for it to be left unsolved for Radeons.

I raised this issue in the OpenGL forums; you can have a look at this thread: http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=275081#Post275081

 

huafeihua116
Journeyman III

Thanks for sharing this info.

jhoffmann
Journeyman III

Hi xero, Tzupy,

we have now tested the system under Windows 7 x86_64 on the X58 board with an NVIDIA GTX 275. The problems are gone.

Under SiSoftware Sandra (using OpenCL), the GTX reaches:

CPU -> GPU : 5.57 GB/s
GPU -> CPU : 5.27 GB/s

With the HD5870, we measured the following with Sandra (using OpenCL, Stream and DirectCompute) and the PCIe SpeedTest (ATI Stream):

CPU-> GPU : ~ 4GB/s
GPU-> CPU : ~ 450 MB/s

So we can now say that we hit a driver bug related to the X58: the NVIDIA card works fine on the same X58 board, and the problem doesn't show up with a P45 chipset or on an AMD mainboard.

Maybe the problem is a regression, because there was already a problem with X58 mainboards in 11/2008 (ATI Catalyst X58 hotfix): http://ht4u.net/news/2655_neue_geforce-treiber_von_nvidia_und_ati-catalyst-hotfix_fuer_x58/

Now it's time for ATI to react.

Joern

 

 

xero
Journeyman III

thx Joern.

Have you sent a message to AMD yet?

jhoffmann
Journeyman III

Hi xero,

no. The reason is that I don't know how. Is there a bug tracking system or a hotline?
