Hi guys,
we have measured a PCIe performance problem when executing CPU->GPU and, even more severely, GPU->CPU transfers. The problem shows up with ATI's PCIeSpeedTest PowerToy (CAL), with NVIDIA's oclBandwidthTest (OpenCL) and with our own benchmark (OpenCL). See below.
We think that this is a driver bug, because the hardware link is set up properly as PCIe x16, 5GT/s (checked with lspci -vv).
Maybe someone has an idea how we can fix this?
Regards
Joern Hoffmann
University of Leipzig
Computer Engineering Group
Hardware: 20 PCs, each with a HD5870, Core i7 950 and 12GB DDR, running on an Asus P6T SE board.
Software: openSUSE 11.2, Linux 2.6.31.12, glibc-2.1, Xorg 7.4-35.3, Xserver 1.6.5
Driver : fglrx 8.712 (10.3), also tested: 8.712.3.1 (10.3 OGL4 preview)
Measurement (1): PCIe SpeedTest v0.2 on HD5870
------------
Peak CPU->GPU Bandwidth = 4.324 GB/sec [data size = 4194304 bytes]
Peak GPU->CPU Bandwidth = 655.360 MB/sec [data size = 32768 bytes]
-> Arghhh, peak at 650 MB/sec!
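For comparison, here is a rough sketch of the theoretical ceiling of this link. It assumes PCIe 2.0 8b/10b line encoding, ignores packet (TLP) overhead, and assumes the tool reports decimal units (1 GB = 10^9 bytes); the function name is only for illustration.

```python
def pcie_raw_bandwidth(lanes, gt_per_s, encoding=8 / 10):
    """Per-direction ceiling in bytes/s, before any TLP/protocol overhead."""
    bits_per_s = lanes * gt_per_s * 1e9 * encoding  # usable bits after 8b/10b
    return bits_per_s / 8

# x16 at 5GT/s, as reported by lspci -vv
ceiling = pcie_raw_bandwidth(lanes=16, gt_per_s=5.0)

# Peak values reported by PCIe SpeedTest (assumed decimal units)
measured = {"CPU->GPU": 4.324e9, "GPU->CPU": 655.360e6}

for direction, rate in measured.items():
    share = 100 * rate / ceiling
    print(f"{direction}: {rate / 1e6:7.1f} MB/s = {share:4.1f}% of {ceiling / 1e9:.1f} GB/s")
```

Real transfers never reach this ceiling because of packet overhead, so the percentages are only a rough indication of how far the GPU->CPU direction falls behind.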
Measurement (2a): oclBandwidthTest on HD5870
-------------
Host to Device Bandwidth, 1 Device(s), Paged memory, direct access
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1503.7
Device to Host Bandwidth, 1 Device(s), Paged memory, direct access
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1042.7
Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 106887.6
Measurement (2b): oclBandwidthTest on NVIDIA 9800GT
-------------
Host to Device Bandwidth, 1 Device(s), Paged memory, direct access
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2280.9
Device to Host Bandwidth, 1 Device(s), Paged memory, direct access
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1723.5
Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 49929.0
Measurement (3a): transfer of 8192 float numbers (32 KB) on HD5870
-------------
OpenCL buffer transfer time
submission-to-start : 440529 ns
execution time : 29420 ns
Measurement (3b): transfer of 8192 float numbers (32 KB) on NVIDIA 9800GT
-------------
OpenCL buffer transfer time
submission-to-start : 44608 ns
execution time : 15712 ns
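To make the timings in (3a) and (3b) comparable with the bandwidth figures, they can be converted into effective rates. This is a minimal sketch that assumes 4-byte floats and treats "submission-to-start" plus "execution time" as the total cost of one transfer; the helper name is hypothetical.

```python
FLOAT_BYTES = 4                 # assumption: 32-bit floats
size = 8192 * FLOAT_BYTES       # 32768 bytes per transfer

# (submission-to-start ns, execution ns) copied from measurements (3a)/(3b)
timings = {"HD5870": (440529, 29420), "9800GT": (44608, 15712)}

def mb_per_s(nbytes, ns):
    """bytes per nanosecond is GB/s; scale to decimal MB/s."""
    return nbytes / ns * 1e3

results = {}
for gpu, (wait_ns, exec_ns) in timings.items():
    pure = mb_per_s(size, exec_ns)                  # transfer only
    end_to_end = mb_per_s(size, wait_ns + exec_ns)  # including queue latency
    results[gpu] = (pure, end_to_end)
    print(f"{gpu}: {pure:7.1f} MB/s transfer only, {end_to_end:6.1f} MB/s incl. latency")
```

For such a small buffer the queue latency dominates: on the HD5870 the submission-to-start time is roughly ten times that of the 9800GT, which hurts far more than the difference in execution time.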
lspci -vv:
----------
00:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 3 (rev 13) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR+ <PERR- INTx-
Latency: 0, Cache Line Size: 256 bytes
Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
I/O behind bridge: 0000b000-0000bfff
Memory behind bridge: fbb00000-fbbfffff
Prefetchable memory behind bridge: 00000000d0000000-00000000dfffffff
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
BridgeCtl: Parity- SERR+ NoISA- VGA+ MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [40] Subsystem: ASUSTeK Computer Inc. Device 836b
Capabilities: [60] MSI: Enable+ Count=1/2 Maskable+ 64bit-
Address: fee002b8 Data: 0000
Masking: 00000003 Pending: 00000000
Capabilities: [90] Express (v2) Root Port (Slot+), MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
ExtTag+ RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 128 bytes, MaxReadReq 128 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Latency L0 <512ns, L1 <64us
ClockPM- Surprise+ LLActRep+ BwNot+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt+
SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surpise-
Slot # 2, PowerLimit 75.000000; Interlock- NoCompl-
SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
Control: AttnInd Off, PwrInd Off, Power- Interlock-
SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
Changed: MRL- PresDet+ LinkState+
RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
RootCap: CRSVisible-
RootSta: PME ReqID 0000, PMEStatus- PMEPending-
DevCap2: Completion Timeout: Range BCD, TimeoutDis+ ARIFwd+
DevCtl2: Completion Timeout: 260ms to 900ms, TimeoutDis- ARIFwd-
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB
Capabilities: [e0] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
Capabilities: [150] Access Control Services
ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
Capabilities: [160] Vendor Specific Information <?>
Kernel driver in use: pcieport-driver
02:00.0 VGA compatible controller: ATI Technologies Inc Device 6898 (prog-if 00 [VGA controller])
Subsystem: ATI Technologies Inc Device 0b00
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR+ <PERR- INTx-
Latency: 0, Cache Line Size: 256 bytes
Interrupt: pin A routed to IRQ 59
Region 0: Memory at d0000000 (64-bit, prefetchable) [size=256M]
Region 2: Memory at fbbc0000 (64-bit, non-prefetchable) [size=128K]
Region 4: I/O ports at b000 [size=256]
Expansion ROM at fbba0000 [disabled] [size=128K]
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Latency L0 <64ns, L1 <1us
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Not Supported, TimeoutDis-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB
Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee00498 Data: 0000
Capabilities: [100] Vendor Specific Information <?>
Capabilities: [150] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 14, GenCap+ CGenEn- ChkCap+ ChkEn-
Kernel driver in use: fglrx_pci
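The link check we did by eye can also be scripted for all 20 machines. The following is only a sketch: it assumes the `lspci -vv` text of a single device is already available as a string, and it relies on the `LnkCap:`/`LnkSta:` line layout shown above. The function name is made up for illustration.

```python
import re

# Matches e.g. "LnkSta: Speed 5GT/s, Width x16, ..." (but not "LnkSta2:")
LINK_RE = re.compile(r"(LnkCap|LnkSta):.*?Speed ([\d.]+)GT/s, Width x(\d+)")

def link_state(lspci_text):
    """Return {'LnkCap': (speed, width), 'LnkSta': (speed, width)} for one device."""
    found = {}
    for kind, speed, width in LINK_RE.findall(lspci_text):
        found[kind] = (float(speed), int(width))
    return found

# Two lines copied from the HD5870 block above
sample = """\
LnkCap: Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Latency L0 <64ns, L1 <1us
LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
"""

state = link_state(sample)
print(state, "-> negotiated link matches capability:", state["LnkCap"] == state["LnkSta"])
```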
PCIe SpeedTest v0.2 full output:
----------
===> Testing device 0 <===
Device type: Unknown
Max resource 2D width/height: 16384/16384
Total GPU memory size: 1024 MB
Total CPU cached space size: 508 MB
Total CPU uncached space size: 1279 MB
GPU engine clock: 0 MHz
GPU memory clock: 0 MHz
Number of timing loops: 100
[ 16 bytes] CPU->GPU= 1.600 MB/sec, GPU->CPU= 800.000 KB/sec
[ 32 bytes] CPU->GPU= 1.600 MB/sec, GPU->CPU= 533.333 KB/sec
[ 64 bytes] CPU->GPU= 2.133 MB/sec, GPU->CPU= 2.133 MB/sec
[ 128 bytes] CPU->GPU= 4.267 MB/sec, GPU->CPU= 4.267 MB/sec
[ 256 bytes] CPU->GPU= 8.533 MB/sec, GPU->CPU= 8.533 MB/sec
[ 512 bytes] CPU->GPU= 17.067 MB/sec, GPU->CPU= 25.600 MB/sec
[ 1024 bytes] CPU->GPU= 17.067 MB/sec, GPU->CPU= 34.133 MB/sec
[ 2048 bytes] CPU->GPU= 68.267 MB/sec, GPU->CPU= 68.267 MB/sec
[ 4096 bytes] CPU->GPU= 136.533 MB/sec, GPU->CPU= 204.800 MB/sec
[ 8192 bytes] CPU->GPU= 409.600 MB/sec, GPU->CPU= 273.067 MB/sec
[ 16384 bytes] CPU->GPU= 546.133 MB/sec, GPU->CPU= 546.133 MB/sec
[ 32768 bytes] CPU->GPU= 1.092 GB/sec, GPU->CPU= 655.360 MB/sec
[ 65536 bytes] CPU->GPU= 2.185 GB/sec, GPU->CPU= 595.782 MB/sec
[ 131072 bytes] CPU->GPU= 3.277 GB/sec, GPU->CPU= 524.288 MB/sec
[ 262144 bytes] CPU->GPU= 3.277 GB/sec, GPU->CPU= 485.452 MB/sec
[ 524288 bytes] CPU->GPU= 3.745 GB/sec, GPU->CPU= 472.332 MB/sec
[ 1048576 bytes] CPU->GPU= 4.194 GB/sec, GPU->CPU= 459.902 MB/sec
[ 2097152 bytes] CPU->GPU= 4.280 GB/sec, GPU->CPU= 449.069 MB/sec
[ 4194304 bytes] CPU->GPU= 4.324 GB/sec, GPU->CPU= 442.904 MB/sec
[ 8388608 bytes] CPU->GPU= 4.280 GB/sec, GPU->CPU= 438.964 MB/sec
[ 16777216 bytes] CPU->GPU= 4.258 GB/sec, GPU->CPU= 437.476 MB/sec
[ 33554432 bytes] CPU->GPU= 4.052 GB/sec, GPU->CPU= 443.607 MB/sec
[ 67108864 bytes] CPU->GPU= 4.090 GB/sec, GPU->CPU= 452.826 MB/sec
[ 134217728 bytes] CPU->GPU= 4.108 GB/sec, GPU->CPU= 468.212 MB/sec
[ 268435456 bytes] CPU->GPU= 4.136 GB/sec, GPU->CPU= 492.307 MB/sec
[ 536870912 bytes] CPU->GPU= 4.211 GB/sec, GPU->CPU= 496.065 MB/sec
calResAllocLocal2D() returned an error when trying to allocate 1073741824 bytes!
Peak CPU->GPU Bandwidth = 4.324 GB/sec [data size = 4194304 bytes]
Peak GPU->CPU Bandwidth = 655.360 MB/sec [data size = 32768 bytes]
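In case someone wants to compare numbers from their own machine, the PCIeSpeedTest output above can be parsed with a few lines of Python. This is a sketch under the assumption that the tool uses decimal units (KB = 10^3, MB = 10^6, GB = 10^9 bytes); the regular expression and function name are our own and only match the row layout shown in this post.

```python
import re

UNIT = {"KB": 1e3, "MB": 1e6, "GB": 1e9}  # assumption: decimal units

ROW_RE = re.compile(
    r"\[\s*(\d+) bytes\]\s*CPU->GPU=\s*([\d.]+) (KB|MB|GB)/sec,"
    r"\s*GPU->CPU=\s*([\d.]+) (KB|MB|GB)/sec"
)

def parse_rows(text):
    """Return a list of (size_bytes, cpu_to_gpu_Bps, gpu_to_cpu_Bps) tuples."""
    rows = []
    for size, up, up_unit, down, down_unit in ROW_RE.findall(text):
        rows.append((int(size), float(up) * UNIT[up_unit], float(down) * UNIT[down_unit]))
    return rows

# Three rows copied from the output; works whether or not rows are on separate lines
sample = (
    "[ 16384 bytes] CPU->GPU= 546.133 MB/sec, GPU->CPU= 546.133 MB/sec "
    "[ 32768 bytes] CPU->GPU= 1.092 GB/sec, GPU->CPU= 655.360 MB/sec "
    "[ 65536 bytes] CPU->GPU= 2.185 GB/sec, GPU->CPU= 595.782 MB/sec"
)

rows = parse_rows(sample)
peak_size, _, peak_down = max(rows, key=lambda r: r[2])
print(f"GPU->CPU peaks at {peak_down / 1e6:.2f} MB/s for {peak_size} bytes")
```

Run over the full output, this makes it easy to see that GPU->CPU bandwidth peaks at 32768 bytes and then drops for larger transfers, while CPU->GPU keeps rising.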