Hi guys,
we have measured a PCIe performance penalty on CPU->GPU transfers and an even more severe one on GPU->CPU transfers. We observed it with ATI's PCIeSpeedTest PowerToy (CAL), with NVIDIA's oclBandwidthTest (OpenCL) and with our own benchmark (OpenCL). See below.
We think that this is a driver bug, because the hardware link is set up properly as PCIe x16, 5 GT/s (checked with lspci -vv).
Maybe someone has an idea how we can fix this?
Regards
Joern Hoffmann
University of Leipzig
Computer Engineering Group
Hardware: 20 PCs, each with an HD5870, a Core i7 950 and 12GB DDR, running on an Asus P6T SE board.
Software: OpenSuse 11.2, Linux 2.6.31.12, glibc-2.1, Xorg 7.4-35.3, Xserver 1.6.5
Driver: fglrx 8.712 (10.3), also tested: 8.712.3.1 (10.3 OGL4 preview)
Measure (1): PCIe SpeedTest v0.2 on HD5870
------------
Peak CPU->GPU Bandwidth = 4.324 GB/sec [data size = 4194304 bytes]
Peak GPU->CPU Bandwidth = 655.360 MB/sec [data size = 32768 bytes]
-> Arghhh, peak at 650 MB/sec!
Measure (2a): oclBandWidthTest on HD5870
-------------
Host to Device Bandwidth, 1 Device(s), Paged memory, direct access
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1503.7
Device to Host Bandwidth, 1 Device(s), Paged memory, direct access
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1042.7
Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 106887.6
Measure (2b): oclBandWidthTest on NVidia 9800GT
-------------
Host to Device Bandwidth, 1 Device(s), Paged memory, direct access
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2280.9
Device to Host Bandwidth, 1 Device(s), Paged memory, direct access
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1723.5
Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 49929.0
Measure (3a): transfer of 8192 float numbers (32 KB) on HD5870
-------------
OpenCL buffer transfer time
submission-to-start : 440529 ns
execution time : 29420 ns
Measure (3b): transfer of 8192 float numbers (32 KB) on NVidia 9800GT
-------------
OpenCL buffer transfer time
submission-to-start : 44608 ns
execution time : 15712 ns
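To put the numbers from (3a) and (3b) side by side, here is a small sketch that converts the two timings into an effective bandwidth. It assumes that "submission-to-start" is the gap between the OpenCL profiling counters CL_PROFILING_COMMAND_SUBMIT and CL_PROFILING_COMMAND_START and that "execution time" is START to END; the post does not say how the values were taken, so treat that mapping and the helper names as assumptions.

```python
# Effective bandwidth of the 32 KB transfer in Measure (3a)/(3b).
# ASSUMPTION: timings come from OpenCL event profiling
# (SUBMIT -> START and START -> END); the post does not state this.

BYTES = 8192 * 4  # 8192 floats of 4 bytes each = 32768 bytes

# Values copied from the measurements above (nanoseconds).
runs = {
    "HD5870": {"submit_to_start_ns": 440529, "exec_ns": 29420},
    "9800GT": {"submit_to_start_ns": 44608, "exec_ns": 15712},
}


def mb_per_s(nbytes, ns):
    """Bandwidth in decimal MB/s for nbytes moved in ns nanoseconds."""
    return nbytes / (ns * 1e-9) / 1e6


results = {}
for name, t in runs.items():
    results[name] = {
        # bandwidth while the copy is actually running
        "wire": mb_per_s(BYTES, t["exec_ns"]),
        # bandwidth seen by the host, including the start latency
        "total": mb_per_s(BYTES, t["submit_to_start_ns"] + t["exec_ns"]),
    }
    print(f"{name}: copy only {results[name]['wire']:.1f} MB/s, "
          f"with start latency {results[name]['total']:.1f} MB/s")

latency_ratio = (runs["HD5870"]["submit_to_start_ns"]
                 / runs["9800GT"]["submit_to_start_ns"])
print(f"submission-to-start latency, HD5870 vs 9800GT: {latency_ratio:.1f}x")
```

For a transfer this small the start latency dominates on both cards, and it is almost ten times longer on the HD5870, so the copy itself is only part of the story.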
lspci -vv:
----------
00:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 3 (rev 13) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR+ <PERR- INTx-
Latency: 0, Cache Line Size: 256 bytes
Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
I/O behind bridge: 0000b000-0000bfff
Memory behind bridge: fbb00000-fbbfffff
Prefetchable memory behind bridge: 00000000d0000000-00000000dfffffff
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
BridgeCtl: Parity- SERR+ NoISA- VGA+ MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [40] Subsystem: ASUSTeK Computer Inc. Device 836b
Capabilities: [60] MSI: Enable+ Count=1/2 Maskable+ 64bit-
Address: fee002b8 Data: 0000
Masking: 00000003 Pending: 00000000
Capabilities: [90] Express (v2) Root Port (Slot+), MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
ExtTag+ RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 128 bytes, MaxReadReq 128 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Latency L0 <512ns, L1 <64us
ClockPM- Surprise+ LLActRep+ BwNot+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt+
SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surpise-
Slot # 2, PowerLimit 75.000000; Interlock- NoCompl-
SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
Control: AttnInd Off, PwrInd Off, Power- Interlock-
SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
Changed: MRL- PresDet+ LinkState+
RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
RootCap: CRSVisible-
RootSta: PME ReqID 0000, PMEStatus- PMEPending-
DevCap2: Completion Timeout: Range BCD, TimeoutDis+ ARIFwd+
DevCtl2: Completion Timeout: 260ms to 900ms, TimeoutDis- ARIFwd-
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB
Capabilities: [e0] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
Capabilities: [150] Access Control Services
ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
Capabilities: [160] Vendor Specific Information <?>
Kernel driver in use: pcieport-driver
02:00.0 VGA compatible controller: ATI Technologies Inc Device 6898 (prog-if 00 [VGA controller])
Subsystem: ATI Technologies Inc Device 0b00
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR+ <PERR- INTx-
Latency: 0, Cache Line Size: 256 bytes
Interrupt: pin A routed to IRQ 59
Region 0: Memory at d0000000 (64-bit, prefetchable) [size=256M]
Region 2: Memory at fbbc0000 (64-bit, non-prefetchable) [size=128K]
Region 4: I/O ports at b000 [size=256]
Expansion ROM at fbba0000 [disabled] [size=128K]
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Latency L0 <64ns, L1 <1us
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Not Supported, TimeoutDis-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB
Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee00498 Data: 0000
Capabilities: [100] Vendor Specific Information <?>
Capabilities: [150] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 14, GenCap+ CGenEn- ChkCap+ ChkEn-
Kernel driver in use: fglrx_pci
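For reference, the LnkSta and DevCtl fields above can be turned into a rough upper bound for the link. This is only a back-of-the-envelope sketch: the 8b/10b coding factor is standard for a 5 GT/s link, but the per-packet overhead of 24 bytes is an assumed approximation (framing, sequence number, header, LCRC), and flow-control and ACK traffic are ignored.

```python
# Rough one-direction ceiling for the link reported by lspci.
GT_PER_S = 5.0e9    # LnkSta: Speed 5GT/s
LANES = 16          # LnkSta: Width x16
ENCODING = 8 / 10   # 8b/10b line coding used at 2.5 and 5 GT/s
MAX_PAYLOAD = 128   # DevCtl: MaxPayload 128 bytes
TLP_OVERHEAD = 24   # ASSUMPTION: approximate bytes of overhead per packet


def raw_bandwidth(gt_per_s, lanes, encoding):
    """Link rate in bytes/s after line coding, before packet overhead."""
    return gt_per_s * lanes * encoding / 8


def payload_efficiency(max_payload, overhead):
    """Fraction of each packet that is payload."""
    return max_payload / (max_payload + overhead)


raw = raw_bandwidth(GT_PER_S, LANES, ENCODING)
usable = raw * payload_efficiency(MAX_PAYLOAD, TLP_OVERHEAD)

# Peaks reported by PCIe SpeedTest in Measure (1), in bytes/s.
measured = {"CPU->GPU": 4.324e9, "GPU->CPU": 655.360e6}

print(f"raw link rate : {raw / 1e9:.2f} GB/s")
print(f"payload ceiling: {usable / 1e9:.2f} GB/s (assumed overhead)")
for direction, bw in measured.items():
    print(f"{direction}: {bw / 1e9:.3f} GB/s = {100 * bw / usable:.0f}% of ceiling")
```

Under these assumptions the CPU->GPU peak lands at roughly two thirds of the ceiling, while the GPU->CPU peak stays around a tenth of it, even though the link is trained the same way in both directions.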
===> Testing device 0 <===
Device type: Unknown
Max resource 2D width/height: 16384/16384
Total GPU memory size: 1024 MB
Total CPU cached space size: 508 MB
Total CPU uncached space size: 1279 MB
GPU engine clock: 0 MHz
GPU memory clock: 0 MHz
Number of timing loops: 100
[ 16 bytes] CPU->GPU= 1.600 MB/sec, GPU->CPU= 800.000 KB/sec
[ 32 bytes] CPU->GPU= 1.600 MB/sec, GPU->CPU= 533.333 KB/sec
[ 64 bytes] CPU->GPU= 2.133 MB/sec, GPU->CPU= 2.133 MB/sec
[ 128 bytes] CPU->GPU= 4.267 MB/sec, GPU->CPU= 4.267 MB/sec
[ 256 bytes] CPU->GPU= 8.533 MB/sec, GPU->CPU= 8.533 MB/sec
[ 512 bytes] CPU->GPU= 17.067 MB/sec, GPU->CPU= 25.600 MB/sec
[ 1024 bytes] CPU->GPU= 17.067 MB/sec, GPU->CPU= 34.133 MB/sec
[ 2048 bytes] CPU->GPU= 68.267 MB/sec, GPU->CPU= 68.267 MB/sec
[ 4096 bytes] CPU->GPU= 136.533 MB/sec, GPU->CPU= 204.800 MB/sec
[ 8192 bytes] CPU->GPU= 409.600 MB/sec, GPU->CPU= 273.067 MB/sec
[ 16384 bytes] CPU->GPU= 546.133 MB/sec, GPU->CPU= 546.133 MB/sec
[ 32768 bytes] CPU->GPU= 1.092 GB/sec, GPU->CPU= 655.360 MB/sec
[ 65536 bytes] CPU->GPU= 2.185 GB/sec, GPU->CPU= 595.782 MB/sec
[ 131072 bytes] CPU->GPU= 3.277 GB/sec, GPU->CPU= 524.288 MB/sec
[ 262144 bytes] CPU->GPU= 3.277 GB/sec, GPU->CPU= 485.452 MB/sec
[ 524288 bytes] CPU->GPU= 3.745 GB/sec, GPU->CPU= 472.332 MB/sec
[ 1048576 bytes] CPU->GPU= 4.194 GB/sec, GPU->CPU= 459.902 MB/sec
[ 2097152 bytes] CPU->GPU= 4.280 GB/sec, GPU->CPU= 449.069 MB/sec
[ 4194304 bytes] CPU->GPU= 4.324 GB/sec, GPU->CPU= 442.904 MB/sec
[ 8388608 bytes] CPU->GPU= 4.280 GB/sec, GPU->CPU= 438.964 MB/sec
[ 16777216 bytes] CPU->GPU= 4.258 GB/sec, GPU->CPU= 437.476 MB/sec
[ 33554432 bytes] CPU->GPU= 4.052 GB/sec, GPU->CPU= 443.607 MB/sec
[ 67108864 bytes] CPU->GPU= 4.090 GB/sec, GPU->CPU= 452.826 MB/sec
[ 134217728 bytes] CPU->GPU= 4.108 GB/sec, GPU->CPU= 468.212 MB/sec
[ 268435456 bytes] CPU->GPU= 4.136 GB/sec, GPU->CPU= 492.307 MB/sec
[ 536870912 bytes] CPU->GPU= 4.211 GB/sec, GPU->CPU= 496.065 MB/sec
calResAllocLocal2D() returned an error when trying to allocate 1073741824 bytes!
Peak CPU->GPU Bandwidth = 4.324 GB/sec [data size = 4194304 bytes]
Peak GPU->CPU Bandwidth = 655.360 MB/sec [data size = 32768 bytes]
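If anyone wants to compare runs across driver versions, the PCIe SpeedTest rows above are easy to normalise to one unit. The following is a minimal sketch, not part of the tool: the regular expression, the `parse` helper and the decimal unit factors (1 MB = 10^6 bytes) are assumptions, and only four of the rows above are embedded as sample input.

```python
import re

# ASSUMPTION: the tool reports decimal units (1 MB = 10**6 bytes).
UNIT = {"KB": 1e3, "MB": 1e6, "GB": 1e9}

# Matches one result row, whether the rows are on separate lines or not.
ROW = re.compile(
    r"\[\s*(\d+) bytes\]\s*CPU->GPU=\s*([\d.]+) (KB|MB|GB)/sec,\s*"
    r"GPU->CPU=\s*([\d.]+) (KB|MB|GB)/sec"
)


def parse(text):
    """Return a list of (size_bytes, cpu_to_gpu_Bps, gpu_to_cpu_Bps)."""
    rows = []
    for size, up, up_unit, down, down_unit in ROW.findall(text):
        rows.append((int(size),
                     float(up) * UNIT[up_unit],
                     float(down) * UNIT[down_unit]))
    return rows


# A few rows copied from the output above.
SAMPLE = """
[ 16 bytes] CPU->GPU= 1.600 MB/sec, GPU->CPU= 800.000 KB/sec
[ 32768 bytes] CPU->GPU= 1.092 GB/sec, GPU->CPU= 655.360 MB/sec
[ 4194304 bytes] CPU->GPU= 4.324 GB/sec, GPU->CPU= 442.904 MB/sec
[ 536870912 bytes] CPU->GPU= 4.211 GB/sec, GPU->CPU= 496.065 MB/sec
"""

rows = parse(SAMPLE)
peak_up = max(rows, key=lambda r: r[1])
peak_down = max(rows, key=lambda r: r[2])
print(f"peak CPU->GPU at {peak_up[0]} bytes: {peak_up[1] / 1e6:.1f} MB/s")
print(f"peak GPU->CPU at {peak_down[0]} bytes: {peak_down[2] / 1e6:.1f} MB/s")

# Asymmetry between the two directions for each transfer size.
for size, up, down in rows:
    print(f"{size:>10} bytes: CPU->GPU / GPU->CPU = {up / down:.2f}")
```

Feeding it the complete output instead of `SAMPLE` gives the full curve; the ratio column makes it easy to see where readback falls behind.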
Hi Tzupy,
With Catalyst 10.8 (Linux 64-bit) there was some kind of regression. In the current 10.9 release the problem is fixed again...
regards,
joern
FWIW, the 10.9 and 10.10 drivers brought no improvement for me.
I'm scratching my head over what to do: live with this, or buy a GTS 250. Those seem to have the X58 readback issue fixed.
If AMD had Bulldozer on sale now and I were guaranteed that a 6870 would get >= 4GB/s readback with it, then I would buy them.
But for now the i7 dominates in multithreaded software, so I'm not considering switching to a Phenom X6 just to get full readback.
Hey guys, was this issue ever resolved?
Not the same but somehow related.
With a 6870, after I upgraded to Catalyst 11.2, OpenCL's clEnqueueMapBuffer() (blocking) became almost twice as slow as with 11.1.
This cannot be reproduced on a 5970.
Originally posted by: gat3way
Not the same but somehow related.
With a 6870, after I upgraded to Catalyst 11.2, OpenCL's clEnqueueMapBuffer() (blocking) became almost twice as slow as with 11.1.
This cannot be reproduced on a 5970.
Please paste your code here so that we can look into the issue.
Same bandwidth problem with my new HD 6900 on my GA-P55A-UD4. The GPU is plugged into the first PCIe x16 slot, which is running in x16 mode because the second slot is empty.