jhoffmann
Journeyman III

PCIe Performance Problem with HD5870

Catalyst 10.3, Linux 2.6.31.12 (x86_64)

Hi guys,

we have measured a PCIe performance problem when executing CPU->GPU and, even worse, GPU->CPU transfers. The problem shows up with ATI's PCIeSpeedTest PowerToy (CAL), with NVIDIA's oclBandwidthTest (OpenCL), and with our own benchmark (OpenCL). See below.

We think that this is a driver bug, because the hardware link is set up properly as PCIe x16, 5 GT/s (checked with lspci -vv).
Maybe someone has an idea how we can fix this?

Regards
Joern Hoffmann
University of Leipzig
Computer Engineering Group


Hardware: 20 PCs, each with an HD5870, Core i7 950, and 12 GB DDR, running on an Asus P6T SE board.
Software: OpenSuse 11.2, Linux 2.6.31.12, glibc-2.1, Xorg 7.4-35.3, Xserver 1.6.5
Driver  : fglrx 8.712 (10.3), also tested: 8.712.3.1 (10.3 OGL4 preview)


Measure (1): PCIe SpeedTest v0.2 on HD5870
------------
Peak CPU->GPU Bandwidth =   4.324 GB/sec [data size = 4194304 bytes]
Peak GPU->CPU Bandwidth = 655.360 MB/sec [data size = 32768 bytes]

-> Arghhh, peak at 650 MB/sec!


Measure (2a): oclBandWidthTest on HD5870
-------------
 Host to Device Bandwidth, 1 Device(s), Paged memory, direct access
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432            1503.7

 Device to Host Bandwidth, 1 Device(s), Paged memory, direct access
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432            1042.7

 Device to Device Bandwidth, 1 Device(s)
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432            106887.6


Measure (2b): oclBandWidthTest on NVidia 9800GT
-------------
Host to Device Bandwidth, 1 Device(s), Paged memory, direct access
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432            2280.9

 Device to Host Bandwidth, 1 Device(s), Paged memory, direct access
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432            1723.5

 Device to Device Bandwidth, 1 Device(s)
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432            49929.0


Measure (3a): transfer of 8192 float numbers (32 KB) on HD5870
-------------
OpenCL buffer transfer time
  submission-to-start  : 440529 ns
  execution time       :  29420 ns

Measure (3b): transfer of 8192 float numbers (32 KB) on NVidia 9800GT
-------------
OpenCL buffer transfer time
  submission-to-start  :  44608 ns
  execution time       :  15712 ns
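
To put the Measure (3) timings in perspective, the following sketch converts them into effective bandwidth. It assumes that "execution time" covers only the copy itself and that "submission-to-start" is queueing latency in front of it; the function and dictionary names are made up for illustration.

```python
# Convert the Measure (3) timings into effective bandwidth.
# Assumption: 8192 floats * 4 bytes = 32768 bytes per transfer.
TRANSFER_BYTES = 8192 * 4


def bandwidth_mb_per_s(num_bytes, nanoseconds):
    """Return bandwidth in MB/s (1 MB = 10**6 bytes)."""
    return num_bytes / (nanoseconds * 1e-9) / 1e6


# Timings copied from Measure (3a) and (3b), in nanoseconds.
timings_ns = {
    "HD5870": {"submit_to_start": 440529, "execution": 29420},
    "9800GT": {"submit_to_start": 44608, "execution": 15712},
}

for card, t in timings_ns.items():
    copy_only = bandwidth_mb_per_s(TRANSFER_BYTES, t["execution"])
    end_to_end = bandwidth_mb_per_s(
        TRANSFER_BYTES, t["submit_to_start"] + t["execution"]
    )
    print(f"{card}: copy only {copy_only:.0f} MB/s, "
          f"including latency {end_to_end:.0f} MB/s")
```

Under these assumptions the copy itself is not dramatically slower on the HD5870; for a transfer this small, the roughly tenfold longer submission-to-start delay dominates the end-to-end rate.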


lspci -vv:
----------
00:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 3 (rev 13) (prog-if 00 [Normal decode])
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR+ <PERR- INTx-
    Latency: 0, Cache Line Size: 256 bytes
    Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
    I/O behind bridge: 0000b000-0000bfff
    Memory behind bridge: fbb00000-fbbfffff
    Prefetchable memory behind bridge: 00000000d0000000-00000000dfffffff
    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
    BridgeCtl: Parity- SERR+ NoISA- VGA+ MAbort- >Reset- FastB2B-
        PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
    Capabilities: [40] Subsystem: ASUSTeK Computer Inc. Device 836b
    Capabilities: [60] MSI: Enable+ Count=1/2 Maskable+ 64bit-
        Address: fee002b8  Data: 0000
        Masking: 00000003  Pending: 00000000
    Capabilities: [90] Express (v2) Root Port (Slot+), MSI 00
        DevCap:    MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
            ExtTag+ RBE+ FLReset-
        DevCtl:    Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
            MaxPayload 128 bytes, MaxReadReq 128 bytes
        DevSta:    CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap:    Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Latency L0 <512ns, L1 <64us
            ClockPM- Surprise+ LLActRep+ BwNot+
        LnkCtl:    ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta:    Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt+
        SltCap:    AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surpise-
            Slot #  2, PowerLimit 75.000000; Interlock- NoCompl-
        SltCtl:    Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
            Control: AttnInd Off, PwrInd Off, Power- Interlock-
        SltSta:    Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
            Changed: MRL- PresDet+ LinkState+
        RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
        RootCap: CRSVisible-
        RootSta: PME ReqID 0000, PMEStatus- PMEPending-
        DevCap2: Completion Timeout: Range BCD, TimeoutDis+ ARIFwd+
        DevCtl2: Completion Timeout: 260ms to 900ms, TimeoutDis- ARIFwd-
        LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB
    Capabilities: [e0] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [100] Advanced Error Reporting
        UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt:    DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
        CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        AERCap:    First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
    Capabilities: [150] Access Control Services
        ACSCap:    SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
        ACSCtl:    SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
    Capabilities: [160] Vendor Specific Information <?>
    Kernel driver in use: pcieport-driver


02:00.0 VGA compatible controller: ATI Technologies Inc Device 6898 (prog-if 00 [VGA controller])
    Subsystem: ATI Technologies Inc Device 0b00
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR+ <PERR- INTx-
    Latency: 0, Cache Line Size: 256 bytes
    Interrupt: pin A routed to IRQ 59
    Region 0: Memory at d0000000 (64-bit, prefetchable) [size=256M]
    Region 2: Memory at fbbc0000 (64-bit, non-prefetchable) [size=128K]
    Region 4: I/O ports at b000 [size=256]
    Expansion ROM at fbba0000 [disabled] [size=128K]
    Capabilities: [50] Power Management version 3
        Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
        DevCap:    MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
        DevCtl:    Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 128 bytes, MaxReadReq 512 bytes
        DevSta:    CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap:    Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Latency L0 <64ns, L1 <1us
            ClockPM- Surprise- LLActRep- BwNot-
        LnkCtl:    ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta:    Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Not Supported, TimeoutDis-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
        LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB
    Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Address: 00000000fee00498  Data: 0000
    Capabilities: [100] Vendor Specific Information <?>
    Capabilities: [150] Advanced Error Reporting
        UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
        UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt:    DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        AERCap:    First Error Pointer: 14, GenCap+ CGenEn- ChkCap+ ChkEn-
    Kernel driver in use: fglrx_pci
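
With 20 machines to check, the link comparison can be scripted. This is a minimal sketch that compares the negotiated link (LnkSta) with the advertised capability (LnkCap) in `lspci -vv` text; the helper names are hypothetical, and the sample is copied from the 02:00.0 entry above.

```python
import re

# Two lines copied from the lspci -vv dump above (device 02:00.0).
LSPCI_SNIPPET = """
        LnkCap:    Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Latency L0 <64ns, L1 <1us
        LnkSta:    Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
"""

# Matches only the LnkCap and LnkSta lines (not LnkCtl2 or LnkSta2).
LINK_RE = re.compile(r"(LnkCap|LnkSta):.*?Speed ([\d.]+)GT/s, Width x(\d+)")


def parse_link(text):
    """Return {'LnkCap': (speed_gt_s, width), 'LnkSta': (speed_gt_s, width)}."""
    return {m.group(1): (float(m.group(2)), int(m.group(3)))
            for m in LINK_RE.finditer(text)}


def link_is_degraded(text):
    """True if the negotiated link is slower or narrower than the capability."""
    link = parse_link(text)
    cap_speed, cap_width = link["LnkCap"]
    sta_speed, sta_width = link["LnkSta"]
    return sta_speed < cap_speed or sta_width < cap_width


print(parse_link(LSPCI_SNIPPET))
print("degraded:", link_is_degraded(LSPCI_SNIPPET))
```

For the dump above this reports a full-speed, full-width link, which is why the link training itself does not look like the culprit.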

===> Testing device 0 <===
Device type: Unknown
Max resource 2D width/height: 16384/16384
Total GPU memory size: 1024 MB
Total CPU cached space size: 508 MB
Total CPU uncached space size: 1279 MB
GPU engine clock: 0 MHz
GPU memory clock: 0 MHz
Number of timing loops: 100
[        16 bytes] CPU->GPU=   1.600 MB/sec, GPU->CPU= 800.000 KB/sec
[        32 bytes] CPU->GPU=   1.600 MB/sec, GPU->CPU= 533.333 KB/sec
[        64 bytes] CPU->GPU=   2.133 MB/sec, GPU->CPU=   2.133 MB/sec
[       128 bytes] CPU->GPU=   4.267 MB/sec, GPU->CPU=   4.267 MB/sec
[       256 bytes] CPU->GPU=   8.533 MB/sec, GPU->CPU=   8.533 MB/sec
[       512 bytes] CPU->GPU=  17.067 MB/sec, GPU->CPU=  25.600 MB/sec
[      1024 bytes] CPU->GPU=  17.067 MB/sec, GPU->CPU=  34.133 MB/sec
[      2048 bytes] CPU->GPU=  68.267 MB/sec, GPU->CPU=  68.267 MB/sec
[      4096 bytes] CPU->GPU= 136.533 MB/sec, GPU->CPU= 204.800 MB/sec
[      8192 bytes] CPU->GPU= 409.600 MB/sec, GPU->CPU= 273.067 MB/sec
[     16384 bytes] CPU->GPU= 546.133 MB/sec, GPU->CPU= 546.133 MB/sec
[     32768 bytes] CPU->GPU=   1.092 GB/sec, GPU->CPU= 655.360 MB/sec
[     65536 bytes] CPU->GPU=   2.185 GB/sec, GPU->CPU= 595.782 MB/sec
[    131072 bytes] CPU->GPU=   3.277 GB/sec, GPU->CPU= 524.288 MB/sec
[    262144 bytes] CPU->GPU=   3.277 GB/sec, GPU->CPU= 485.452 MB/sec
[    524288 bytes] CPU->GPU=   3.745 GB/sec, GPU->CPU= 472.332 MB/sec
[   1048576 bytes] CPU->GPU=   4.194 GB/sec, GPU->CPU= 459.902 MB/sec
[   2097152 bytes] CPU->GPU=   4.280 GB/sec, GPU->CPU= 449.069 MB/sec
[   4194304 bytes] CPU->GPU=   4.324 GB/sec, GPU->CPU= 442.904 MB/sec
[   8388608 bytes] CPU->GPU=   4.280 GB/sec, GPU->CPU= 438.964 MB/sec
[  16777216 bytes] CPU->GPU=   4.258 GB/sec, GPU->CPU= 437.476 MB/sec
[  33554432 bytes] CPU->GPU=   4.052 GB/sec, GPU->CPU= 443.607 MB/sec
[  67108864 bytes] CPU->GPU=   4.090 GB/sec, GPU->CPU= 452.826 MB/sec
[ 134217728 bytes] CPU->GPU=   4.108 GB/sec, GPU->CPU= 468.212 MB/sec
[ 268435456 bytes] CPU->GPU=   4.136 GB/sec, GPU->CPU= 492.307 MB/sec
[ 536870912 bytes] CPU->GPU=   4.211 GB/sec, GPU->CPU= 496.065 MB/sec
calResAllocLocal2D() returned an error when trying to allocate 1073741824 bytes!
Peak CPU->GPU Bandwidth =   4.324 GB/sec [data size = 4194304 bytes]
Peak GPU->CPU Bandwidth = 655.360 MB/sec [data size = 32768 bytes]
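
For comparing runs across machines and driver versions, the PCIeSpeedTest rows can be parsed mechanically. The sketch below is only an illustration: the helper names are invented, the sample contains three rows copied from the output above, and the unit table assumes the tool prints decimal units (1 GB = 1000 MB), which is consistent with the step from 546.133 MB/sec to 1.092 GB/sec.

```python
import re

# Assumption: decimal units, i.e. 1 GB/sec is printed for 1000 MB/sec.
UNIT_TO_MB = {"KB": 1e-3, "MB": 1.0, "GB": 1e3}

ROW_RE = re.compile(
    r"\[\s*(\d+) bytes\]\s*CPU->GPU=\s*([\d.]+) (KB|MB|GB)/sec,"
    r"\s*GPU->CPU=\s*([\d.]+) (KB|MB|GB)/sec"
)


def parse_rows(text):
    """Yield (size_bytes, cpu_to_gpu_mb_s, gpu_to_cpu_mb_s) per result row."""
    for m in ROW_RE.finditer(text):
        size = int(m.group(1))
        up = float(m.group(2)) * UNIT_TO_MB[m.group(3)]
        down = float(m.group(4)) * UNIT_TO_MB[m.group(5)]
        yield size, up, down


# Three rows copied from the HD5870 run above.
SAMPLE = """
[ 32768 bytes] CPU->GPU= 1.092 GB/sec, GPU->CPU= 655.360 MB/sec
[ 4194304 bytes] CPU->GPU= 4.324 GB/sec, GPU->CPU= 442.904 MB/sec
[ 536870912 bytes] CPU->GPU= 4.211 GB/sec, GPU->CPU= 496.065 MB/sec
"""

rows = list(parse_rows(SAMPLE))
peak_down = max(rows, key=lambda r: r[2])          # best readback rate
worst_ratio = max(rows, key=lambda r: r[1] / r[2])  # largest upload/readback gap

print("peak GPU->CPU:", peak_down[2], "MB/s at", peak_down[0], "bytes")
print("largest upload/readback ratio:",
      round(worst_ratio[1] / worst_ratio[2], 1), "at", worst_ratio[0], "bytes")
```

The regular expression also accepts the padded form of the rows (for example `[        16 bytes]`), so a complete log can be passed in unchanged.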

0 Likes
45 Replies
xero
Journeyman III

Hi Joern,

I got similar PCIe speed test results. The GPU->CPU direction is very slow.

(CPU: Intel i5 750, MB: Intel P55, GPU: HD5870, OS: Linux 2.6.18 i386, Driver: 10.2)

Have you made any progress on this?

 

 

0 Likes

 xero,

yes, we have made (negative) progress. We have now tested the PCIe transfer rates under Windows 7 x86_64 with PCIe SpeedTest v0.2 and also with SiSoftware Sandra. Unfortunately, the results are the same.

CPU-> GPU is about 4GB/s
GPU-> CPU is about 450 MB/s

We also checked the machine with several other benchmarks (3DMark, SiSoftware Sandra, Unigine Heaven). The figures for host memory bandwidth, GPU performance, CPU speed, etc. look quite good. Only the PCIe transfer rate is poor: especially from GPU to CPU, it reaches only around 6% of its theoretical maximum.
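
As a rough cross-check of the 6% figure: PCIe 2.0 signals at 5 GT/s per lane with 8b/10b encoding, so a x16 link carries 8 GB/s per direction before protocol overhead. A small sketch of the arithmetic (the function name is just for illustration):

```python
# Theoretical PCIe 2.0 bandwidth per direction.
# PCIe 2.0 signals at 5 GT/s per lane and uses 8b/10b encoding,
# so only 8 of every 10 transferred bits carry payload.
GT_PER_S = 5.0e9
ENCODING_EFFICIENCY = 8 / 10


def pcie2_bandwidth_gb_s(lanes):
    """Payload bandwidth in GB/s (10**9 bytes/s), one direction, before protocol overhead."""
    return GT_PER_S * ENCODING_EFFICIENCY * lanes / 8 / 1e9


x16 = pcie2_bandwidth_gb_s(16)
measured_gb_s = 0.450  # GPU->CPU rate reported above

print(f"x16 theoretical: {x16:.1f} GB/s")
print(f"measured share:  {measured_gb_s / x16:.1%}")
```

By the same formula a single lane carries 0.5 GB/s and an x4 link 2 GB/s, so the measured readback rate is below what even one PCIe 2.0 lane could deliver in theory.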

 

To say it explicitly: we use the most recent drivers from ATI (10.3) and Intel (chipset autoinst 911). We have also flashed the BIOS to the latest version (v808).

Furthermore, I played around with the BIOS settings, e.g. disabled the sleep states, manually configured the memory, manually set the QPI interface, etc. No change at all...

Thus, we now suspect a chipset bug within the Intel X58 northbridge rather than an ATI driver bug. I've read about similar transfer-related issues with the X58 chipset on a German site: http://www.planet3dnow.de/vbulletin/showthread.php?t=364174

We will clear this up on Monday, because we want to replace the HD5870 with a GTX 270... Stay tuned...

Joern

 

0 Likes

Hi Joern,

Thanks for the information.

I tried installing an HD4870 on the P55 mainboard. It is as slow as the 5870.

I also installed the 5870 on a P45 mainboard. There the CPU->GPU and GPU->CPU speeds both reach ~5 GB/s.

 

0 Likes

Hi xero,

I can confirm good results with an HD5850 on a P43 board / Core2Quad Q6700. The transfer bandwidth peaks at:

CPU -> GPU : ~5 GB/s
GPU -> CPU : ~6 GB/s

So two explanations are left: either there is an ATI driver issue related to the X58, or the problem is inside the X58 chip itself.

I'm curious about the test with the Nvidia card on Monday, 04-12-2010...

Thank you...

 

 

 

0 Likes

Hi,

I am also interested in high GPU -> CPU bandwidth, for off-screen rendering of large images.

There seems to be an issue with the X58 chipset and ATI cards, especially the new 5850 / 5870 ones, severely limiting the readback bandwidth.

For comparison, my 4850 with 1GB on i7-920 with 6 GB DDR3-1066C7 and Vista64 HP gets about 1.2 GB/s maximum readback and for larger blocks only 950 MB/s, of course tested with PCIe Speed Test (with glReadPixels I get lower values).

Even with the latest v4.0 beta drivers the problem hasn't been solved for Radeons, but it *may* have been solved for FirePro cards. It's about using hardware DMA; there seems to be a problem getting it to work on X58 systems. Of course, if the problem was solved for FirePro cards, there shouldn't be any *technical* reason for it to be left unsolved for Radeons.

I raised this issue in the OpenGL forums; you can have a look at this thread: http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=275081#Post275081

 

0 Likes

Thanks for sharing this info.

0 Likes

Hi xero, Tzupy,

we have now tested the system under Windows 7 x86_64 on the X58 board with an NVidia GTX 275. The problems are gone.

Under SiSoftware Sandra (using OpenCL), the GTX reaches:

CPU -> GPU : 5.57 GB/s
GPU -> CPU : 5.27 GB/s

With the HD5870 we measured with Sandra (using OpenCL, Stream and Direct Compute) and the PCIe SpeedTest (ATI Stream):

CPU-> GPU : ~ 4GB/s
GPU-> CPU : ~ 450 MB/s

So we can now say that we hit a driver bug related to the X58: the NVidia card reaches full speed on the same board, and the problem doesn't show up with the ATI cards on a P45 chipset or an AMD mainboard.

Maybe the problem is a regression, because there was already a problem with X58 mainboards in 11/2008 (ATI Catalyst X58 hotfix): http://ht4u.net/news/2655_neue_geforce-treiber_von_nvidia_und_ati-catalyst-hotfix_fuer_x58/

Now it's time for ATI to react.

Joern

 

 

0 Likes

Thanks, Joern.

Have you sent a message to AMD yet?

0 Likes

Hi xero,

no. The reason is that I don't know how. Is there a bug tracking system or a hotline?

0 Likes

Hi,

Reporting this could be done by sending a pm to an AMD employee, but I believe they already know about it, considering that a post with X58 low readback speed was made a year ago.

The problem doesn't seem to affect only Intel X58 / P55 chipsets, but also some AMD-only configurations, though to a lesser extent. Lower readback can probably happen because of various chipset / BIOS / driver / OS interactions that prevent the PCIe handshaking from delivering the highest possible readback bandwidth.

So I guess the X58 with 5850 / 5870 is a worst-case scenario. And I wouldn't bet on it being fixed soon for Radeons. After all, AMD wants to sell the new 1,500+ euro FirePro 8800.

My 4850 when mounted in my backup computer, an X2 5050e (2.6 GHz) with 4GB DDR2-800C5 on 785G mobo and XP, gets about 3.24 GB/s upload ( 4.4 GB/s on X58 ) and 2.31 GB/s readback, for large blocks in PCIe Speed Test. With a 4670 I get 3.28 GB/s upload and 2.87 GB/s readback.

 

0 Likes

To report any issues to AMD, please send an email to streamdeveloper@amd.com. Please do not private message AMD employees.
0 Likes

Hi,

we've reported the problem to AMD and, via the PC manufacturer, also to Intel. Maybe there will be a solution in the next Catalyst package.

Otherwise the card is quite useless for us...

joern

0 Likes

I have a similar problem with poor readback performance from GPU to CPU.

Ubuntu 9.10 64bit - ATI driver 10.3 - Intel i7 975 Extreme - SAPPHIRE ATI HD 5970 OC - Motherboard EVGA 4-WAY-SLI - 6 GB RAM Corsair cmg6gx3m3a2000c8

PCIeSpeedTest_v0.2/PCIeSpeedTest' -tdf pcietest1
Devices found: 2

===> Testing device 0 <===
Device type: Unknown
Max resource 2D width/height: 16384/16384
Total GPU memory size: 1024 MB
Total CPU cached space size: 508 MB
Total CPU uncached space size: 1279 MB
GPU engine clock: 1000 MHz
GPU memory clock: 1500 MHz
Number of timing loops: 100
[        16 bytes] CPU->GPU= 800.000 KB/sec, GPU->CPU= 800.000 KB/sec
[        32 bytes] CPU->GPU=   1.600 MB/sec, GPU->CPU=   3.200 MB/sec
[        64 bytes] CPU->GPU= 914.286 KB/sec, GPU->CPU=   1.280 MB/sec
[       128 bytes] CPU->GPU=   4.267 MB/sec, GPU->CPU=   6.400 MB/sec
[       256 bytes] CPU->GPU=  12.800 MB/sec, GPU->CPU=  12.800 MB/sec
[       512 bytes] CPU->GPU=  25.600 MB/sec, GPU->CPU=  25.600 MB/sec
[      1024 bytes] CPU->GPU=  51.200 MB/sec, GPU->CPU=  51.200 MB/sec
[      2048 bytes] CPU->GPU= 102.400 MB/sec, GPU->CPU=  34.133 MB/sec
[      4096 bytes] CPU->GPU= 204.800 MB/sec, GPU->CPU= 204.800 MB/sec
[      8192 bytes] CPU->GPU= 409.600 MB/sec, GPU->CPU= 409.600 MB/sec
[     16384 bytes] CPU->GPU= 819.200 MB/sec, GPU->CPU= 819.200 MB/sec
[     32768 bytes] CPU->GPU=   1.638 GB/sec, GPU->CPU=   1.638 GB/sec
[     65536 bytes] CPU->GPU=   2.185 GB/sec, GPU->CPU=   1.638 GB/sec
[    131072 bytes] CPU->GPU=   3.277 GB/sec, GPU->CPU= 569.878 MB/sec
[    262144 bytes] CPU->GPU=   3.745 GB/sec, GPU->CPU= 689.853 MB/sec
[    524288 bytes] CPU->GPU=   4.033 GB/sec, GPU->CPU= 873.813 MB/sec
[   1048576 bytes] CPU->GPU=   4.559 GB/sec, GPU->CPU= 852.501 MB/sec
[   2097152 bytes] CPU->GPU=   4.766 GB/sec, GPU->CPU= 803.507 MB/sec
[   4194304 bytes] CPU->GPU=   4.821 GB/sec, GPU->CPU= 824.028 MB/sec
[   8388608 bytes] CPU->GPU=   4.906 GB/sec, GPU->CPU= 820.803 MB/sec
[  16777216 bytes] CPU->GPU=   4.949 GB/sec, GPU->CPU= 819.200 MB/sec
[  33554432 bytes] CPU->GPU=   4.964 GB/sec, GPU->CPU= 815.418 MB/sec
[  67108864 bytes] CPU->GPU=   4.975 GB/sec, GPU->CPU= 815.517 MB/sec
[ 134217728 bytes] CPU->GPU=   4.977 GB/sec, GPU->CPU= 812.161 MB/sec
[ 268435456 bytes] CPU->GPU=   4.980 GB/sec, GPU->CPU= 810.004 MB/sec
[ 536870912 bytes] CPU->GPU=   4.981 GB/sec, GPU->CPU= 810.787 MB/sec
calResAllocLocal2D() returned an error when trying to allocate 1073741824 bytes!
Peak CPU->GPU Bandwidth =   4.981 GB/sec [data size = 536870912 bytes]
Peak GPU->CPU Bandwidth =   1.638 GB/sec [data size = 32768 bytes]

===> Testing device 1 <===
Device type: Unknown
Max resource 2D width/height: 16384/16384
Total GPU memory size: 1024 MB
Total CPU cached space size: 508 MB
Total CPU uncached space size: 1279 MB
GPU engine clock: 1000 MHz
GPU memory clock: 1500 MHz
Number of timing loops: 100
[        16 bytes] CPU->GPU= 800.000 KB/sec, GPU->CPU= 800.000 KB/sec
[        32 bytes] CPU->GPU=   1.067 MB/sec, GPU->CPU= 457.143 KB/sec
[        64 bytes] CPU->GPU=   2.133 MB/sec, GPU->CPU=   3.200 MB/sec
[       128 bytes] CPU->GPU= 984.615 KB/sec, GPU->CPU=   6.400 MB/sec
[       256 bytes] CPU->GPU=   3.657 MB/sec, GPU->CPU=   8.533 MB/sec
[       512 bytes] CPU->GPU=  25.600 MB/sec, GPU->CPU=  17.067 MB/sec
[      1024 bytes] CPU->GPU=   8.533 MB/sec, GPU->CPU=  51.200 MB/sec
[      2048 bytes] CPU->GPU= 102.400 MB/sec, GPU->CPU= 102.400 MB/sec
[      4096 bytes] CPU->GPU= 204.800 MB/sec, GPU->CPU= 136.533 MB/sec
[      8192 bytes] CPU->GPU= 409.600 MB/sec, GPU->CPU= 273.067 MB/sec
[     16384 bytes] CPU->GPU= 136.533 MB/sec, GPU->CPU= 819.200 MB/sec
[     32768 bytes] CPU->GPU=   1.638 GB/sec, GPU->CPU=   1.092 GB/sec
[     65536 bytes] CPU->GPU=   2.185 GB/sec, GPU->CPU=   1.638 GB/sec
[    131072 bytes] CPU->GPU=   2.621 GB/sec, GPU->CPU= 624.152 MB/sec
[    262144 bytes] CPU->GPU=   3.277 GB/sec, GPU->CPU= 689.853 MB/sec
[    524288 bytes] CPU->GPU=   4.369 GB/sec, GPU->CPU= 873.813 MB/sec
[   1048576 bytes] CPU->GPU=   4.559 GB/sec, GPU->CPU= 832.203 MB/sec
[   2097152 bytes] CPU->GPU=   4.766 GB/sec, GPU->CPU= 809.711 MB/sec
[   4194304 bytes] CPU->GPU=   4.821 GB/sec, GPU->CPU= 820.803 MB/sec
[   8388608 bytes] CPU->GPU=   4.906 GB/sec, GPU->CPU= 820.001 MB/sec
[  16777216 bytes] CPU->GPU=   4.964 GB/sec, GPU->CPU= 821.205 MB/sec
[  33554432 bytes] CPU->GPU=   4.971 GB/sec, GPU->CPU= 816.807 MB/sec
[  67108864 bytes] CPU->GPU=   4.956 GB/sec, GPU->CPU= 818.600 MB/sec
[ 134217728 bytes] CPU->GPU=   4.969 GB/sec, GPU->CPU= 818.850 MB/sec
[ 268435456 bytes] CPU->GPU=   4.973 GB/sec, GPU->CPU= 817.578 MB/sec
[ 536870912 bytes] CPU->GPU=   4.978 GB/sec, GPU->CPU= 819.038 MB/sec
calResAllocLocal2D() returned an error when trying to allocate 1073741824 bytes!
Peak CPU->GPU Bandwidth =   4.978 GB/sec [data size = 536870912 bytes]
Peak GPU->CPU Bandwidth =   1.638 GB/sec [data size = 65536 bytes]

 

0 Likes


I'm getting the same thing with the 5870/5970 and a FirePro on an ASUS P6x58.

 

 

0 Likes

Hi charliex,

to be more specific, does the problem with the PCIe transfer rates occur on an HD 58xx-series based FirePro?

And if not already done, please report the bug to AMD. Maybe the issue gets a higher priority when they see that not only their consumer line is affected...

joern

0 Likes

Hi,

just an update: the problem isn't gone with the Catalyst 10.4 driver (tested under Linux).

joern

0 Likes

A similar CPU-GPU perf problem here:

ASUS P7P55D-E Premium; i7 860, 8GB 1333 Kingston RAM; A single Sapphire 5850 in the 1st PCIe 2.0 x16 slot (the 2nd x16 slot is unoccupied); Win7 x64 Ult.

The GPU is slightly overclocked, but I verified that the problem exists at the factory clock as well.

Tested with 10.2, 10.3 and the latest 10.4.

I am working on a GPGPU app and this little problem rains on the whole thing. I filed a service request and had a one-way lively discussion with the support ex machina.

Unless this gets resolved soon, I may have to go the Fermi route.

Alex.

Devices found: 1

===> Testing device 0 <===
Device type: Unknown
Max resource 2D width/height: 16384/16384
Total GPU memory size: 1024 MB
Total CPU cached space size: 1467 MB
Total CPU uncached space size: 1467 MB
GPU engine clock: 765 MHz
GPU memory clock: 1115 MHz
Number of timing loops: 100
[        16 bytes] CPU->GPU= 101.163 KB/sec, GPU->CPU= 540.209 KB/sec
[        32 bytes] CPU->GPU=   1.117 MB/sec, GPU->CPU=   1.125 MB/sec
[        64 bytes] CPU->GPU=   2.232 MB/sec, GPU->CPU=   2.111 MB/sec
[       128 bytes] CPU->GPU=   4.314 MB/sec, GPU->CPU=   4.607 MB/sec
[       256 bytes] CPU->GPU=   2.671 MB/sec, GPU->CPU=   9.165 MB/sec
[       512 bytes] CPU->GPU=  16.947 MB/sec, GPU->CPU=  16.111 MB/sec
[      1024 bytes] CPU->GPU=  36.957 MB/sec, GPU->CPU=  36.759 MB/sec
[      2048 bytes] CPU->GPU=  73.875 MB/sec, GPU->CPU=  69.532 MB/sec
[      4096 bytes] CPU->GPU= 137.162 MB/sec, GPU->CPU= 148.082 MB/sec
[      8192 bytes] CPU->GPU= 294.147 MB/sec, GPU->CPU= 295.385 MB/sec
[     16384 bytes] CPU->GPU= 589.839 MB/sec, GPU->CPU= 487.206 MB/sec
[     32768 bytes] CPU->GPU=   1.176 GB/sec, GPU->CPU= 669.146 MB/sec
[     65536 bytes] CPU->GPU=   2.174 GB/sec, GPU->CPU= 676.652 MB/sec
[    131072 bytes] CPU->GPU=   3.016 GB/sec, GPU->CPU= 577.624 MB/sec
[    262144 bytes] CPU->GPU=   3.514 GB/sec, GPU->CPU= 540.532 MB/sec
[    524288 bytes] CPU->GPU=   3.795 GB/sec, GPU->CPU= 581.861 MB/sec
[   1048576 bytes] CPU->GPU=   3.817 GB/sec, GPU->CPU= 559.868 MB/sec
[   2097152 bytes] CPU->GPU=   4.085 GB/sec, GPU->CPU= 545.592 MB/sec
[   4194304 bytes] CPU->GPU=   4.344 GB/sec, GPU->CPU= 544.993 MB/sec
[   8388608 bytes] CPU->GPU=   4.107 GB/sec, GPU->CPU= 549.369 MB/sec
[  16777216 bytes] CPU->GPU=   4.314 GB/sec, GPU->CPU= 535.173 MB/sec
[  33554432 bytes] CPU->GPU=   4.332 GB/sec, GPU->CPU= 540.510 MB/sec
[  67108864 bytes] CPU->GPU=   4.358 GB/sec, 



 

 

 

0 Likes

Hello abab,

this is interesting, because you found out that not only the enthusiast X58 is affected but also the recent Intel mainstream chipsets in conjunction with ATI graphics cards.

To sum up: FireStreams and Radeons on current Intel boards suffer from this issue with the drivers on both Linux and Windows.

My problem now is that I have 20 HD5870 cards whose primary purpose is to run our GPGPU computations. This issue renders them useless.

In the next few days I will speak with our hardware dealer and try to find a solution. Maybe we will be forced to go the Fermi way too... 😕

joern

 

 

0 Likes

any news about this issue? I just bought an HD5850 and I am doing an OpenCL project for the university. I hope this won't be a big problem

0 Likes

The issue is being looked into by developers, also the issue exists only for certain Intel boards.

0 Likes

Originally posted by: omkaranathan The issue is being looked into by developers, also the issue exists only for certain Intel boards.

 

I've bought a Gigabyte GA-X58A-UD3R 1366 motherboard. will I be affected by this issue?

thanks

0 Likes
xero
Journeyman III

Originally posted by: mux85
Originally posted by: omkaranathan The issue is being looked into by developers, also the issue exists only for certain Intel boards.

 

I've bought a Gigabyte GA-X58A-UD3R 1366 motherboard. will I be affected by this issue?

thanks

For the intel MB with IOH chip (such as X58/P55), I guess so. : (

0 Likes

I've read in another forum (AnandTech, I guess) that the P55, Q55, etc. are also affected. We have not tested it, but it is very likely.

 


 

0 Likes

Hi,

there is no improvement with Catalyst driver 10.5 or 10.4. I can't measure any difference with PCIeSpeedTest 0.2. Hopefully this GPGPU-related show stopper is still in scope for the developers.

joern

0 Likes

Well, in my case ATI's advice on being patient has reached its time limit. I am no stranger and can appreciate  technical difficulties in software/hardware development  - if ati/amd would just provide some visibility into the problem and its resolution effort.

Sans that,  I can't wait forever - so I am off to Fermi lands for now.

Will be back - maybe.

Alex.

 

0 Likes
Tzupy
Journeyman III

Still no improvement for me (i7-920, 4850 1GB, Vista64) with 10.6 drivers.

But an OpenGL bug I reported some time ago has been fixed...

0 Likes

Hi Tzupy and all others,

there is also no improvement for me with 10.6.

These days we are in contact with Intel and ASUS.

Intel says there is a speed negotiation problem, related to the fact that the card has PCI Express 2.1 while the board only supports PCIe 2.0. Hmmm. As far as I know, Intel boards in particular and all other mainboards in general only support PCIe 2.0. But the explanation itself doesn't sound bad. From the beginning we have guessed that there is a handshake problem between the host and device interfaces, because the transfer speed seems to be limited to x4 or something.

Asus (Germany), on the other hand, confirmed the problem. They tested the P6T SE with the PCIeSpeedTest from ATI and got the same results. In addition, they tried a game benchmark and supposedly 😆 didn't find any flaws. They asked us to send them a "real-world" application which suffers from the issue. I'll send them one in the next few days.

The game benchmark Asus mentioned is the very reason why this flaw isn't in the scope of the ATI developers. It simply plays no role for their major customers, the gamers and board vendors. But they also promote their cards as "GPGPUs" and should act as professionals.

As an example of how other vendors care about their customers, let me describe a recent event. I spotted a serious bug in the NVIDIA OpenCL compiler (see code). They have a website dedicated to professional developers, where you can file a bug. Three hours after I had done this, an employee asked me for a demo program, instructions to start it, and the output of a bug-report script. One working day later the flaw was fixed and the bugfix was added to the compiler. This was today. The fixed compiler will ship with the next driver release in the next few days.

Regards,

Jörn

 

 

OpenCL kernel code:

    char c = -1;
    float f;
    double d;

    f = c;
    d = c;
    // result: "f" or "d" is not "-1" but "255"

    // also wrong:
    // f = (signed) c;
    // f = (signed char) c
    // f = (int) c;
    // f = (signed int) c
    ...
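
In OpenCL C, `char` is a signed 8-bit type, so -1 is the correct result. The reported 255 is what one would get if the same byte were read as unsigned before the conversion. That is only a guess at the mechanism, not a confirmed diagnosis of the compiler; the Python sketch below merely illustrates the two readings of the bit pattern.

```python
import ctypes

raw = 0xFF  # bit pattern of (char)-1 in two's complement

# Interpret the same 8 bits as a signed and as an unsigned value.
as_signed = ctypes.c_int8(raw).value
as_unsigned = ctypes.c_uint8(raw).value

print(float(as_signed))    # the value a correct char -> float conversion yields
print(float(as_unsigned))  # the value matching the reported wrong result
```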

0 Likes

Hi all,

there is good news for all AMD/ATI customers using their 5xxx cards on an X58 board for GPGPU under Linux. The new Catalyst 10.7 fixes the PCIe performance issue. With this driver we measure the maximum possible (real-life) interface bandwidth in both directions:

[        16 bytes] CPU->GPU= 320.000 KB/sec, GPU->CPU= 200.000 KB/sec
[        32 bytes] CPU->GPU= 640.000 KB/sec, GPU->CPU= 640.000 KB/sec
[        64 bytes] CPU->GPU=   1.280 MB/sec, GPU->CPU=   2.133 MB/sec
[       128 bytes] CPU->GPU= 800.000 KB/sec, GPU->CPU=   2.560 MB/sec
[       256 bytes] CPU->GPU=   6.400 MB/sec, GPU->CPU=   3.200 MB/sec
[       512 bytes] CPU->GPU=  17.067 MB/sec, GPU->CPU=  25.600 MB/sec
[      1024 bytes] CPU->GPU=  34.133 MB/sec, GPU->CPU=  51.200 MB/sec
[      2048 bytes] CPU->GPU=  68.267 MB/sec, GPU->CPU=  68.267 MB/sec
[      4096 bytes] CPU->GPU= 136.533 MB/sec, GPU->CPU= 204.800 MB/sec
[      8192 bytes] CPU->GPU=  91.022 MB/sec, GPU->CPU= 409.600 MB/sec
[     16384 bytes] CPU->GPU= 546.133 MB/sec, GPU->CPU= 819.200 MB/sec
[     32768 bytes] CPU->GPU=   1.638 GB/sec, GPU->CPU=   1.092 GB/sec
[     65536 bytes] CPU->GPU=   2.185 GB/sec, GPU->CPU=   2.185 GB/sec
[    131072 bytes] CPU->GPU=   1.192 GB/sec, GPU->CPU=   3.277 GB/sec
[    262144 bytes] CPU->GPU=   5.243 GB/sec, GPU->CPU=   5.243 GB/sec
[    524288 bytes] CPU->GPU=   5.825 GB/sec, GPU->CPU=   4.766 GB/sec
[   1048576 bytes] CPU->GPU=   6.168 GB/sec, GPU->CPU=   4.993 GB/sec
[   2097152 bytes] CPU->GPU=   5.992 GB/sec, GPU->CPU=   6.554 GB/sec
[   4194304 bytes] CPU->GPU=   6.079 GB/sec, GPU->CPU=   6.658 GB/sec
[   8388608 bytes] CPU->GPU=   6.123 GB/sec, GPU->CPU=   6.658 GB/sec
[  16777216 bytes] CPU->GPU=   6.123 GB/sec, GPU->CPU=   6.684 GB/sec
[  33554432 bytes] CPU->GPU=   5.483 GB/sec, GPU->CPU=   6.711 GB/sec
[  67108864 bytes] CPU->GPU=   5.297 GB/sec, GPU->CPU=   6.704 GB/sec
[ 134217728 bytes] CPU->GPU=   5.320 GB/sec, GPU->CPU=   6.704 GB/sec
[ 268435456 bytes] CPU->GPU=   5.313 GB/sec, GPU->CPU=   6.706 GB/sec
[ 536870912 bytes] CPU->GPU=   5.092 GB/sec, GPU->CPU=   6.687 GB/sec
calResAllocLocal2D() returned an error when trying to allocate 1073741824 bytes!
Peak CPU->GPU Bandwidth =   6.168 GB/sec [data size = 1048576 bytes]
Peak GPU->CPU Bandwidth =   6.711 GB/sec [data size = 33554432 bytes]

 

Thank you AMD/ATI developers.

Joern

 

0 Likes

Hi again,

one thing: under Windows 7 x64, there are no improvements with 10.7 / 10.7 OpenGL ES preview. Maybe the fix will come soon...

Joern

 

0 Likes

No improvement for me either under Vista64.

I wanted to replace my aging 4850 with a new GTX 460, but Fermi cards seem to have similar issues with readback.

http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=279487#Post279487

0 Likes

So the issue is resolved for 32-bit systems but not for 64-bit ones? Anyway, the fact that it can be resolved with a driver update is good news. Thanks.

0 Likes

Hi all, 

I have confirmed jhoffmann's results on 64bit linux (Ubuntu 9.10) system.

With 10.5, I have obtained ~ 674 MB/sec with 67108864 bytes for GPU->CPU.

With 10.7, I got 5.765 GB/sec! Even faster than CPU->GPU.

0 Likes
xero
Journeyman III

Originally posted by: nnsan Hi all, 

I have confirmed jhoffmann's results on 64bit linux (Ubuntu 9.10) system.

With 10.5, I have obtained ~ 674 MB/sec with 67108864 bytes for GPU->CPU.

With 10.7, I got 5.765 GB/sec! Even faster than CPU->GPU.

unfortunately, for my machine (HD4870x2, redhat 5/ 64 bit), driver 10.7 is as slow as before.

 

0 Likes

Same problem with the newest Catalyst 10.7b driver, just as others described earlier. GPU-to-CPU PCIe bandwidth is as low as ~800 MB/s.

HD5870 with ASUS P6T7 WS X58, OS is Windows 7 x64, 12GB (6x2GB config) DDR3.

I hope this gets fixed soon on Windows x64 systems.

0 Likes
mensjeans
Journeyman III

I tried installing an HD4870 on the P55 mainboard. It is as slow as the 5870. I also installed the 5870 on a P45 mainboard. There the CPU->GPU and GPU->CPU speeds both reach ~5 GB/s. Thank you very much for sharing!

0 Likes

Still no improvement for me with the new 10.8 drivers.

How difficult to fix can this be? |-(

0 Likes

It could be a HW issue, in which case it may be impossible to fix. As you can see, even Fermi cards have this issue. Change your motherboard.

0 Likes
Tzupy
Journeyman III

No, it's not a HW issue, it's a driver issue, that has been recently fixed for some Linux flavors, but still not for Windows.

And the Fermi cards have similar issues due to buggy drivers, while the older GTX 28x cards do not, IMO because the drivers for them are mature.

0 Likes