drallan
Challenger

Slow PCIe bus with single GPU in x16 slots, seeking cause

PCIe performance becomes important when running multiple high-end cards like the new 7970, which, when overclocked, runs at almost 5 teraflops.

Trying to optimize a kernel, I discovered that my PCIe bus is limited to 1.6 GB/s, read and write, where it should be about 5-6 GB/s in a v2.0 x16 slot. I've tried several GPUs, one at a time, in every slot and always get the same numbers. I also updated the main board BIOS and drivers, the AMD drivers, and tried every BIOS configuration; the whole works.

I get identical numbers from programs like PCIeBandwidth, PCIspeedtest (v0.2), and my own code using all the suggested methods from the AMD APP OpenCL Programming Guide (Dec 2011) for fast-path transfers (a good read; a rough sketch of my timing loop follows the numbers below). The numbers I get are:

PCIe x4 slot, transfer rate=1.40 GB/s (read and write) (one card)

PCIe x16 at x8, transfer rate=1.65 GB/s (read and write) (requires 2 cards)

PCIe x16 at x16, transfer rate=1.65 GB/s (read and write) (one card)

Also note that the 1.40 GB/s rate for the x4 slot is correct; extrapolated to x16 it would be 5.6 GB/s. The x16 slots are faster, but not by much. According to GPU-z, the x16 slot is running in x16 v2.0 PCIe mode.
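
The timing loop itself is roughly the following sketch (plain OpenCL C host code; error checking omitted, the 32 MB size and 20 loops are arbitrary, and it only shows the method, not my exact code):

#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    enum { SIZE = 32 * 1024 * 1024, LOOPS = 20 };

    /* First GPU on the first platform; no error checking in this sketch. */
    cl_platform_id plat;  clGetPlatformIDs(1, &plat, NULL);
    cl_device_id   dev;   clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    cl_context       ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q   = clCreateCommandQueue(ctx, dev, 0, NULL);

    void *host = malloc(SIZE);
    memset(host, 1, SIZE);
    cl_mem dst = clCreateBuffer(ctx, CL_MEM_READ_ONLY, SIZE, NULL, NULL);

    /* Warm-up transfer so allocation cost is not timed. */
    clEnqueueWriteBuffer(q, dst, CL_TRUE, 0, SIZE, host, 0, NULL, NULL);

    /* Crude wall-clock timing of LOOPS host-to-device writes. */
    clock_t t0 = clock();
    for (int i = 0; i < LOOPS; ++i)
        clEnqueueWriteBuffer(q, dst, CL_FALSE, 0, SIZE, host, 0, NULL, NULL);
    clFinish(q);
    double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("host->device: %.2f GB/s\n", (double)SIZE * LOOPS / sec / 1e9);

    clReleaseMemObject(dst);
    clReleaseCommandQueue(q);
    clReleaseContext(ctx);
    free(host);
    return 0;
}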

PCI problems can be due to a combination of factors, but I doubt it's a pure hardware problem, because I've tried 6870, 6970, and 7970 GPUs, and because I'm using a new top-end main board specifically designed for high PCIe performance with 3-way Crossfire (ASRock Z68 Extreme7). It has 5 PCIe slots: one for PCIe v3.0, 3 for multiple GPUs at (x16, x16, 0) or (x16, x8, x8), and a dedicated x4 slot. It also uses PLX PEX8608 and NF200 chips to increase the number of PCIe lanes.

I'm currently using the new 12.3 drivers (dated Feb 16, 2012).

I've worked with PCIe before and know how complex these problems can be. Any help or feedback is greatly appreciated, particularly recent measurements of PCIe bus performance. Any help from AMD devgurus is also welcome (of course).

I will add anything useful that I learn in this thread.

Many thanks.

Other info:

Crossfire is not connected or selected.

GPU-z says the cards are running x16 v2.0 when in x16 slots.

GPU-z reports the 7970s as x16 v3.0 running at x16 v1.1; when the GPU is loaded, it switches to run at x16 v2.0. This does not affect the low PCIe performance problem.

0 Likes
13 Replies
drallan
Challenger

Update 1:

Using CL_MEM_USE_PERSISTENT_MEM_AMD to copy directly to an on-device buffer transfers the data at full bandwidth (5.6 GB/s). Thus, the problem is not likely to be hardware.
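
For anyone following along, the persistent-memory path that does run at full speed looks roughly like this (a sketch only, not my exact code; CL_MEM_USE_PERSISTENT_MEM_AMD comes from CL/cl_ext.h, error checking omitted):

#include <CL/cl.h>
#include <CL/cl_ext.h>   /* CL_MEM_USE_PERSISTENT_MEM_AMD */
#include <string.h>

/* Fill a device-resident but host-visible buffer by mapping it and
   writing through the mapped pointer; the memcpy itself goes over PCIe.
   Sketch only: no error checks, caller releases the returned buffer. */
static cl_mem write_persistent(cl_context ctx, cl_command_queue q,
                               const void *src, size_t size)
{
    cl_mem buf = clCreateBuffer(ctx,
                                CL_MEM_READ_ONLY | CL_MEM_USE_PERSISTENT_MEM_AMD,
                                size, NULL, NULL);

    void *p = clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_WRITE,
                                 0, size, 0, NULL, NULL, NULL);
    memcpy(p, src, size);          /* host write lands in device memory */
    clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);
    clFinish(q);
    return buf;
}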

All other recommended options that promise peak interconnect bandwidth by using pinned or pre-pinned memory are slow (1.6 GB/s), the same speed as non-pinned memory.

The question, then, is why pinned memory does not work at full speed. Is the memory really pinned?

The following requirements for pinned memory were followed (a rough sketch of the transfer path follows the list):

1. All buffers are aligned to 4KB boundaries.

2. The buffers are not used as kernel arguments when transferring/mapping.
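
The pre-pinned path I'm testing follows the guide's recipe and is roughly this sketch (names are mine, no error checks; the staging buffer is never passed to a kernel and the runtime is supposed to keep it pinned):

#include <CL/cl.h>
#include <string.h>

/* Pre-pinned host-to-device write, roughly as the APP guide describes it.
   Sketch only: no error checks. */
static void prepinned_write(cl_context ctx, cl_command_queue q,
                            cl_mem device_buf, const void *src, size_t size)
{
    /* 1. Host-resident staging buffer; the runtime should pin its pages. */
    cl_mem pinned = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, size, NULL, NULL);

    /* 2. Map it once to get the (pinned) host pointer and fill it. */
    void *host = clEnqueueMapBuffer(q, pinned, CL_TRUE, CL_MAP_WRITE,
                                    0, size, 0, NULL, NULL, NULL);
    memcpy(host, src, size);

    /* 3. Use that pointer as the source of an ordinary write to the
          real device buffer; 'pinned' itself never goes to a kernel. */
    clEnqueueWriteBuffer(q, device_buf, CL_TRUE, 0, size, host,
                         0, NULL, NULL);

    /* 4. Clean up the staging buffer. */
    clEnqueueUnmapMemObject(q, pinned, host, 0, NULL, NULL);
    clFinish(q);
    clReleaseMemObject(pinned);
}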

Again, any input is appreciated.

drallan

0 Likes

To have a common language, please run the AMD SDK sample 'BufferBandwidth' with the arguments '-t 3 -if 3 -of 3' and post the console output.

It performs pre-pinned memory transfer operations and will help us better understand the issue.

What OS are you using?

0 Likes

Hello tzachi.cohen, thanks for responding. I am using Windows 7 x64, Service Pack 1.

Below I show the output for two tests:

1. BufferBandwidth -t 3  -if 3 -of 3

2. BufferBandwidth  -t 3  -if 0 -of 1

My understanding is that for case 2, EnqueueRead/WriteBuffer writes to the GPU before the kernel is enqueued.

drallan

Microsoft Windows [Version 6.1.7601]

******************************************************************

Command line: D:\>bufferbandwidth -d 2 -t 3  -if 3 -of 3

******************************************************************

PCIE slot           1 [x16 v3.0] running at [x16 v2.0]

Device 2            Tahiti

Build:               _WINxx release

GPU work items:      32768

Buffer size:         33554432

CPU workers:         1

Timing loops:        20

Repeats:             1

Kernel loops:        20

inputBuffer:         CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR

outputBuffer:        CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR

copyBuffer:          CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR

Host baseline (single thread, naive):

Timer resolution  301.861 ns

Page fault  506.466

Barrier speed  61.21581 ns

CPU read   12.917 GB/s

memcpy()   6.3788 GB/s

memset(,1,)   16.3768 GB/s

memset(,0,)   16.3993 GB/s

AVERAGES (over loops 2 - 19, use -l for complete log)

--------

1. Mapping copyBuffer as mappedPtr

             clEnqueueMapBuffer:  0.000007 s [  5002.27 GB/s ]

2. Host CL write from mappedPtr to inputBuffer

           clEnqueueWriteBuffer:  0.004682 s       7.17 GB/s

3. GPU kernel read of inputBuffer

       clEnqueueNDRangeKernel():  0.548830 s       1.22 GB/s

                verification ok

4. GPU kernel write to outputBuffer

       clEnqueueNDRangeKernel():  0.863300 s       0.78 GB/s

5. Host CL read of outputBuffer to mappedPtr

            clEnqueueReadBuffer:  0.004887 s       6.87 GB/s

                 verification ok

6. Unmapping copyBuffer

      clEnqueueUnmapMemObject():  0.000039 s [   850.00 GB/s ]

Passed!

******************************************************************

Command line: D:\>bufferbandwidth -d 2 -t 3  -if 0 -of 1

******************************************************************

AVERAGES (over loops 2 - 19, use -l for complete log)

--------

1. Mapping copyBuffer as mappedPtr

             clEnqueueMapBuffer:  0.000011 s [  2929.59 GB/s ]

2. Host CL write from mappedPtr to inputBuffer

           clEnqueueWriteBuffer:  0.019592 s       1.71 GB/s

3. GPU kernel read of inputBuffer

       clEnqueueNDRangeKernel():  0.004592 s     146.13 GB/s

                verification ok

4. GPU kernel write to outputBuffer

       clEnqueueNDRangeKernel():  0.005429 s     123.62 GB/s

5. Host CL read of outputBuffer to mappedPtr

            clEnqueueReadBuffer:  0.019979 s       1.68 GB/s

                 verification ok

6. Unmapping copyBuffer

      clEnqueueUnmapMemObject():  0.000040 s [   836.50 GB/s ]

Passed!

******************************************************************

0 Likes

It seems the PCIe bus is underutilized in the second run. On my machine it runs at full PCIe speed.

I suggest you run an independent PCIe performance test, and if the problem persists, contact the motherboard distributor.

Wow, thanks for the fast reply.

Just to be clear, can you post your numbers from the second test?

What OS/driver did you use?

Again, thanks.

0 Likes
revisionfx
Journeyman III

Here to report that we discovered a bug with a new Asus motherboard along these lines.

Here's a snapshot of the BIOS (Sandy Bridge Extreme, 6 cores):

http://www.revisioneffects.com/bugreports/Bios.jpg

We have 3 test computers: 2 have that motherboard with that BIOS, and another has a different Asus board (a 4-core SB). It's what Computer Central dropped in our case for some lab testing.

And 4 7970s (2 XFX and 2 Sapphire).

That's what it took to isolate the issue.

Diagnostic: RAM-to-GPU memory transfer is really slow (via OpenCL) on the Sandy Bridge 6-core based motherboard we have. The cards are fine, as they work fine on the other motherboard (in PCIe 2.0). On that motherboard it's really slow in either the PCIe 2.0 or 3.0 slots.

Question 1: Does anyone know how to get to tier 2 at Asus to report such problems? Not as easy as you'd think (aside from getting a phone machine...).

I do see they upgraded the BIOS for their own 7970, but we tried that and it makes no difference.

Question 2: I tried BufferBandwidth and my results are not as clear as drallan's, but it looks like the inverse: in our case it's the first run that looks wacko. I'm not too used to having to worry about PCIe. What is an independent PCIe performance test?

This is under Windows 7 with, as far as I know, all the latest OpenCL stuff...

- pierre

(jasmin at revisionfx dot com)

0 Likes

Don't know if it helps, but I did learn more about my slow PCIe problem.

It appears to be the motherboard design and relates to using PLX PCIe switches and NF200 chips to get more lanes to the PCIe cards.

My (Z68) board has 5 PCIe slots.

  1. One x16 slot. When used, the other 4 slots are turned off and you can use only 1 GPU. This slot works at full speed (x16, about 5.7 GB/s).
  2. A group of 4 slots switched by the NF200 and PLX chips. All of these run very slowly no matter what the configuration, even with 1 card in the fastest x16 slot.

My board sounds like your "good" board: all slow except one slot. Your two X79 boards are a surprise though; the X79 has about double the lane capacity, but it may also have some design issues.

More than one motherboard manufacturer has been a bit deceptive about PCIe performance. Several manuals caution against using certain slot combinations for performance reasons, but the advertising leads you to think these combinations all perform the same.

Sorry, I don't know how to get the manufacturers' attention on such problems. Forums for specific board brands sometimes have company reps who occasionally respond.

Good luck,

  drallan

0 Likes

These boards are designed for SLI/Crossfire setups, where slower PCIe communication doesn't introduce much of a slowdown in games. There are tests where they artificially cut a full x16 slot down to x8, x4, x2, and even x1; even an x4 PCIe slot introduces only a small slowdown (around 10-15%) in games.

0 Likes

"It appears to be the motherboard design and relates to using PLX PCIe

switches and NF200 chips to get more lanes to the PCIe cards."

Thanks,

I see that Relee at 3/10/12 at 8:05pm relayed that he found a good hint

with a different motherboard - same problem

http://www.overclock.net/t/1223512/s-l-o-w-pcie-bus-with-single-gpu-6970-7970-in-any-slot-why

says he has the same problem as me basically with a different

motherboard, he says the bus refuses to switch to burst mode.

I just asked him if he found out more since.

I remember seeing a similar issue with an Nvidia driver for linux versus

on windows on same machine. I wonder if the AMD openCL driver has a role

in there?

Pierre

0 Likes

Nou is correct; gamers will not notice this problem so much.

revisionfx wrote:

I remember seeing a similar issue with an Nvidia driver for Linux versus Windows on the same machine. I wonder if the AMD OpenCL driver has a role in there?

No, I don't think the AMD drivers have anything to do with the slowdown.

From what I found, AMD's BufferBandwidth.exe is probably the best way to test PCIe transfer speed. Can you try that and post the numbers? I would be interested to see them. Use these settings:

Command line: D:\>bufferbandwidth -d 2 -t 3  -if 0 -of 1

PS: the RAMPAGE motherboard has a small 6-pin PCIe power connector very near PCIe slot 1, near the audio output. Asus recommends plugging power into that when using many GPUs (I think this is only for Crossfire, but most boards have this). Maybe you can plug that in.

0 Likes

Uh!... Looks like there is the same issue with the new Nvidia 680 cards' drivers on X79 motherboards:

http://www.techpowerup.com/162942/GeForce-GTX-680-Release-Driver-Limits-PCI-Express-to-Gen-2.0-on-X79-SNB-E-Systems.html

(and something like x8 instead of x16 speed as well)

Is it the same with AMD drivers? (And is this only OpenCL? I haven't tried yet whether it makes a difference to read pixels from RAM with OpenGL and interop that memory to OpenCL.)

Pierre

0 Likes

Also, I lied with regards to the first motherboard specs (I was working from memory):

Asus P8Z68-V PRO/GEN3 i7 i5 i3 LGA1155 Z68 DDR3 - works as expected (transfers match specs)

Asus RAMPAGE IV Extreme i7 X79 LGA 2011 DDR3 PCIE - slow transfers

0 Likes
rocky67
Journeyman III

Please advise what the best option is when the same numbers are being shown in different programs such as PCI Speedtest and PCIeBandwidth, because I am stuck.

0 Likes