OpenCL

drallan
Challenger

Slow PCIe bus with single GPU in x16 slots, seeking cause

PCIe performance becomes important when running multiple high-end cards like the new 7970, which, when overclocked, runs at almost 5 teraflops.

Trying to optimize a kernel, I discovered that my PCIe bus is limited to 1.6 GB/s, read and write, when it should be about 5-6 GB/s in a v2.0 x16 slot. I've tried several GPUs, one at a time, in every slot, and always get the same numbers. I also updated the main board BIOS and drivers, the AMD drivers, and tried every BIOS configuration; the whole works.

I get identical numbers from programs like PCIeBandwidth, PCIspeedtest (v0.2), and my own code using all the suggested methods from the AMD APP OpenCL Programming Guide (Dec 2011) for fast-path transfers (a good read); a sketch of my timing loop follows the numbers. The numbers I get are:

PCIe x4 slot, transfer rate = 1.40 GB/s (read and write) (one card)

PCIe x16 at x8, transfer rate = 1.65 GB/s (read and write) (requires 2 cards)

PCIe x16 at x16, transfer rate = 1.65 GB/s (read and write) (one card)

Also note that the 1.40 GB/s rate for the x4 slot is correct; extrapolated to x16 it would be 5.6 GB/s. The x16 slots are faster, but not by much. According to GPU-Z, the x16 slot is running in x16 v2.0 PCIe mode.
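
For reference, here is a minimal sketch of the kind of event-based timing loop I used, assuming a context and a queue created with CL_QUEUE_PROFILING_ENABLE; the function name and the 32 MB size are just illustrative:

    #include <CL/cl.h>

    #define NBYTES (32u << 20)   /* 32 MB, large enough to hide launch overhead */

    /* Time one blocking host-to-device transfer via event profiling.
       Returns bandwidth in GB/s (bytes per nanosecond); error checks omitted. */
    double write_bandwidth(cl_context ctx, cl_command_queue queue, void *host_src)
    {
        cl_int err;
        cl_mem dst = clCreateBuffer(ctx, CL_MEM_READ_ONLY, NBYTES, NULL, &err);
        cl_event ev;

        clEnqueueWriteBuffer(queue, dst, CL_TRUE, 0, NBYTES, host_src, 0, NULL, &ev);
        clWaitForEvents(1, &ev);

        cl_ulong t0, t1;
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof t0, &t0, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof t1, &t1, NULL);
        clReleaseEvent(ev);
        clReleaseMemObject(dst);
        return (double)NBYTES / (double)(t1 - t0);   /* bytes/ns equals GB/s */
    }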

PCIe problems can be due to a combination of factors, but I doubt this is a pure hardware problem, because I've tried 6870, 6970, and 7970 GPUs, and because I'm using a new top-end main board specifically designed for high PCIe performance with 3-way CrossFire (ASRock Z68 Extreme7). It has 5 PCIe slots: one PCIe v3.0 slot, three for multi-GPU configurations of (x16, x16, 0) or (x16, x8, x8), and a dedicated x4 slot. It also uses PLX PEX8608 and NF200 chips to increase the PCIe lane count.

I'm currently using the new 12.3 drivers (dated Feb 16, 2012).

I've worked with PCIe before and know how complex these problems can be. Any help or feedback is greatly appreciated, particularly recent measurements of PCIe bus performance. Any help from AMD devgurus is also welcome, of course.

I will add anything useful that I learn in this thread.

Many thanks.

Other info:

Crossfire is not connected or selected.

GPU-Z says the cards are running at x16 v2.0 when in x16 slots.

GPU-Z reports the 7970s as x16 v3.0 capable but running at x16 v1.1. When the GPU is loaded, this switches to x16 v2.0. It does not affect the low PCIe performance problem.

drallan
Challenger

Re: Slow PCIe bus with single GPU in x16 slots, seeking cause

Update 1:

Using CL_MEM_USE_PERSISTENT_MEM_AMD to copy directly to an on-device buffer transfers the data at full bandwidth (5.6 GB/s). Thus, the problem is not likely to be hardware.
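
For anyone following along, this is roughly what the fast path looks like; a minimal sketch, assuming an existing context and queue (the function name and 32 MB size are illustrative). CL_MEM_USE_PERSISTENT_MEM_AMD comes from CL/cl_ext.h:

    #include <string.h>
    #include <CL/cl.h>
    #include <CL/cl_ext.h>   /* CL_MEM_USE_PERSISTENT_MEM_AMD */

    #define NBYTES (32u << 20)

    /* Upload host_src into device-visible memory via the persistent path.
       Error checks omitted. */
    cl_mem persistent_upload(cl_context ctx, cl_command_queue queue, const void *host_src)
    {
        cl_int err;
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_USE_PERSISTENT_MEM_AMD,
                                    NBYTES, NULL, &err);

        /* Mapping returns a host pointer into device-visible memory; the
           memcpy below is the transfer that runs at full PCIe speed here. */
        void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                     0, NBYTES, 0, NULL, NULL, &err);
        memcpy(p, host_src, NBYTES);
        clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
        return buf;
    }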

All the other recommended options that promise peak interconnect bandwidth through pinned or pre-pinned memory are slow (1.6 GB/s), the same speed as non-pinned memory.

The question, then, is why pinned memory does not work at full speed. Is the memory really pinned?

The following requirements for pinned memory were followed (a sketch of the transfer path I'm timing appears after the list):

1. All buffers are aligned to 4KB boundaries.

2. The buffers are not used as kernel arguments when transferring/mapping.
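
And this is roughly the pre-pinned path that only reaches 1.6 GB/s here; a minimal sketch of my reading of the APP guide's recipe, assuming an existing context, queue, and destination device buffer (names and size are illustrative):

    #include <CL/cl.h>

    #define NBYTES (32u << 20)

    /* Pre-pinned upload: allocate a host-side staging buffer with
       CL_MEM_ALLOC_HOST_PTR, map it once, and reuse the mapped pointer as
       the source of ordinary clEnqueueWriteBuffer calls. Error checks omitted. */
    void prepinned_upload(cl_context ctx, cl_command_queue queue, cl_mem devbuf)
    {
        cl_int err;
        cl_mem staging = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, NBYTES, NULL, &err);
        void *pinned = clEnqueueMapBuffer(queue, staging, CL_TRUE,
                                          CL_MAP_READ | CL_MAP_WRITE,
                                          0, NBYTES, 0, NULL, NULL, &err);

        /* ... fill pinned with data here ... */

        /* Because 'pinned' comes from a mapped ALLOC_HOST_PTR buffer, the
           runtime should take the DMA fast path for this transfer. */
        clEnqueueWriteBuffer(queue, devbuf, CL_TRUE, 0, NBYTES, pinned, 0, NULL, NULL);
    }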

Again, any input is appreciated.

drallan

tzachi_cohen
Staff

Re: Slow PCIe bus with single GPU in x16 slots, seeking cause

So that we have a common language, please run the AMD SDK sample 'BufferBandwidth' with the arguments '-t 3 -if 3 -of 3' and post the console output.

It performs pre-pinned memory transfer operations and will help us better understand the issue.

What OS are you using?

drallan
Challenger

Re: Slow PCIe bus with single GPU in x16 slots, seeking cause

Hello tzachi.cohen, thanks for responding. I am using Windows 7 x64, Service Pack 1.

Below I show the output for two tests:

1. BufferBandwidth -t 3  -if 3 -of 3

2. BufferBandwidth  -t 3  -if 0 -of 1

My understanding is that in case 2, clEnqueueRead/WriteBuffer transfer the data to and from the GPU before and after the kernel runs, so the kernel reads device memory rather than host memory over the bus.
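
In code terms, my reading of the two configurations is roughly this (the case 1 flags are taken from the sample's output below; the case 2 flags are my assumption about what -if 0/-of 1 select):

    #include <CL/cl.h>

    #define NBYTES (32u << 20)

    void make_input_buffers(cl_context ctx, cl_command_queue queue, void *host_src)
    {
        cl_int err;

        /* Case 1 (-if 3): CL_MEM_USE_HOST_PTR, so the kernel reads host
           memory across PCIe on every access; step 3 then measures bus speed. */
        cl_mem in_host = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                                        NBYTES, host_src, &err);

        /* Case 2 (-if 0, assumed): a plain device buffer; the PCIe transfer
           happens once in clEnqueueWriteBuffer, and the kernel then reads at
           device-memory speed (hence the 146 GB/s in step 3 of run 2). */
        cl_mem in_dev = clCreateBuffer(ctx, CL_MEM_READ_ONLY, NBYTES, NULL, &err);
        clEnqueueWriteBuffer(queue, in_dev, CL_TRUE, 0, NBYTES, host_src, 0, NULL, NULL);

        clReleaseMemObject(in_host);
        clReleaseMemObject(in_dev);
    }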

drallan

Microsoft Windows [Version 6.1.7601]

******************************************************************

Command line: D:\>bufferbandwidth -d 2 -t 3  -if 3 -of 3

******************************************************************

PCIE slot           1 [x16 v3.0] running at [x16 v2.0]

Device 2            Tahiti

Build:               _WINxx release

GPU work items:      32768

Buffer size:         33554432

CPU workers:         1

Timing loops:        20

Repeats:             1

Kernel loops:        20

inputBuffer:         CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR

outputBuffer:        CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR

copyBuffer:          CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR

Host baseline (single thread, naive):

Timer resolution  301.861 ns

Page fault  506.466

Barrier speed  61.21581 ns

CPU read   12.917 GB/s

memcpy()   6.3788 GB/s

memset(,1,)   16.3768 GB/s

memset(,0,)   16.3993 GB/s

AVERAGES (over loops 2 - 19, use -l for complete log)

--------

1. Mapping copyBuffer as mappedPtr

             clEnqueueMapBuffer:  0.000007 s [  5002.27 GB/s ]

2. Host CL write from mappedPtr to inputBuffer

           clEnqueueWriteBuffer:  0.004682 s       7.17 GB/s

3. GPU kernel read of inputBuffer

       clEnqueueNDRangeKernel():  0.548830 s       1.22 GB/s

                verification ok

4. GPU kernel write to outputBuffer

       clEnqueueNDRangeKernel():  0.863300 s       0.78 GB/s

5. Host CL read of outputBuffer to mappedPtr

            clEnqueueReadBuffer:  0.004887 s       6.87 GB/s

                 verification ok

6. Unmapping copyBuffer

      clEnqueueUnmapMemObject():  0.000039 s [   850.00 GB/s ]

Passed!

******************************************************************

Command line: D:\>bufferbandwidth -d 2 -t 3  -if 0 -of 1

******************************************************************

AVERAGES (over loops 2 - 19, use -l for complete log)

--------

1. Mapping copyBuffer as mappedPtr

             clEnqueueMapBuffer:  0.000011 s [  2929.59 GB/s ]

2. Host CL write from mappedPtr to inputBuffer

           clEnqueueWriteBuffer:  0.019592 s       1.71 GB/s

3. GPU kernel read of inputBuffer

       clEnqueueNDRangeKernel():  0.004592 s     146.13 GB/s

                verification ok

4. GPU kernel write to outputBuffer

       clEnqueueNDRangeKernel():  0.005429 s     123.62 GB/s

5. Host CL read of outputBuffer to mappedPtr

            clEnqueueReadBuffer:  0.019979 s       1.68 GB/s

                 verification ok

6. Unmapping copyBuffer

      clEnqueueUnmapMemObject():  0.000040 s [   836.50 GB/s ]

Passed!

******************************************************************

tzachi_cohen
Staff

Re: Slow PCIe bus with single GPU in x16 slots, seeking cause

It seems the PCIe bus is underutilized in the second run. On my machine it runs at full PCIe speed.

I suggest you run an independent PCIe performance test, and if the problem persists, contact the motherboard distributor.

drallan
Challenger

Re: Slow PCIe bus with single GPU in x16 slots, seeking cause

Wow, thanks for the fast reply.

Just to be clear, can you post your numbers from the second test?

What OS/driver did you use?

Again, thanks.

revisionfx
Journeyman III

Re: Slow PCIe bus with single GPU in x16 slots, seeking cause

I'm here to report that we discovered a bug along the same lines with some new Asus motherboards.

Here's a snapshot of the BIOS (Sandy Bridge Extreme, 6 cores):

http://www.revisioneffects.com/bugreports/Bios.jpg

We have 3 test computers: 2 have that motherboard with that BIOS, and another has a different Asus board (a 4-core SB). It's what Computer Central dropped in our case for some lab testing.

And four 7970s (2 XFX and 2 Sapphire).

That's what it took to isolate the issue.

Diagnosis: RAM to GPU memory transfer is very slow (via OpenCL) on the Sandy Bridge 6-core motherboards we have. The cards are fine, as they work fine on the other motherboard (in PCIe 2.0). On that motherboard it's very slow in either the PCIe 2.0 or 3.0 slots.

Question 1: Does anyone know how to reach tier 2 at Asus to report such problems? It's not as easy as you'd think (beyond reaching a phone machine...).

I do see they upgraded the BIOS for their own 7970, but we tried that and it makes no difference.

Question 2: I tried BufferBandwidth and my results are not as clear as drallan's, but they look like the inverse: in our case it's the first run that looks wacky. I'm not used to having to worry about PCIe. What is an independent PCIe performance test?

This is under Windows 7 with, as far as I know, all the latest OpenCL stuff...

- pierre

(jasmin at revisionfx dot com)

drallan
Challenger

Re: Slow PCIe bus with single GPU in x16 slots, seeking cause

I don't know if it helps, but I did learn more about my slow PCIe problem.

It appears to be the motherboard design and relates to using PLX PCIe switches and NF200 chips to get more lanes to the PCIe cards.

My (Z68) board has 5 PCIe slots.

  1. One x16 slot. When used, the other 4 slots are turned off, so you can use only 1 GPU. This slot works at full speed (x16, about 5.7 GB/s).
  2. A group of 4 slots that are switched by the NF200 and PLX chips. All of these slots run very slowly no matter what the configuration, even with 1 card in the fastest x16 slot.

My board sounds like your "good" board: all slots slow except one. Your two X79 boards are a surprise though; the X79 has about double the lane capacity, but may also have some design issues.

More than one motherboard manufacturer has been a bit deceptive about PCIe performance. Several manuals caution against using certain slot combinations for performance reasons, while the advertising leads you to think these combinations perform the same.

Sorry, I don't know how to get the manufacturers' attention on such problems. Forums for specific board brands sometimes have company reps who occasionally respond.

Good luck,

  drallan

nou
Exemplar

Re: Slow PCIe bus with single GPU in x16 slots, seeking cause

These boards are designed for SLI/CrossFire setups, where slower PCIe communication doesn't introduce much slowdown in games. There are tests where a full x16 slot was artificially cut down to x8, x4, x2, and even x1; even an x4 PCIe slot introduces only a small slowdown (around 10-15%) in games.

revisionfx
Journeyman III

Re: Slow PCIe bus with single GPU in x16 slots, seeking cause

"It appears to be the motherboard design and relates to using PLX PCIe

switches and NF200 chips to get more lanes to the PCIe cards."

Thanks,

I see that Relee, on 3/10/12 at 8:05 pm, relayed that he found a good hint with a different motherboard; same problem:

http://www.overclock.net/t/1223512/s-l-o-w-pcie-bus-with-single-gpu-6970-7970-in-any-slot-why

He says he has basically the same problem as I do, with a different motherboard, and that the bus refuses to switch to burst mode. I just asked him if he has found out more since.

I remember seeing a similar issue with an Nvidia driver on Linux versus Windows on the same machine. I wonder if the AMD OpenCL driver has a role in there?

Pierre
