13 Replies Latest reply on Sep 7, 2012 4:02 AM by rocky67

    Slow PCIe bus with single GPU in x16 slots, seeking cause

    drallan

      PCIe performance becomes important when running multiple high-end cards like the new 7970, which, when overclocked, runs at almost 5 teraflops.

       

      Trying to optimize a kernel, I discovered that my PCIe bus is limited to 1.6 GB/s, read and write, where it should be about 5-6 GB/s in a v2.0 x16 slot. I've tried several GPUs, one at a time, in every slot and always get the same numbers. I also updated the main board BIOS and drivers, the AMD drivers, and tried every BIOS configuration, the whole works.

       

      I get identical numbers from programs like PCIeBandwidth, PCIspeedtest (v0.2), and my own code using all the suggested methods from the AMD APP OpenCL Programming Guide (Dec 2011) for fast path transfers (a good read). The numbers I get are:

       

      PCIe x4 slot, transfer rate=1.40 GB/s (read and write) (one card)

      PCIe x16 at x8, transfer rate=1.65 GB/s (read and write) (requires 2 cards)

      PCIe x16 at x16, transfer rate=1.65 GB/s (read and write) (one card)

       

      Also note that the 1.40 GB/s rate for the x4 slot is correct; extrapolated to x16 it would be 5.6 GB/s. The x16 slots are faster, but not by much. According to GPU-z, the x16 slot is running in x16 v2.0 PCIe mode.
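
      The timing loop in my own test is essentially of the following form (a minimal sketch rather than the exact code; the buffer size, loop count, function name, and omitted error checking are illustrative):

      /* Minimal host-to-device bandwidth timing sketch (illustrative only; error checks omitted). */
      #include <CL/cl.h>
      #include <stdlib.h>
      #include <string.h>
      #include <time.h>

      #define NBYTES (32u * 1024u * 1024u)   /* 32 MB, same order as the tests above */
      #define LOOPS  20

      /* ctx and queue are assumed to have been created elsewhere. */
      double write_bandwidth(cl_context ctx, cl_command_queue queue)
      {
          void  *host = malloc(NBYTES);                    /* plain (non-pinned) host memory */
          cl_mem dev  = clCreateBuffer(ctx, CL_MEM_READ_ONLY, NBYTES, NULL, NULL);
          memset(host, 1, NBYTES);

          /* Warm-up transfer, then time LOOPS blocking writes. */
          clEnqueueWriteBuffer(queue, dev, CL_TRUE, 0, NBYTES, host, 0, NULL, NULL);
          clock_t t0 = clock();
          for (int i = 0; i < LOOPS; ++i)
              clEnqueueWriteBuffer(queue, dev, CL_TRUE, 0, NBYTES, host, 0, NULL, NULL);
          clFinish(queue);
          double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

          clReleaseMemObject(dev);
          free(host);
          return (double)NBYTES * LOOPS / sec / 1e9;       /* GB/s */
      }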

      PCIe problems can be due to a combination of factors, but I doubt a pure hardware problem because I've tried 6870, 6970, and 7970 GPUs, and because I'm using a new top-end main board specifically designed for high PCIe performance with 3-way CrossFire (ASRock Z68 Extreme7), with 5 PCIe slots: one PCIe v3.0 slot, 3 for multiple GPUs (x16, x16, 0) or (x16, x8, x8), and a dedicated x4 slot. It also uses PLX PEX8608 and NF200 chips to increase the PCIe lane count.

       

      I'm currently using the new 12.3 drivers (dated Feb 16, 2012).

       

      I've worked with PCIe before and know how complex these problems can be. Any help or feedback is greatly appreciated, particularly recent measurements of PCIe bus performance. Any help from AMD devgurus is also welcome (of course).

       

      I will add anything useful that I learn in this thread.

       

      Many thanks.

       

      Other info:

      Crossfire is not connected or selected.

      GPU-z says the cards are running x16 v2.0 when in x16 slots.

      GPU-z reports the 7970 as x16 v3.0 running at x16 v1.1. When the GPU is loaded it switches to run at x16 v2.0. This does not affect the low PCIe performance problem.

        • Re: Slow PCIe bus with single GPU in x16 slots, seeking cause
          drallan

          Update 1:

           

          Using CL_MEM_USE_PERSISTENT_MEM_AMD to copy directly to an on-device buffer transfers the data at full bandwidth (5.6 GB/s). Thus, the problem is not likely to be hardware.
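
          Roughly, that fast path looks like the following (a minimal sketch, not my exact code; the function name and arguments are illustrative, and CL_MEM_USE_PERSISTENT_MEM_AMD is the AMD extension flag from CL/cl_ext.h):

          /* Sketch: host writes directly into a device-resident, host-visible buffer
           * created with the AMD extension flag CL_MEM_USE_PERSISTENT_MEM_AMD.
           * ctx/queue are assumed valid; error checks omitted for brevity. */
          #include <CL/cl.h>
          #include <CL/cl_ext.h>      /* CL_MEM_USE_PERSISTENT_MEM_AMD */
          #include <string.h>

          void write_persistent(cl_context ctx, cl_command_queue queue,
                                const void *src, size_t nbytes)
          {
              cl_mem dev = clCreateBuffer(ctx,
                                          CL_MEM_READ_ONLY | CL_MEM_USE_PERSISTENT_MEM_AMD,
                                          nbytes, NULL, NULL);

              /* Map the device buffer into the host address space and copy into it. */
              void *p = clEnqueueMapBuffer(queue, dev, CL_TRUE, CL_MAP_WRITE,
                                           0, nbytes, 0, NULL, NULL, NULL);
              memcpy(p, src, nbytes);                  /* the host writes straight across the bus */
              clEnqueueUnmapMemObject(queue, dev, p, 0, NULL, NULL);
              clFinish(queue);

              clReleaseMemObject(dev);   /* in real code, keep the buffer around for the kernel */
          }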

           

          All other recommended options that promise peak interconnect bandwidth by using pinned or pre-pinned memory are slow (1.6 GB/s), the same speed as non-pinned memory.

           

          The question, then, is why pinned memory does not work at full speed. Is the memory really pinned?

           

          The following requirements for pinned memory were followed (a sketch of the pre-pinned path being tested appears after this list):

          1. All buffers are aligned to 4KB boundaries.

          2. The buffers are not used as kernel arguments while transferring/mapping.
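
          The pre-pinned path being tested follows the programming guide's recipe, roughly as below (again a sketch with illustrative names; error checks omitted):

          /* Sketch of the pre-pinned transfer path: a CL_MEM_ALLOC_HOST_PTR buffer
           * is mapped once to obtain a pinned host pointer, and that pointer is then
           * used as the source for clEnqueueWriteBuffer. ctx/queue/dev assumed valid. */
          #include <CL/cl.h>

          void write_prepinned(cl_context ctx, cl_command_queue queue,
                               cl_mem dev, size_t nbytes)
          {
              /* Pinned staging buffer in host memory. */
              cl_mem pinned = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, nbytes, NULL, NULL);

              /* Map once, outside the transfer loop, to get the pinned pointer. */
              void *host = clEnqueueMapBuffer(queue, pinned, CL_TRUE, CL_MAP_WRITE,
                                              0, nbytes, 0, NULL, NULL, NULL);

              /* ... fill host[] ... then DMA it to the device buffer. */
              clEnqueueWriteBuffer(queue, dev, CL_TRUE, 0, nbytes, host, 0, NULL, NULL);

              clEnqueueUnmapMemObject(queue, pinned, host, 0, NULL, NULL);
              clFinish(queue);
              clReleaseMemObject(pinned);
          }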

           

          Again, any input is appreciated.

           

          drallan

            • Re: Slow PCIe bus with single GPU in x16 slots, seeking cause
              tzachi.cohen

              So that we have a common language, please run the AMD SDK sample 'BufferBandwidth' with the arguments '-t 3 -if 3 -of 3' and post the console output.

              It performs pre-pinned memory transfer operations and will help us better understand the issue.

              What OS are you using?

                • Re: Slow PCIe bus with single GPU in x16 slots, seeking cause
                  drallan

                  Hello tzachi.cohen, thanks for responding. I am using Windows 7 x64, Service Pack 1.

                   

                  Below I show output for two tests

                  1. BufferBandwidth -t 3  -if 3 -of 3

                  2. BufferBandwidth  -t 3  -if 0 -of 1

                   

                  My understanding is that for case 2, clEnqueueRead/WriteBuffer transfers the data directly to the GPU before the kernel is enqueued.

                   

                  drallan

                   

                  Microsoft Windows [Version 6.1.7601]

                  ******************************************************************

                  Command line: D:\>bufferbandwidth -d 2 -t 3  -if 3 -of 3

                  ******************************************************************

                  PCIE slot           1 [x16 v3.0] running at [x16 v2.0]

                  Device 2            Tahiti

                  Build:               _WINxx release

                  GPU work items:      32768

                  Buffer size:         33554432

                  CPU workers:         1

                  Timing loops:        20

                  Repeats:             1

                  Kernel loops:        20

                  inputBuffer:         CL_MEM_READ_ONLY CL_MEM_USE_HOST_PTR

                  outputBuffer:        CL_MEM_WRITE_ONLY CL_MEM_USE_HOST_PTR

                  copyBuffer:          CL_MEM_READ_WRITE CL_MEM_ALLOC_HOST_PTR

                   

                  Host baseline (single thread, naive):

                  Timer resolution  301.861 ns

                  Page fault  506.466

                  Barrier speed  61.21581 ns

                  CPU read   12.917 GB/s

                  memcpy()   6.3788 GB/s

                  memset(,1,)   16.3768 GB/s

                  memset(,0,)   16.3993 GB/s

                   

                  AVERAGES (over loops 2 - 19, use -l for complete log)

                  --------

                  1. Mapping copyBuffer as mappedPtr

                               clEnqueueMapBuffer:  0.000007 s [  5002.27 GB/s ]

                  2. Host CL write from mappedPtr to inputBuffer

                             clEnqueueWriteBuffer:  0.004682 s       7.17 GB/s

                  3. GPU kernel read of inputBuffer

                         clEnqueueNDRangeKernel():  0.548830 s       1.22 GB/s

                                  verification ok

                   

                  4. GPU kernel write to outputBuffer

                         clEnqueueNDRangeKernel():  0.863300 s       0.78 GB/s

                  5. Host CL read of outputBuffer to mappedPtr

                              clEnqueueReadBuffer:  0.004887 s       6.87 GB/s

                                   verification ok

                  6. Unmapping copyBuffer

                        clEnqueueUnmapMemObject():  0.000039 s [   850.00 GB/s ]

                  Passed!

                   

                  ******************************************************************

                  Command line: D:\>bufferbandwidth -d 2 -t 3  -if 0 -of 1

                  ******************************************************************

                  AVERAGES (over loops 2 - 19, use -l for complete log)

                  --------

                  1. Mapping copyBuffer as mappedPtr

                               clEnqueueMapBuffer:  0.000011 s [  2929.59 GB/s ]

                  2. Host CL write from mappedPtr to inputBuffer

                             clEnqueueWriteBuffer:  0.019592 s       1.71 GB/s

                  3. GPU kernel read of inputBuffer

                         clEnqueueNDRangeKernel():  0.004592 s     146.13 GB/s

                                  verification ok

                   

                  4. GPU kernel write to outputBuffer

                         clEnqueueNDRangeKernel():  0.005429 s     123.62 GB/s

                  5. Host CL read of outputBuffer to mappedPtr

                              clEnqueueReadBuffer:  0.019979 s       1.68 GB/s

                                   verification ok

                  6. Unmapping copyBuffer

                        clEnqueueUnmapMemObject():  0.000040 s [   836.50 GB/s ]

                  Passed!

                  ******************************************************************
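
                  For what it's worth, the sample's host-side timers can be cross-checked by timing the same clEnqueueWriteBuffer call with OpenCL event profiling. A minimal sketch (assumes the queue was created with CL_QUEUE_PROFILING_ENABLE; the helper name is illustrative):

                  /* Sketch: time one clEnqueueWriteBuffer with OpenCL event profiling.
                   * Assumes 'queue' was created with CL_QUEUE_PROFILING_ENABLE.
                   * Error checks omitted. */
                  #include <CL/cl.h>
                  #include <stdio.h>

                  void profile_write(cl_command_queue queue, cl_mem dev,
                                     const void *host, size_t nbytes)
                  {
                      cl_event ev;
                      clEnqueueWriteBuffer(queue, dev, CL_TRUE, 0, nbytes, host, 0, NULL, &ev);

                      cl_ulong t0, t1;
                      clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
                      clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,   sizeof(t1), &t1, NULL);

                      /* t1 - t0 is in nanoseconds, so bytes per ns equals GB/s. */
                      printf("transfer: %.2f GB/s\n", (double)nbytes / (double)(t1 - t0));
                      clReleaseEvent(ev);
                  }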

              • Re: Slow PCIe bus with single GPU in x16 slots, seeking cause
                revisionfx

                I'm here to report that we discovered a bug with some new Asus motherboards along these lines.

                 

                Here's a snapshot of the BIOS (Sandy Bridge Extreme, 6 cores):

                http://www.revisioneffects.com/bugreports/Bios.jpg

                 

                We have 3 test computers: 2 have that motherboard with that BIOS, and another has a different Asus board (a 4-core SB). It's what Computer Central dropped in our case for some lab testing.

                And four 7970s (2 XFX and 2 Sapphire).

                That's what it took to isolate the issue.

                 

                Diagnostic: RAM-to-GPU memory transfer is very slow (via OpenCL) on the Sandy Bridge 6-core motherboard we have. The cards are fine, since they work fine on the other motherboard (in PCIe 2.0). On that motherboard it is very slow in either the PCIe 2.0 or 3.0 slot.

                 

                Question 1: Does anyone know how to reach tier 2 support at Asus to report such problems? It's not as easy as you'd think (aside from getting a phone machine...).

                 

                I do see they updated the BIOS for their own 7970, but we tried that and it makes no difference.

                 

                Question 2: I tried BufferBandwidth and the result is not as clear-cut as drallan's, but it looks like the inverse: in our case it's the first run that comes out wacko. I am not used to having to worry about PCIe. What is a good independent PCIe performance test?

                 

                This is under Windows 7 with, as far as I know, all the latest OpenCL components.

                 

                - pierre

                (jasmin at revisionfx dot com)

                  • Re: Slow PCIe bus with single GPU in x16 slots, seeking cause
                    drallan

                    I don't know if it helps, but I did learn more about my slow PCIe problem.

                     

                    It appears to be the motherboard design, and relates to using PLX PCIe switches and NF200 chips to route more lanes to the PCIe slots.

                     

                    My (Z68) board has 5 PCIe slots.

                    1. One x16 slot. When it is used, all the other 4 slots are turned off, so you can use only 1 GPU. This slot works at full speed (x16, or 5.7 GB/s).
                    2. A group of 4 slots that are switched by the NF200 and PLX chips. All of these slots run very slowly no matter what the configuration, even with 1 card in the fastest x16 slot.

                     

                    My board sounds like your "good" board: all slow except one slot. Your two X79 boards are a surprise, though; the X79 has about double the lane capacity, but it may also have some design issues.

                     

                    More than one motherboard manufacturer has been a bit deceptive about PCIe performance. Several manuals caution against using certain slot combinations for performance reasons, but the advertising leads you to think these combinations have the same performance.

                     

                    Sorry, I don't know how to get the manufacturers' attention on such problems. Forums for specific board brands sometimes have company reps who occasionally respond.

                     

                    Good luck,

                     

                      drallan

                  • Re: Slow PCIe bus with single GPU in x16 slots, seeking cause
                    rocky67

                    Please advise what the best option is when the same numbers are shown by different programs such as PCIspeedtest and PCIeBandwidth, because I am stuck.