5 Replies Latest reply on Jan 29, 2014 6:01 AM by Meteorhead

    PCIe transfer bandwidth for multi-GPU

    willsong

      Hi,

       

      We are currently testing out what kind of bandwidth we can achieve in OpenCL from a multi-GPU setup.  Our setup is four Radeon HD 7990 cards on a dual-CPU motherboard, running SLES 11 SP2 with the AMD Catalyst v13.4 (beta) driver for Linux.

       

      Through some testing, we have determined the following:

       

      • The OpenCL runtime identifies 8 devices (0 to 7), since each HD 7990 is a dual-GPU card
      • Device IDs 0 - 3 are "attached" to CPU 0
      • Device IDs 4 - 7 are "attached" to CPU 1

       

      Our test simply transfers data from the host memory to the device memory.  We use a single context for all devices, and separate command queues for each device.  Each command queue is handled by a separate thread on the host side, i.e. the data is transferred to all devices concurrently.
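
      For reference, here is a stripped-down sketch of what our test does (not our exact code; the 256 MiB buffer size, iteration count and pageable host buffer are just placeholders, and error checking is omitted):

      // Simplified sketch of the bandwidth test: one shared context, one
      // command queue and one host thread per device, blocking writes timed
      // on the host.  Sizes/iterations are placeholders, no error checks.
      #include <CL/cl.h>
      #include <chrono>
      #include <cstdio>
      #include <thread>
      #include <vector>

      int main() {
          const size_t bytes = 256u * 1024 * 1024;  // 256 MiB per transfer
          const int    iters = 20;                  // transfers per device

          cl_platform_id platform;
          clGetPlatformIDs(1, &platform, NULL);

          cl_uint numDevices = 0;
          clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &numDevices);
          std::vector<cl_device_id> devices(numDevices);
          clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, numDevices, devices.data(), NULL);

          // Single context shared by all devices, as described above.
          cl_context ctx = clCreateContext(NULL, numDevices, devices.data(), NULL, NULL, NULL);

          std::vector<char> host(bytes, 0);         // pageable host buffer
          std::vector<std::thread> threads;

          for (cl_uint d = 0; d < numDevices; ++d) {
              threads.emplace_back([&, d]() {
                  // Per-device queue and buffer; the transfer loop is timed on the host.
                  cl_command_queue q   = clCreateCommandQueue(ctx, devices[d], 0, NULL);
                  cl_mem           buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, bytes, NULL, NULL);

                  auto t0 = std::chrono::steady_clock::now();
                  for (int i = 0; i < iters; ++i)
                      clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, bytes, host.data(),
                                           0, NULL, NULL);
                  clFinish(q);
                  auto t1 = std::chrono::steady_clock::now();

                  double secs = std::chrono::duration<double>(t1 - t0).count();
                  std::printf("device %u: %.2f GB/s\n", d,
                              (double)bytes * iters / secs / 1e9);

                  clReleaseMemObject(buf);
                  clReleaseCommandQueue(q);
              });
          }
          for (auto& t : threads) t.join();
          clReleaseContext(ctx);
          return 0;
      }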

       

      Our tests show the following results:

       

      • Running the test on device IDs 0 and 7 only (i.e. attached to different CPUs) results in around 9.8 GB/s bandwidth on each device
        • We think this is a reasonable value, since the BufferBandwidth test in the AMD samples reports similar values
      • Running the test on device IDs 0 and 1 only (i.e. same physical GPU, sharing the PCIe slot) results in around 6.0 GB/s bandwidth on each device
        • We think this is probably reasonable, since the two GPUs on the card share one PCIe slot and contend for its bandwidth (is this a correct assumption?)
      • Running the test on device IDs 0 and 3 only (i.e. attached to the same CPU, but two different physical GPUs) results in around 6.3 GB/s bandwidth on each device
        • Increasing the number of devices (e.g. running the test on device IDs 0, 1, 3) results in even slower bandwidth
        • Since the GPUs do not share PCIe slots, we expected near full bandwidth from each device

       

      We have the following questions:

       

      1. Is our assumption correct that a dual-GPU card will deliver roughly half the data transfer bandwidth to each of its two devices when they run concurrently?
      2. Are our test results expected, i.e. GPUs attached to different CPUs can each reach full bandwidth, but GPUs attached to the same CPU only reach about half the bandwidth?  Is this a hardware (motherboard) issue?

       

      Any advice/comments would be very much appreciated.

      Thanks!

        • Re: PCIe transfer bandwidth for multi-GPU
          moozoo

          Re number 2: what motherboard are you using? The motherboard manual should document the PCIe speeds.

          • Re: PCIe transfer bandwidth for multi-GPU
            Meteorhead

            This seems like a runtime issue. I would check what happens when you create separate contexts for the devices and whether that produces the expected results, and maybe fiddle around with threading (create the contexts in different threads and do everything in different threads...). If one of these changes raises the bandwidth to the expected rate, then the runtime cannot properly handle multiple devices in a single context being fed from a single thread.
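
            To illustrate the per-thread, per-device context idea, a rough sketch (untested, and the actual transfer/timing loop is omitted):

            // Sketch: one context per device, created inside that device's own thread.
            #include <CL/cl.h>
            #include <thread>
            #include <vector>

            static void run_device(cl_device_id dev) {
                // Context and queue are private to this thread/device.
                cl_context       ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
                cl_command_queue q   = clCreateCommandQueue(ctx, dev, 0, NULL);

                // ... enqueue and time the same host->device transfers here ...

                clReleaseCommandQueue(q);
                clReleaseContext(ctx);
            }

            int main() {
                cl_platform_id platform;
                clGetPlatformIDs(1, &platform, NULL);

                cl_uint n = 0;
                clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &n);
                std::vector<cl_device_id> devs(n);
                clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, n, devs.data(), NULL);

                std::vector<std::thread> threads;
                for (cl_uint i = 0; i < n; ++i)
                    threads.emplace_back(run_device, devs[i]);
                for (auto& t : threads) t.join();
                return 0;
            }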

              • Re: PCIe transfer bandwidth for multi-GPU
                willsong

                Thanks for the suggestion.  I have tried the following scenarios:

                 

                • Single context for all devices (from the original post)
                • Multiple contexts (one for each device) - created in the main thread, and used in the subsequently created device threads
                  • Same results as the original post
                • Multiple contexts (one for each device) - created from the device threads
                  • Same results as the original post

                 

                So, not much progress with multiple contexts - perhaps suggesting that this is a hardware issue?