
willsong
Journeyman III

PCIe transfer bandwidth for multi-GPU

Hi,

We are currently testing what kind of host-to-device bandwidth we can achieve in OpenCL with a multi-GPU setup.  Our setup is four Radeon HD 7990 cards on a dual-CPU motherboard, running SLES 11 SP2 with the AMD Catalyst 13.4 (beta) driver for Linux.

Through some testing, we have determined the following:

  • The OpenCL runtime identifies 8 devices (0 to 7), since each HD 7990 is a dual-GPU card
  • Device IDs 0 - 3 are "attached" to CPU 0
  • Device IDs 4 - 7 are "attached" to CPU 1 (one way to check this mapping is sketched below)
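
For anyone who wants to double-check which CPU a given card hangs off, on Linux the NUMA node of each PCIe device can be read from sysfs.  A minimal sketch - the bus addresses shown are placeholders; the real ones come from lspci (or from the cl_amd_device_attribute_query extension, if available):

#include <stdio.h>

/* Read the NUMA node a PCIe device is attached to, e.g. for "0000:05:00.0".
 * Returns -1 if the node is unknown or the device does not exist. */
static int pci_numa_node(const char *bdf)
{
    char path[128];
    int node = -1;
    snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/numa_node", bdf);
    FILE *f = fopen(path, "r");
    if (f) {
        fscanf(f, "%d", &node);
        fclose(f);
    }
    return node;
}

int main(void)
{
    /* Placeholder bus addresses - substitute the ones lspci reports for the 7990s. */
    const char *gpus[] = { "0000:05:00.0", "0000:06:00.0" };
    for (int i = 0; i < 2; ++i)
        printf("%s -> NUMA node %d\n", gpus[i], pci_numa_node(gpus[i]));
    return 0;
}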

Our test simply transfers data from host memory to device memory.  We use a single context for all devices and a separate command queue for each device.  Each command queue is driven by its own host thread, i.e. the data is transferred to all devices concurrently (a simplified sketch of the test follows).
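
Not our exact code, but the structure looks roughly like this; error checking is omitted, and the buffer size and iteration count are illustrative:

/* Simplified sketch of the concurrent host-to-device bandwidth test:
 * one context for all devices, one command queue per device, one host
 * thread per queue.  Build with -lOpenCL -lpthread. */
#include <CL/cl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define BUF_SIZE   (256UL * 1024 * 1024)   /* 256 MiB per transfer (illustrative) */
#define ITERATIONS 20

typedef struct {
    cl_context   context;
    cl_device_id device;
    int          index;
} thread_arg_t;

static double now_sec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

static void *transfer_thread(void *p)
{
    thread_arg_t *arg = (thread_arg_t *)p;

    /* Separate in-order command queue per device. */
    cl_command_queue queue = clCreateCommandQueue(arg->context, arg->device, 0, NULL);
    cl_mem dev_buf = clCreateBuffer(arg->context, CL_MEM_READ_ONLY, BUF_SIZE, NULL, NULL);
    void *host_buf = malloc(BUF_SIZE);

    double t0 = now_sec();
    for (int i = 0; i < ITERATIONS; ++i) {
        /* Blocking write: host memory -> device memory. */
        clEnqueueWriteBuffer(queue, dev_buf, CL_TRUE, 0, BUF_SIZE, host_buf, 0, NULL, NULL);
    }
    clFinish(queue);
    double elapsed = now_sec() - t0;

    printf("device %d: %.2f GB/s\n", arg->index,
           (double)BUF_SIZE * ITERATIONS / elapsed / 1e9);

    free(host_buf);
    clReleaseMemObject(dev_buf);
    clReleaseCommandQueue(queue);
    return NULL;
}

int main(void)
{
    cl_platform_id platform;
    cl_device_id   devices[8];
    cl_uint        num_devices;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 8, devices, &num_devices);

    /* Single context shared by all devices. */
    cl_context context = clCreateContext(NULL, num_devices, devices, NULL, NULL, NULL);

    pthread_t    threads[8];
    thread_arg_t args[8];
    for (cl_uint i = 0; i < num_devices; ++i) {
        args[i].context = context;
        args[i].device  = devices[i];
        args[i].index   = (int)i;
        pthread_create(&threads[i], NULL, transfer_thread, &args[i]);
    }
    for (cl_uint i = 0; i < num_devices; ++i)
        pthread_join(threads[i], NULL);

    clReleaseContext(context);
    return 0;
}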

Our tests show the following results:

  • Running the test on device IDs 0 and 7 only (i.e. attached to different CPUs) results in around 9.8 GB/s bandwidth on each device
    • We think this is a reasonable value, since the BufferBandwidth test in the AMD samples reports similar values
  • Running the test on device IDs 0 and 1 only (i.e. the same physical card, two GPUs sharing one PCIe slot) results in around 6.0 GB/s bandwidth on each device
    • We think this is probably reasonable, as the two GPUs contend for the shared slot (is this a correct assumption?)
  • Running the test on device IDs 0 and 3 only (i.e. attached to the same CPU, but on two different physical cards) results in around 6.3 GB/s bandwidth on each device
    • Increasing the number of devices (e.g. running the test on device IDs 0, 1 and 3) results in even lower bandwidth
    • Since these GPUs do not share PCIe slots, we expected near-full bandwidth from each device

We have the following questions:

  1. Is our assumption correct that a dual-GPU card delivers roughly half the transfer bandwidth to each of its two devices when both are transferring concurrently?
  2. Are our test results expected, i.e. GPUs attached to different CPUs can each reach full bandwidth, but GPUs attached to the same CPU only reach about half?  Is this a hardware (motherboard) issue?

Any advice/comments would be very much appreciated.

Thanks!

moozoo
Adept III

Re number 2, what motherboard are you using? The motherboard manual should document the PCIe speeds.


We are using the Supermicro X9DRG-QF.  The manual states that it has four PCIe 3.0 x16 (double-width) slots.  It does not list transfer speeds explicitly, but being PCIe 3.0 x16, each slot should theoretically be capable of up to about 15.75 GB/s in each direction.
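
(For reference, that figure comes from PCIe 3.0 running at 8 GT/s per lane with 128b/130b encoding, i.e. about 0.985 GB/s per lane per direction, so an x16 link tops out at roughly 15.75 GB/s per direction before protocol overhead.)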

Meteorhead
Challenger

This seems like a runtime issue. I would check what happens when you create separate contexts for the devices, whether that produces the expected results, and maybe fiddle around with the threading (create the contexts in different threads and do everything in different threads...). If one of these changes increases bandwidth to the expected rate, then the runtime cannot properly handle multiple devices in a single context being fed from a single thread.


Thanks for the suggestion.  I have tried the following scenarios:

  • Single context for all devices (from the original post)
  • Multiple contexts (one for each device) - created in the main thread, and used in the subsequently created device threads
    • Same results as the original post
  • Multiple contexts (one for each device) - created from the device threads
    • Same results as the original post

So, not much progress with multiple contexts - perhaps suggesting that this is a hardware issue?
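
For concreteness, the last variant (one context per device, created inside the device thread itself) changes the thread function along these lines; this is a sketch of just that part, with the timed transfer loop elided:

#include <CL/cl.h>

/* Sketch: context and queue are both private to this thread/device.
 * Error checking omitted; the timed clEnqueueWriteBuffer loop is the same as before. */
static void *transfer_thread(void *p)
{
    cl_device_id device = *(cl_device_id *)p;

    cl_context       ctx   = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    /* ... allocate the buffer and run the timed clEnqueueWriteBuffer loop here ... */

    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    return NULL;
}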


It is either an inherent runtime issue, or the runtime is not doing something the way it should. (Similar to how the GPU_ASYNC_MEM_OBJECTS environment variable had to be used to achieve maximum bandwidth. I do not know the current state of environment-variable hacks that alter runtime behavior.)
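
For completeness, an environment variable like that can be set in the shell before launching the test, or programmatically before the first OpenCL call; whether GPU_ASYNC_MEM_OBJECTS (and the value to give it) still has any effect on current drivers is exactly what I do not know, so treat this as a sketch only:

#include <stdlib.h>

int main(void)
{
    /* Assumption: "1" enables the behaviour; the variable may be ignored by newer runtimes.
     * Must run before the first OpenCL call (e.g. clGetPlatformIDs) so the runtime sees it. */
    setenv("GPU_ASYNC_MEM_OBJECTS", "1", 1);

    /* ... normal OpenCL host code follows ... */
    return 0;
}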
