
willsong
Journeyman III

PCIe transfer bandwidth for multi-GPU

Hi,

We are currently testing what kind of host-to-device bandwidth we can achieve in OpenCL with a multi-GPU setup.  Our setup is four Radeon HD 7990 cards on a dual-CPU motherboard, running SLES 11 SP2 with the AMD Catalyst 13.4 (beta) driver for Linux.

Through some testing, we have determined the following:

  • The OpenCL runtime identifies 8 devices (0 to 7), since each HD 7990 is a dual-GPU card
  • Device IDs 0 - 3 are "attached" to CPU 0
  • Device IDs 4 - 7 are "attached" to CPU 1 (one way to check this mapping is sketched below)
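
For anyone who wants to double-check which CPU a given card hangs off, on Linux the NUMA node of each PCIe device can be read from sysfs.  A minimal sketch - the bus addresses shown are placeholders; the real ones come from lspci (or from the cl_amd_device_attribute_query extension, if available):

#include <stdio.h>

/* Read the NUMA node a PCIe device is attached to, e.g. for "0000:05:00.0".
 * Returns -1 if the node is unknown or the device does not exist. */
static int pci_numa_node(const char *bdf)
{
    char path[128];
    int node = -1;
    snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/numa_node", bdf);
    FILE *f = fopen(path, "r");
    if (f) {
        fscanf(f, "%d", &node);
        fclose(f);
    }
    return node;
}

int main(void)
{
    /* Placeholder bus addresses - substitute the ones lspci reports for the 7990s. */
    const char *gpus[] = { "0000:05:00.0", "0000:06:00.0" };
    for (int i = 0; i < 2; ++i)
        printf("%s -> NUMA node %d\n", gpus[i], pci_numa_node(gpus[i]));
    return 0;
}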

Our test simply transfers data from host memory to device memory.  We use a single context for all devices and a separate command queue for each device.  Each command queue is driven by its own host thread, i.e. the data is transferred to all devices concurrently (a simplified sketch of the test follows).
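
Not our exact code, but the structure looks roughly like this; error checking is omitted, and the buffer size and iteration count are illustrative:

/* Simplified sketch of the concurrent host-to-device bandwidth test:
 * one context for all devices, one command queue per device, one host
 * thread per queue.  Build with -lOpenCL -lpthread. */
#include <CL/cl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define BUF_SIZE   (256UL * 1024 * 1024)   /* 256 MiB per transfer (illustrative) */
#define ITERATIONS 20

typedef struct {
    cl_context   context;
    cl_device_id device;
    int          index;
} thread_arg_t;

static double now_sec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

static void *transfer_thread(void *p)
{
    thread_arg_t *arg = (thread_arg_t *)p;

    /* Separate in-order command queue per device. */
    cl_command_queue queue = clCreateCommandQueue(arg->context, arg->device, 0, NULL);
    cl_mem dev_buf = clCreateBuffer(arg->context, CL_MEM_READ_ONLY, BUF_SIZE, NULL, NULL);
    void *host_buf = malloc(BUF_SIZE);

    double t0 = now_sec();
    for (int i = 0; i < ITERATIONS; ++i) {
        /* Blocking write: host memory -> device memory. */
        clEnqueueWriteBuffer(queue, dev_buf, CL_TRUE, 0, BUF_SIZE, host_buf, 0, NULL, NULL);
    }
    clFinish(queue);
    double elapsed = now_sec() - t0;

    printf("device %d: %.2f GB/s\n", arg->index,
           (double)BUF_SIZE * ITERATIONS / elapsed / 1e9);

    free(host_buf);
    clReleaseMemObject(dev_buf);
    clReleaseCommandQueue(queue);
    return NULL;
}

int main(void)
{
    cl_platform_id platform;
    cl_device_id   devices[8];
    cl_uint        num_devices;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 8, devices, &num_devices);

    /* Single context shared by all devices. */
    cl_context context = clCreateContext(NULL, num_devices, devices, NULL, NULL, NULL);

    pthread_t    threads[8];
    thread_arg_t args[8];
    for (cl_uint i = 0; i < num_devices; ++i) {
        args[i].context = context;
        args[i].device  = devices[i];
        args[i].index   = (int)i;
        pthread_create(&threads[i], NULL, transfer_thread, &args[i]);
    }
    for (cl_uint i = 0; i < num_devices; ++i)
        pthread_join(threads[i], NULL);

    clReleaseContext(context);
    return 0;
}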

Our tests show the following results:

  • Running the test on device IDs 0 and 7 only (i.e. attached to different CPUs) results in around 9.8 GB/s bandwidth on each device
    • We think this is a reasonable value, since the BufferBandwidth test in the AMD samples reports similar values
  • Running the test on device IDs 0 and 1 only (i.e. the same physical card, two GPUs sharing one PCIe slot) results in around 6.0 GB/s bandwidth on each device
    • We think this is probably reasonable, as the two GPUs contend for the shared slot (is this a correct assumption?)
  • Running the test on device IDs 0 and 3 only (i.e. attached to the same CPU, but on two different physical cards) results in around 6.3 GB/s bandwidth on each device
    • Increasing the number of devices (e.g. running the test on device IDs 0, 1 and 3) results in even lower bandwidth
    • Since these GPUs do not share PCIe slots, we expected near-full bandwidth from each device

We have the following questions:

  1. Is our assumption correct that a dual-GPU card delivers roughly half the transfer bandwidth to each of its two devices when both are transferring concurrently?
  2. Are our test results expected, i.e. GPUs attached to different CPUs can each reach full bandwidth, but GPUs attached to the same CPU only reach about half?  Is this a hardware (motherboard) issue?

Any advice/comments would be very much appreciated.

Thanks!

moozoo
Adept III

Re number 2, what motherboard are you using? The motherboard manual should document the PCIe speeds.


We are using the Supermicro X9DRG-QF.  The manual states that it has four PCIe 3.0 x16 (double-width) slots.  It does not list transfer speeds explicitly, but being PCIe 3.0 x16, each slot should theoretically be capable of up to about 15.75 GB/s in each direction.
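
(For reference, that figure comes from PCIe 3.0 running at 8 GT/s per lane with 128b/130b encoding, i.e. about 0.985 GB/s per lane per direction, so an x16 link tops out at roughly 15.75 GB/s per direction before protocol overhead.)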

Meteorhead
Challenger

This seems like a runtime issue. I would check what happens when you create separate contexts for the devices, whether that produces the expected results, and maybe fiddle around with the threading (create the contexts in different threads and do everything in different threads...). If one of these changes increases bandwidth to the expected rate, then the runtime cannot properly handle multiple devices in a single context being fed from a single thread.


Thanks for the suggestion.  I have tried the following scenarios:

  • Single context for all devices (from the original post)
  • Multiple contexts (one for each device) - created in the main thread, and used in the subsequently created device threads
    • Same results as the original post
  • Multiple contexts (one for each device) - created from the device threads
    • Same results as the original post

So, not much progress with multiple contexts - perhaps suggesting that this is a hardware issue?
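
For concreteness, the last variant (one context per device, created inside the device thread itself) changes the thread function along these lines; this is a sketch of just that part, with the timed transfer loop elided:

#include <CL/cl.h>

/* Sketch: context and queue are both private to this thread/device.
 * Error checking omitted; the timed clEnqueueWriteBuffer loop is the same as before. */
static void *transfer_thread(void *p)
{
    cl_device_id device = *(cl_device_id *)p;

    cl_context       ctx   = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    /* ... allocate the buffer and run the timed clEnqueueWriteBuffer loop here ... */

    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    return NULL;
}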


It is either an inherent runtime issue, or the runtime is not doing something the way it should. (Similar to how the GPU_ASYNC_MEM_OBJECTS environment variable had to be used to achieve maximum bandwidth. I do not know the current state of environment-variable hacks that alter runtime behavior.)
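
For completeness, an environment variable like that can be set in the shell before launching the test, or programmatically before the first OpenCL call; whether GPU_ASYNC_MEM_OBJECTS (and the value to give it) still has any effect on current drivers is exactly what I do not know, so treat this as a sketch only:

#include <stdlib.h>

int main(void)
{
    /* Assumption: "1" enables the behaviour; the variable may be ignored by newer runtimes.
     * Must run before the first OpenCL call (e.g. clGetPlatformIDs) so the runtime sees it. */
    setenv("GPU_ASYNC_MEM_OBJECTS", "1", 1);

    /* ... normal OpenCL host code follows ... */
    return 0;
}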
