Hi,
We are currently testing what host-to-device bandwidth we can achieve with OpenCL on a multi-GPU setup. The system is four Radeon HD 7990 cards on a dual-CPU motherboard, SLES 11 SP2, with the AMD Catalyst v13.4 (beta) driver for Linux.
Through some testing, we have determined the following:
Our test simply transfers data from host memory to device memory. We use a single context for all devices and a separate command queue for each device. Each command queue is serviced by a separate thread on the host side, i.e. the data is transferred to all devices concurrently.
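For concreteness, a minimal sketch of that setup (not our actual test harness; error handling is omitted, and NBYTES and the cap of 8 devices are arbitrary placeholders):

```c
#include <CL/cl.h>
#include <pthread.h>
#include <stdlib.h>

#define NBYTES (256u * 1024u * 1024u)  /* 256 MiB per device (placeholder) */

typedef struct {
    cl_command_queue queue;
    cl_mem           buf;
    void            *host;
} xfer_t;

/* Each host thread drives one queue with a blocking host->device write. */
static void *xfer_thread(void *arg)
{
    xfer_t *x = arg;
    clEnqueueWriteBuffer(x->queue, x->buf, CL_TRUE, 0, NBYTES,
                         x->host, 0, NULL, NULL);
    return NULL;
}

int main(void)
{
    cl_platform_id plat;
    cl_device_id   dev[8];
    cl_uint        ndev = 0;

    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 8, dev, &ndev);

    /* One context shared by all GPUs */
    cl_context ctx = clCreateContext(NULL, ndev, dev, NULL, NULL, NULL);

    pthread_t tid[8];
    xfer_t    x[8];
    for (cl_uint i = 0; i < ndev; ++i) {
        x[i].queue = clCreateCommandQueue(ctx, dev[i], 0, NULL);
        x[i].buf   = clCreateBuffer(ctx, CL_MEM_READ_ONLY, NBYTES, NULL, NULL);
        x[i].host  = malloc(NBYTES);  /* pageable; pinned memory usually transfers faster */
        pthread_create(&tid[i], NULL, xfer_thread, &x[i]);
    }
    for (cl_uint i = 0; i < ndev; ++i)
        pthread_join(tid[i], NULL);
    return 0;
}
```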
Our tests show the following results:
We have the following questions:
Any advice/comments would be very much appreciated.
Thanks!
Re number 2, what motherboard are you using? The motherboard manual should document the PCIe speeds.
We are using the Supermicro X9DRG-QF. The manual states that it has 4x PCIe 3.0 x16 (double-width) slots. It does not state transfer speeds, but being PCIe 3.0, each x16 slot should theoretically be capable of up to 15.75 GB/s.
This seems like a runtime issue. I would check what happens when you create separate contexts for the devices, whether that produces the expected results, and maybe experiment with threading (create the contexts in different threads and do everything in different threads...). If one of these changes increases bandwidth to the expected rate, then the runtime cannot properly handle multiple devices in a single context being fed from a single thread.
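Something like this for the per-device-context variant (setup only, error handling omitted; the transfer threads can stay the same, each thread just uses its own device's queue and buffer):

```c
#include <CL/cl.h>

/* Variant: one context per device instead of one shared context.
 * Each queue and buffer must then be created in its own device's context. */
void make_per_device_contexts(cl_device_id *dev, cl_uint ndev,
                              cl_context *ctx, cl_command_queue *queue)
{
    for (cl_uint i = 0; i < ndev; ++i) {
        ctx[i]   = clCreateContext(NULL, 1, &dev[i], NULL, NULL, NULL);
        queue[i] = clCreateCommandQueue(ctx[i], dev[i], 0, NULL);
    }
}
```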
Thanks for the suggestion. I have tried the following scenarios:
So, not much progress with multiple contexts - perhaps suggesting that this is a hardware issue?
It is either an inherent runtime issue, or the runtime is not doing something the way it should. (Similar to how the GPU_ASYNC_MEM_OBJECTS environment variable had to be used to achieve maximum bandwidths. I do not know the current situation of environment-variable hacks to alter runtime behavior.)