OpenCL

cadorino
Journeyman III

Cross-device bandwidth for discrete GPU (HD 5870)

Hi,

I'm testing a system equipped with a Fusion A8-3850 APU and a discrete HD 5870 GPU. I want to measure the memory access bandwidth in the following cases:

1) The discrete GPU (HD 5870) reads from a buffer allocated in the host memory (CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY)

2) The integrated GPU (6550D) reads from a buffer allocated in the host memory (CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY)

I assumed the result of the first test (discrete GPU) could never exceed the PCI Express bandwidth (approx. 8 GB/s), but I'm measuring around 40 GB/s.

I'm measuring the bandwidth with both the GlobalMemoryBandwidth sample shipped with the AMD SDK and a program I wrote myself. The results are very similar.

Can you explain whether (and why) it is possible for a discrete GPU to get a cross-domain (GPU->CPU) read bandwidth higher than the PCI Express one?

Thank you very much!

16 Replies
cadorino
Journeyman III

Re: Cross-device bandwidth for discrete GPU (HD 5870)

I forgot to mention that reads are performed linearly (each thread reads a fixed-size memory range starting from its own global index).
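To make the access pattern concrete, here is a host-side C++ emulation of it, not the actual kernel. It assumes each work-item reads `reads_per_thread` consecutive elements starting at its global index and wraps around at the end of the buffer; the wrap-around, the `float` element type and all names are assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Emulates one work-item: sums reads_per_thread consecutive elements
// starting at its own global index. Wrapping at the end of the buffer
// is an assumption; the buffer must be non-empty.
float read_linear_emulated(const std::vector<float>& buffer,
                           std::size_t global_id,
                           std::size_t reads_per_thread) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < reads_per_thread; ++i) {
        sum += buffer[(global_id + i) % buffer.size()];
    }
    return sum;
}

// Total bytes requested by all work-items. This counts every read,
// including elements that neighbouring work-items also read.
std::size_t bytes_requested(std::size_t work_items,
                            std::size_t reads_per_thread) {
    return work_items * reads_per_thread * sizeof(float);
}
```

In this sketch neighbouring work-items read overlapping ranges, so the bytes requested can exceed the buffer size; whether the real kernel behaves this way depends on its indexing.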

jeff_golds
Staff

Re: Cross-device bandwidth for discrete GPU (HD 5870)

If you're testing in Linux, you cannot bind host memory directly to the GPU unless VM is enabled.  Currently, VM in Linux is only enabled for HD79xx GPUs.  If you're testing in Windows, then you can bind host memory directly to the GPU, but then your bandwidth figures don't add up.

cadorino
Journeyman III

Re: Cross-device bandwidth for discrete GPU (HD 5870)

Sorry, I forgot to mention I'm working on a Windows 7 machine.

jeff_golds
Staff

Re: Cross-device bandwidth for discrete GPU (HD 5870)

Then the data doesn't make sense.  GlobalMemoryBandwidth only tests memory on the GPU, so it wouldn't test PCIe bandwidth.

cadorino
Journeyman III

Re: Cross-device bandwidth for discrete GPU (HD 5870)

The memory tests are verified after completion, so I'm sure the whole set of memory accesses is performed correctly by the GPU threads.
Moreover, I get the same results by simply modifying the GlobalMemoryBandwidth sample shipped with the SDK, adding the CL_MEM_ALLOC_HOST_PTR flag to the buffers used to test the GPU read bandwidth.
I'm really confused.

jeff_golds
Staff

Re: Cross-device bandwidth for discrete GPU (HD 5870)

Is it possible to share your test case?  Do you only see this behavior with the HD5870 in the Fusion system?  I'll see if I can get someone to set up a test environment similar to yours.

cadorino
Journeyman III

Re: Cross-device bandwidth for discrete GPU (HD 5870)

I can share the Visual Studio solution of my benchmark with you. That is the fastest option but the hardest to understand quickly, since the code is heavily parameterized and obscure in places.

I'll try to set up a condensed version of the test and share it with you ASAP. It should take me half an hour.

I can see three OpenCL devices in the system:

1) HD 5870

2) AMD 6550D

3) AMD A8-3850 quad-core

and I'm testing all of them. I'm running quite a lot of benchmarks (CPU sequential and "native threads" memory access bandwidth with different allocation flags, transfer bandwidth for each OpenCL device in the system...) with two goals:

1) Comparing the cross-domain bandwidth of the integrated GPU and of the discrete one

2) Testing the performance of accessing buffers allocated with different strategies (flags, mapping, etc.), both from the host and from the GPU, in order to build an automatic scheduler for generic parallel computations on heterogeneous CPU-GPU systems

Thank you very much for your support.

cadorino
Journeyman III

Re: Cross-device bandwidth for discrete GPU (HD 5870)

Hi,

I set up a more succinct test for the problem I encountered. To avoid carrying over mistakes from the other program, I wrote this sample from scratch, with no copy and paste. I also switched to GPU timers in place of the Windows query performance counters used to compute the bandwidth in the extended test case.

Unfortunately, I get the same results (the bandwidth is now even higher, probably because GPU timers don't account for host-code overhead).

Here is the link to the source code:

Host code: http://www.gabrielecocco.it/fusion/SimpleMemoryTest.cpp

Kernel: http://www.gabrielecocco.it/fusion/memory_test.cl

I chose to put everything inside a single cpp file (instead of a VS solution or something similar), so you aren't forced to use Visual Studio or any other IDE to compile and run it.
The most relevant settings (number of reads per thread, size of the buffer, flags, etc.) are encoded as macros at the beginning of the file.
I tried to add comments that help in understanding the code; sorry for any inconvenience in reading it.

Thanks a lot for your help.

Finally, here is the output of the test (150 GB/s for the 5870, 42 GB/s for the 6550D, 14 GB/s for the CPU):

C:\Users\gabriele\Desktop\CpuGpuTesting\Release>SimpleMemoryTest.exe
- Tested devices listed below
  Cypress[GPU]
  BeaverCreek[GPU]
  AMD A8-3800 APU with Radeon(tm) HD Graphics[CPU]
- Creating opencl environment for each tested device...
  Getting platform id...             DONE!
  Searching device (Cypress)...      DONE!
  Creating context...                DONE!
  Creating command queue...          DONE!
  Loading kernel file...             DONE!
  Creating program with source...    DONE!
  Building program...                DONE!
    Creating kernel read_linear      DONE!
  Getting platform id...             DONE!
  Searching device (BeaverCreek)...  DONE!
  Creating context...                DONE!
  Creating command queue...          DONE!
  Loading kernel file...             DONE!
  Creating program with source...    DONE!
  Building program...                DONE!
    Creating kernel read_linear      DONE!
  Getting platform id...             DONE!
  Searching device (AMD A8-3800 APU with Radeon(tm) HD Graphics)...DONE!
  Creating context...                DONE!
  Creating command queue...          DONE!
  Loading kernel file...             DONE!
  Creating program with source...    DONE!
  Building program...                DONE!
    Creating kernel read_linear      DONE!
- Testing Cypress [GPU] (16777216 bytes buffer, 32 reads per thread)
Estimated bandwidth: 151460.05 MB/s (success = 1)
- Testing BeaverCreek [GPU] (16777216 bytes buffer, 32 reads per thread)
Estimated bandwidth: 42080.92 MB/s (success = 1)
- Testing AMD A8-3800 APU with Radeon(tm) HD Graphics [CPU] (16777216 bytes buffer, 32 reads per thread)
Estimated bandwidth: 14809.57 MB/s (success = 1)
- Test ended. Press a key to exit...

cadorino
Journeyman III

Re: Cross-device bandwidth for discrete GPU (HD 5870)

Have you had any time to take a look at the code?
