Archives Discussions

cadorino · ‎02-09-2012

Hi,

I'm testing a system equipped with a Fusion A8-3850 and an HD 5870 GPU. I plan to test the memory access bandwidth in the following cases:

1) The discrete GPU (HD 5870) reads from a buffer allocated in the host memory (CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY)

2) The integrated GPU (6550D) reads from a buffer allocated in the host memory (CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY)

I was assuming that the result of the first test (discrete gpu) would never be higher than the PCI-express bandwidth (approx 8GB/s), but I'm getting a bandwidth that is around 40 GB/s.
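The 8 GB/s ceiling mentioned above can be reproduced with a short calculation. This is a minimal sketch, assuming the HD 5870 sits on a PCIe 2.0 x16 link (5 GT/s per lane with 8b/10b encoding); the function name and parameters are illustrative only.

```python
# Theoretical per-direction bandwidth of a PCIe link.
# Assumption: PCIe 2.0 x16 (5 GT/s per lane, 8b/10b line encoding).

def pcie_bandwidth_gb_per_s(gigatransfers_per_s, lanes, payload_bits=8, line_bits=10):
    # Each transfer carries one line bit; only payload_bits/line_bits of them are data.
    usable_gbit_per_lane = gigatransfers_per_s * payload_bits / line_bits
    return usable_gbit_per_lane * lanes / 8  # bits -> bytes

peak = pcie_bandwidth_gb_per_s(5.0, 16)
measured = 40.0  # GB/s, the figure reported in this post

print(f"PCIe 2.0 x16 peak: {peak:.1f} GB/s per direction")
print(f"Measured / peak:   {measured / peak:.1f}x")
```

A measured figure several times above this peak suggests that not every counted byte actually crossed the bus.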

I'm checking the bandwidth by using both the GlobalMemoryTest sample shipped with the AMD SDK and a program written by myself. The results are very similar.

Can you explain whether (and why) it is possible to get a cross-domain (GPU->CPU) read bandwidth higher than the PCIe bandwidth from a discrete GPU?

Thank you very much!

cadorino · ‎02-09-2012

I forgot to mention that reads are performed linearly (each thread reads a fixed-size memory range starting from its own global index).

jeff_golds · ‎02-09-2012

If you're testing in Linux, you cannot bind host memory directly to the GPU unless VM is enabled. Currently, VM in Linux is only enabled for HD79xx GPUs. If you're testing in Windows, then you can bind host memory directly to the GPU, but then your bandwidth figures don't add up.

cadorino · ‎02-09-2012

Sorry, I forgot to mention I'm working on a Windows 7 machine...

jeff_golds · ‎02-09-2012

In that case, the data doesn't make sense. GlobalMemoryTest only tests memory on the GPU, so it wouldn't test PCIe bandwidth.

cadorino · ‎02-09-2012

The memory tests are verified after completion, so I'm sure the whole set of memory accesses is correctly performed by the GPU threads.
Moreover, I'm getting the same results by simply modifying the GlobalMemoryBandwidth test sample shipped with the SDK, adding the flag CL_MEM_ALLOC_HOST_PTR to the buffers used to test the GPU read bandwidth.
I'm really confused.

jeff_golds · ‎02-09-2012

Is it possible to share your test case? Do you only see this behavior with the HD5870 in the Fusion system? I'll see if I can get someone to set up a test environment similar to yours.

cadorino · ‎02-09-2012

I can share with you the Visual Studio solution of my benchmark. This is the fastest option but the least suitable for quick understanding, since the code is highly parametric and obscure in places.

I'll try to set up a condensed version of the test and share it with you ASAP. It should take me half an hour.

I can see three OpenCL devices in the system:

1) HD 5870

2) AMD 6550D

3) AMD A8-3850 quad-core

and I'm testing all of them. I'm running quite a lot of benchmarks (CPU sequential and "native threads" memory access bandwidth with different allocation flags, transfer bandwidth for each OpenCL device in the system...) with the goals of:

1) Comparing the cross-domain bandwidth of the integrated GPU and of the discrete one

2) Testing the performance of accessing buffers allocated using different strategies (flags, mapping, etc.) both by the host and by the GPU to build up an automatic scheduler suitable for generic parallel computations on heterogeneous CPU-GPU systems

Thank you very much for your support.

cadorino · ‎02-09-2012

Hi,

I set up a more succinct test for the problem I encountered. To avoid carrying over mistakes from the other program, I wrote this sample from scratch with no cut and paste. Moreover, I chose to use GPU timers in place of the Windows query performance counters used to compute the bandwidth in the extended test case.

Unfortunately, I get the same results (now the bandwidth is higher, probably because the GPU timers do not account for any host-code overhead).

Here are the links to the source code:

Host code: http://www.gabrielecocco.it/fusion/SimpleMemoryTest.cpp

Kernel: http://www.gabrielecocco.it/fusion/memory_test.cl

I chose to put everything inside a single cpp file (in place of a VS solution or something like that), so you aren't forced to use Visual Studio or any other IDE to compile and run it.
The most relevant settings (number of reads per thread, size of the buffer, flags, etc.) are encoded as macros at the beginning of the file.
I tried to add relevant comments to make the code understandable; sorry for any inconvenience in reading it.

Thanks a million for your help.

Finally, here is the output of the test (150GB/s for the 5870, 42 GB/s for the 6550D, 14GB/s for the CPU)

C:\Users\gabriele\Desktop\CpuGpuTesting\Release>SimpleMemoryTest.exe

- Tested devices listed below

Cypress[GPU]

BeaverCreek[GPU]

AMD A8-3800 APU with Radeon(tm) HD Graphics[CPU]

- Creating opencl environment for each tested device...

Getting platform id... DONE!

Searching device (Cypress)... DONE!

Creating context... DONE!

Creating command queue... DONE!

Loading kernel file... DONE!

Creating program with source... DONE!

Building program... DONE!

Creating kernel read_linear DONE!

Getting platform id... DONE!

Searching device (BeaverCreek)... DONE!

Creating context... DONE!

Creating command queue... DONE!

Loading kernel file... DONE!

Creating program with source... DONE!

Building program... DONE!

Creating kernel read_linear DONE!

Getting platform id... DONE!

Searching device (AMD A8-3800 APU with Radeon(tm) HD Graphics)...DONE!

Creating context... DONE!

Creating command queue... DONE!

Loading kernel file... DONE!

Creating program with source... DONE!

Building program... DONE!

Creating kernel read_linear DONE!

- Testing Cypress [GPU] (16777216 bytes buffer, 32 reads per thread)

Estimated bandwidth: 151460.05 MB/s (success = 1)

- Testing BeaverCreek [GPU] (16777216 bytes buffer, 32 reads per thread)

Estimated bandwidth: 42080.92 MB/s (success = 1)

- Testing AMD A8-3800 APU with Radeon(tm) HD Graphics [CPU] (16777216 bytes buffer, 32 reads per thread)

Estimated bandwidth: 14809.57 MB/s (success = 1)

- Test ended. Press a key to exit...

cadorino · ‎02-10-2012

Did you have any time to take a look at the code?

cadorino · ‎02-12-2012

I've spent some time testing the same memory bandwidth program with the same GPU (HD 5870) on a different board and processor (Intel i7). Unfortunately, I get the same bandwidth results.

cadorino · ‎02-25-2012

Any news? Maybe the high bandwidth of the discrete card is due to some caching inside the device or in the command queue. In this case, how can I avoid it?

siu · ‎02-29-2012

The access pattern (neighboring work items reading from overlapping memory regions) indicates that most of the reads are probably hitting the data cache. That could explain the high bandwidth: the test isn't actually measuring the PCIe bandwidth.

Have you looked at the BufferBandwidth sample in the SDK? If you run that sample program with the input buffer set to ALLOC_HOST_PTR | READ_ONLY, the bandwidth of "GPU kernel read" is probably similar to what you are trying to achieve with your test program.
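The effect of overlapping reads on the reported figure can be estimated with a little arithmetic. This is a rough sketch, assuming each work item reads `reads_per_item` consecutive elements starting at its own global index and that the benchmark counts every requested read as transferred data; neither assumption is confirmed by the source shown in this thread, and the work-item count used below is hypothetical.

```python
# How much data is requested versus how much is distinct when work item gid
# reads elements gid .. gid + reads_per_item - 1 (overlapping ranges).

def overlap_stats(num_items, reads_per_item):
    requested = num_items * reads_per_item      # reads issued by all work items
    unique = num_items + reads_per_item - 1     # distinct elements actually touched
    return requested, unique, requested / unique

# Hypothetical launch: one work item per 4-byte element of a 16777216-byte buffer.
requested, unique, reuse = overlap_stats(num_items=16_777_216 // 4, reads_per_item=32)
print(f"requested reads: {requested}, unique elements: {unique}, reuse: {reuse:.2f}x")

# If every repeated read hit the cache, the bus traffic needed to produce the
# reported 151460.05 MB/s would only be about this much:
print(f"implied unique traffic: {151460.05 / reuse:.0f} MB/s")
```

Under these assumptions each element is requested roughly 32 times, so the counted bandwidth can exceed the bus bandwidth by about that factor.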

cadorino · ‎03-05-2012

Hi. The problem is that I get very similar results when accessing the host or device memory "randomly". Unfortunately, "randomly" here means using a static offset, so a smart compiler could optimize prefetching in this case too. Is there any trick to measure the real bandwidth of transferring data from the host to the HD 5870?
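One way to rule out reuse between work items is an access pattern in which every element is read exactly once. This is a minimal sketch of such a pattern, simulated on the host; it is not the pattern used in the linked test, and it does not by itself guarantee that the runtime keeps the buffer in host memory.

```python
# Strided, non-overlapping pattern: work item gid reads elements
# gid, gid + global_size, gid + 2 * global_size, ...
# In an OpenCL kernel this corresponds to indexing with
# get_global_id(0) + k * get_global_size(0).

def strided_indices(gid, global_size, reads_per_item):
    return [gid + k * global_size for k in range(reads_per_item)]

def all_indices(global_size, reads_per_item):
    touched = []
    for gid in range(global_size):
        touched.extend(strided_indices(gid, global_size, reads_per_item))
    return touched

indices = all_indices(global_size=8, reads_per_item=4)
# Every element of the 32-element buffer is read once and only once.
print(sorted(indices) == list(range(32)))
```

With this layout the bytes requested equal the bytes in the buffer, so the computed bandwidth is not inflated by repeated reads of the same data.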

siu · ‎03-05-2012

To measure the PCIe bandwidth, you can simply time clEnqueueReadBuffer and clEnqueueWriteBuffer.

Try running the BufferBandwidth sample with the -pcie flag and it will show the PCIe bandwidth for each direction. If you refer to the source code, you'll see that it's actually timing the Read/Write buffer.
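Timing a transfer comes down to dividing the bytes moved by the elapsed time. This is a minimal sketch, assuming the start and end timestamps come from OpenCL event profiling (CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END, which are reported in nanoseconds); the timestamp values below are hypothetical, and MB is taken as 10^6 bytes.

```python
# Convert a transfer size and a pair of nanosecond timestamps into MB/s.

def bandwidth_mb_per_s(num_bytes, start_ns, end_ns):
    elapsed_s = (end_ns - start_ns) * 1e-9
    if elapsed_s <= 0:
        raise ValueError("end timestamp must be later than start timestamp")
    return num_bytes / elapsed_s / 1e6  # MB = 10^6 bytes (assumption)

# Hypothetical example: a 16777216-byte buffer transferred in 2.5 ms.
print(f"{bandwidth_mb_per_s(16_777_216, 0, 2_500_000):.2f} MB/s")
```

Averaging over several transfers and discarding the first one helps keep one-time setup costs out of the figure.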

jeff_golds · ‎03-05-2012

If you use clEnqueueReadBuffer and clEnqueueWriteBuffer, you will pay the price for pinning on each transfer unless you use the prepinned path as documented in the APP SDK documentation. This is also demonstrated in the BufferBandwidth sample code.

cadorino · ‎03-05-2012

Hey, thank you for the answers!

I tested the BufferBandwidth sample.

With the arguments below, I obtain 28 GB/s and 17 GB/s for the HD 5870 and the integrated GPU, respectively. The bandwidth of the discrete card is lower than the bandwidth of the integrated GPU only for very few GPU wavefronts (nw).

C:\Users\gabriele\Downloads\BufferBandwidth\BufferBandwidth\samples\opencl\bin\x86> .\BufferBandwidth.exe -d 0 -if 5 -of 5 -nwk 1 -nr 5 -nl 5 -nw 8192

Probably I'm wrong in thinking that it should be straightforward to verify the higher bandwidth of an integrated GPU...

Cross-device bandwidth for discrete GPU (HD 5870)