16 Replies Latest reply on Mar 5, 2012 7:33 PM by cadorino

    Cross-device bandwidth for discrete GPU (HD 5870)

    cadorino

      Hi,

      I'm testing a system equipped with a Fusion A8-3850 and an HD 5870 gpu. I was planning to test the memory access bandwidth in the following cases:

       

      1) The discrete GPU (HD 5870) reads from a buffer allocated in the host memory (CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY)

      2) The integrated GPU (6550D) reads from a buffer allocated in the host memory (CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY)

       

      I assumed the result of the first test (discrete GPU) could never exceed the PCI Express bandwidth (approx. 8 GB/s), but I'm measuring around 40 GB/s.

      I'm checking the bandwidth using both the GlobalMemoryTest sample shipped with the AMD SDK and a program I wrote myself. The results are very similar.

       

      Can you explain whether (and why) it is possible for a discrete GPU to get a cross-domain (GPU->CPU) read bandwidth higher than the PCIe bandwidth?

       

      Thank you very much!

        • Re: Cross-device bandwidth for discrete GPU (HD 5870)
          cadorino

          I forgot to mention that reads are performed linearly (each thread reads a fixed-size memory range starting from its own global index).
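                  To make the pattern concrete, here is a rough C model of it (the function name and the overlap check are illustrative, not taken from my actual kernel):

                  ```c
                  #include <stddef.h>

                  /* Model of the access pattern: work-item `gid` reads
                     `reads_per_thread` consecutive elements starting at its own
                     global index, so neighboring work-items overlap on all but
                     one element. This counts how many work-items touch a given
                     element across the whole dispatch. */
                  size_t times_element_read(size_t elem, size_t num_threads,
                                            size_t reads_per_thread)
                  {
                      size_t count = 0;
                      for (size_t gid = 0; gid < num_threads; ++gid)
                          if (elem >= gid && elem < gid + reads_per_thread)
                              ++count;
                      return count;
                  }
                  ```

                  With 32 reads per thread, an element well inside the buffer gets touched 32 times by different work-items.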

            • Re: Cross-device bandwidth for discrete GPU (HD 5870)
              jeff_golds

              If you're testing in Linux, you cannot bind host memory directly to the GPU unless VM is enabled.  Currently, VM in Linux is only enabled for HD79xx GPUs.  If you're testing in Windows, then you can bind host memory directly to the GPU, but then your bandwidth figures don't add up.

                • Re: Cross-device bandwidth for discrete GPU (HD 5870)
                  cadorino

                  Sorry, I forgot to mention I'm working on a Windows 7 machine...

                  • Re: Cross-device bandwidth for discrete GPU (HD 5870)
                    cadorino

                    Memory test results are verified after completion, so I'm sure all the memory accesses are correctly performed by the GPU threads.
                    Moreover, I get the same results by simply modifying the GlobalMemoryBandwidth sample shipped with the SDK, adding the CL_MEM_ALLOC_HOST_PTR flag to the buffers used to test the GPU read bandwidth.
                    I'm really confused.

                      • Re: Cross-device bandwidth for discrete GPU (HD 5870)
                        jeff_golds

                        Is it possible to share your test case?  Do you only see this behavior with the HD5870 in the Fusion system?  I'll see if I can get someone to set up a test environment similar to yours.

                          • Re: Cross-device bandwidth for discrete GPU (HD 5870)
                            cadorino

                            I can share with you the Visual Studio solution of my benchmark. That would be the fastest option, but the least suitable for quick understanding, since the code is highly parametric and obscure in places.

                            I'll set up a synthetic version of the test and share it with you ASAP. It should take me about half an hour.

                             

                            I can see three opencl devices in the system:

                             

                            1) HD 5870

                            2) AMD 6550D

                            3) AMD A8-3850 quad-core

                             

                            and I'm testing all of them. I'm running quite a lot of benchmarks (CPU sequential and "native threads" memory access bandwidth with different allocation flags, transfer bandwidth for each OpenCL device in the system...) with two goals:

                             

                            1) Comparing the cross-domain bandwidth of the integrated GPU and of the discrete one

                            2) Testing the performance of accessing buffers allocated with different strategies (flags, mapping, etc.) from both the host and the GPU, in order to build an automatic scheduler for generic parallel computations on heterogeneous CPU-GPU systems

                             

                            Thank you very much for your support.

                            • Re: Cross-device bandwidth for discrete GPU (HD 5870)
                              cadorino

                              Hi,

                              I set up a more succinct test for the problem I encountered. To avoid dragging mistakes in from the other program, I wrote this sample from scratch, with no cut and paste. Moreover, I switched to GPU timers in place of the Windows QueryPerformanceCounter calls used to compute the bandwidth in the extended test case.

                              Unfortunately, I get the same results (now the bandwidth is even higher, probably because GPU timers don't account for any host-code overhead).
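                              For reference, the figure I report is just bytes over the profiled interval. A minimal sketch of the arithmetic (assuming the standard nanosecond CL_PROFILING_COMMAND_START/END event timestamps; the function name is mine):

                              ```c
                              #include <stddef.h>

                              /* Bandwidth from OpenCL event-profiling timestamps, which are
                                 reported in nanoseconds (CL_PROFILING_COMMAND_START /
                                 CL_PROFILING_COMMAND_END). MB here means 2^20 bytes, as in
                                 the SDK samples. */
                              double profiled_bandwidth_mb_s(size_t bytes,
                                                             unsigned long long start_ns,
                                                             unsigned long long end_ns)
                              {
                                  double seconds = (double)(end_ns - start_ns) * 1e-9;
                                  return (double)bytes / (1024.0 * 1024.0) / seconds;
                              }
                              ```

                              For example, 16 MB read in 1 ms comes out as 16000 MB/s.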

                               

                              Here is the link to the source code:

                              Host code: http://www.gabrielecocco.it/fusion/SimpleMemoryTest.cpp

                              Kernel: http://www.gabrielecocco.it/fusion/memory_test.cl

                               

                              I chose to put everything inside a single .cpp file (in place of a VS solution or the like), so you aren't forced to use Visual Studio or any other IDE to compile and run it.
                              The most relevant configuration options (number of reads per thread, size of the buffer, flags, etc.) are encoded as macros at the beginning of the file.
                              I tried to add comments where helpful; sorry for any difficulty in reading the code.

                               

                              Thank you very much for your help.

                               

                              Finally, here is the output of the test (about 150 GB/s for the 5870, 42 GB/s for the 6550D, and 14 GB/s for the CPU):

                              C:\Users\gabriele\Desktop\CpuGpuTesting\Release>SimpleMemoryTest.exe

                              - Tested devices listed below

                                Cypress[GPU]

                                BeaverCreek[GPU]

                                AMD A8-3800 APU with Radeon(tm) HD Graphics[CPU]

                               

                              - Creating opencl environment for each tested device...

                                Getting platform id...             DONE!

                                Searching device (Cypress)...      DONE!

                                Creating context...                DONE!

                                Creating command queue...          DONE!

                                Loading kernel file...             DONE!

                                Creating program with source...    DONE!

                                Building program...                DONE!

                                  Creating kernel read_linear      DONE!

                               

                                Getting platform id...             DONE!

                                Searching device (BeaverCreek)...  DONE!

                                Creating context...                DONE!

                                Creating command queue...          DONE!

                                Loading kernel file...             DONE!

                                Creating program with source...    DONE!

                                Building program...                DONE!

                                  Creating kernel read_linear      DONE!

                               

                                Getting platform id...             DONE!

                                Searching device (AMD A8-3800 APU with Radeon(tm) HD Graphics)...DONE!

                                Creating context...                DONE!

                                Creating command queue...          DONE!

                                Loading kernel file...             DONE!

                                Creating program with source...    DONE!

                                Building program...                DONE!

                                  Creating kernel read_linear      DONE!

                               

                              - Testing Cypress [GPU] (16777216 bytes buffer, 32 reads per thread)

                              Estimated bandwidth: 151460.05 MB/s (success = 1)

                               

                              - Testing BeaverCreek [GPU] (16777216 bytes buffer, 32 reads per thread)

                              Estimated bandwidth: 42080.92 MB/s (success = 1)

                               

                              - Testing AMD A8-3800 APU with Radeon(tm) HD Graphics [CPU] (16777216 bytes buffer, 32 reads per thread)

                              Estimated bandwidth: 14809.57 MB/s (success = 1)

                               

                              - Test ended. Press a key to exit...

                              • Re: Cross-device bandwidth for discrete GPU (HD 5870)
                                cadorino

                                I've spent some time testing the same memory bandwidth program with the same GPU (HD 5870) on a different board and processor (Intel i7). Unfortunately, I get the same bandwidth results.

                        • Re: Cross-device bandwidth for discrete GPU (HD 5870)
                          cadorino

                          Any news? Maybe the high bandwidth of the discrete card is due to some caching inside the device or in the command queue. If so, how can I avoid it?

                            • Re: Cross-device bandwidth for discrete GPU (HD 5870)
                              siu

                              The access pattern (neighboring work-items reading from overlapping memory regions) suggests that most of the reads are hitting the data cache. That would explain the high figure: the test isn't actually measuring PCIe bandwidth.
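                              If that is what's happening, the reported number overstates the unique traffic by roughly the reuse factor. A back-of-the-envelope correction (hypothetical helper, assuming every repeated read is served from cache):

                              ```c
                              /* If each byte is counted once per read but repeats are mostly
                                 served from cache, dividing the reported figure by the reuse
                                 factor approximates the unique (off-chip / PCIe) traffic
                                 actually moved. */
                              double unique_traffic_mb_s(double reported_mb_s,
                                                         double reuse_factor)
                              {
                                  return reported_mb_s / reuse_factor;
                              }
                              ```

                              With the 151460 MB/s figure reported earlier and 32 reads per thread, that's roughly 4700 MB/s of unique traffic, which would sit comfortably under the ~8 GB/s PCIe limit.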

                               

                              Have you looked at the BufferBandwidth sample in the SDK?  If you run that sample with the input buffer set to ALLOC_HOST_PTR | READ_ONLY, the bandwidth of "GPU kernel read" is probably similar to what you are trying to achieve with your test program.

                                • Re: Cross-device bandwidth for discrete GPU (HD 5870)
                                  cadorino

                                  Hi. The problem is that I get very similar results when accessing host or device memory "randomly". Unfortunately, "randomly" here means using a static offset, so a smart compiler could optimize prefetching in this case too. Is there any trick to measure the real bandwidth of transferring data from the host to the HD 5870?

                                    • Re: Cross-device bandwidth for discrete GPU (HD 5870)
                                      siu

                                      To measure the PCIe bandwidth, you can simply time clEnqueueReadBuffer and clEnqueueWriteBuffer. 

                                      Try running the BufferBandwidth sample with the -pcie flag and it will show the PCIe bandwidth for each direction.  If you refer to the source code, you'll see that it's actually timing the Read/Write buffer.

                                        • Re: Cross-device bandwidth for discrete GPU (HD 5870)
                                          jeff_golds

                                          If you use clEnqueueReadBuffer and clEnqueueWriteBuffer, you will pay the price for pinning on each transfer unless you use the prepinned path as documented in the APP SDK documentation.  This is also demonstrated in the BufferBandwidth sample code.
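                                          A rough model of why that matters (illustrative numbers, not measurements): if each transfer pays a fixed pinning cost before data moves at the raw link rate, the effective rate only approaches the raw PCIe rate for large buffers.

                                          ```c
                                          /* Effective transfer rate when each clEnqueueRead/WriteBuffer
                                             call pays a fixed pinning overhead before moving data at
                                             the raw link rate. Rates in MB/s (MB = 2^20 bytes),
                                             overhead in seconds. */
                                          double effective_rate_mb_s(double megabytes,
                                                                     double raw_rate_mb_s,
                                                                     double pin_overhead_s)
                                          {
                                              double transfer_s = megabytes / raw_rate_mb_s;
                                              return megabytes / (pin_overhead_s + transfer_s);
                                          }
                                          ```

                                          With, say, a 0.5 ms pinning cost and an 8000 MB/s link, a 1 MB transfer achieves only 1600 MB/s, while a 256 MB transfer gets close to the full rate; the prepinned path avoids that per-call cost.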

                                            • Re: Cross-device bandwidth for discrete GPU (HD 5870)
                                              cadorino

                                              Hey, thank you for the answers!

                                              I tested the BufferBandwidth sample.

                                              With the arguments below I obtain 28 GB/s and 17 GB/s respectively for the HD 5870 and the integrated GPU. The bandwidth of the discrete card is lower than that of the integrated GPU only with very few GPU wavefronts (-nw).

                                               

                                              C:\Users\gabriele\Downloads\BufferBandwidth\BufferBandwidth\samples\opencl\bin\x86> .\BufferBandwidth.exe -d 0 -if 5

                                              -of 5 -nwk 1 -nr 5 -nl 5 -nw 8192

                                               

                                              Probably I was wrong to think it would be straightforward to verify the higher bandwidth of an integrated GPU...