15 Replies Latest reply on May 17, 2013 2:04 AM by himanshu.gautam

    Max Read Bandwidth of Trinity APU

    dertomas

      Hi,

       

      I have got a bandwidth question.

       

I have an A10-5800K Trinity APU and I use OpenCL on the integrated Radeon HD 7660D GPU.

I also use an Asus F2A85-M Pro motherboard and 4x 8 GB 1866 MHz RAM. The OS is Windows 7.

       

      I used the AMD BufferBandwidth OpenCL example and ran it on my GPU.

It reported about 25 GB/s for reading. All fine.

       

Then I changed the BufferBandwidth code to use maxThreads (the maximum thread count for the input data) instead of nThreads (a smaller, calculated number). I also increased the input data from 32 MB to 128 MB.

       

Now I get 70 to 80 GB/s for reading. That seems too high.

       

My first theoretical calculations were:

       

1.866 GT/s (memory transfer rate) * 256 bit (Radeon Memory Bus) * 2 (dual channel) ≈ 119 GB/s (111 GiB/s)

       

With that, 70 to 80 GB/s sounds plausible. But normal memory only has 64-bit-wide buses; with dual channel that is a 128-bit bus in total, i.e. only about 30 GB/s.

       

Can anyone clarify the bus widths used here? Do you really get 80 GB/s on an integrated GPU, or is there a bug in the AMD OpenCL BufferBandwidth test? Are there any resources for further reading?

       

      Thanks a lot!

        • Re: Max Read Bandwidth of Trinity APU
          himanshu.gautam

Can you post the BufferBandwidth output here?

            • Re: Max Read Bandwidth of Trinity APU
              dertomas

              >BufferBandwidth.exe -nb 134217728

              Platform found : Advanced Micro Devices, Inc.

               

               

              Device  0            Devastator

              Build:               _WINxx release

              GPU work items:      8388608

              Buffer size:         134217728

              CPU workers:         1

              Timing loops:        20

              Repeats:             1

              Kernel loops:        20

              inputBuffer:         CL_MEM_READ_ONLY

              outputBuffer:        CL_MEM_WRITE_ONLY

               

              Host baseline (naive):

               

              Timer resolution     269.93  ns

              Page fault           856.44  ns

              Barrier speed        139.41  ns

               

              CPU read             4.58 GB/s

              memcpy()             4.93 GB/s

              memset(,1,)          7.62 GB/s

              memset(,0,)          7.59 GB/s

               

               

              AVERAGES (over loops 2 - 19, use -l for complete log)

              --------

               

              1. Host mapped write to inputBuffer

               

                    clEnqueueMapBuffer(WRITE):  0.019561 s [     6.86 GB/s ]

                                     memset():  0.017087 s       7.85 GB/s

                    clEnqueueUnmapMemObject():  0.020841 s [     6.44 GB/s ]

               

              2. GPU kernel read of inputBuffer

               

                     clEnqueueNDRangeKernel():  0.038843 s      69.11 GB/s

                               verification ok

               

              3. GPU kernel write to outputBuffer

               

                     clEnqueueNDRangeKernel():  0.085660 s      31.34 GB/s

               

              4. Host mapped read of outputBuffer

               

                     clEnqueueMapBuffer(READ):  0.018308 s [     7.33 GB/s ]

                                     CPU read:  0.029805 s       4.50 GB/s

                               verification ok

                    clEnqueueUnmapMemObject():  0.000079 s [  1702.97 GB/s ]

               

               

              Passed!

                • Re: Max Read Bandwidth of Trinity APU
                  himanshu.gautam

Just a few cents of knowledge sharing:

As far as CPU main-memory bandwidth goes, the theoretical peak cannot be reached by a single thread.

Multiple threads may be needed, depending on how many memory channels there are.

If you have two memory channels, you need to load both of them; you also need at least two DIMMs (one per channel) to actually realize that bandwidth.

                   

On APUs, the system memory is shared between the CPU and the GPU. You can find some good explanations in the slides linked below.

Check this out: http://amddevcentral.com/afds/assets/presentations/1004_final.pdf


As for the inflated numbers: you say you have changed the BufferBandwidth code. Have you made sure you changed all the relevant places? Can you post the modified sample?

                    • Re: Max Read Bandwidth of Trinity APU
                      dertomas

                      Hi,

I tested how much I need to change to see these high bandwidths. It turns out I need only one additional line.

                      I added this line:

                      nThreads = maxThreads;

                       

                      in BufferBandwidth.cpp at line 1283. That's it. When I run the program with the standard configuration (no arguments), I get around 70 GB/s read bandwidth.

                       

Please find the code in the attachment. Again: I tested it on Windows 7 with the A10-5800K APU.

                        • Re: Max Read Bandwidth of Trinity APU
                          himanshu.gautam

                          Thanks for posting. Will look into it.

                          • Re: Max Read Bandwidth of Trinity APU
                            himanshu.gautam

I do not see any drastic change in results using your modified BufferBandwidth.cpp file. I hope you sent the other files just for completeness and did not modify them. Note, though, that I ran this test on a DGPU.

Can you share your BufferBandwidth sample output from when you see the 70 GB/s speed? Please also mention the command-line options you specified.

                             

As per http://amddevcentral.com/afds/assets/presentations/1004_final.pdf, the transfer bandwidths should not be more than 13 GB/s, although this might have improved to some extent for Trinity.

                              • Re: Max Read Bandwidth of Trinity APU
                                dertomas

Yes, I sent the other files only for completeness. The only change I made was the one line, as written before. I thought it would be good to include all the files so that there are no problems from differing code across SDK versions.

                                 

                                In my post earlier (22/04/2013), I already posted the BufferBandwidth output:

                                2. GPU kernel read of inputBuffer

                                     clEnqueueNDRangeKernel():  0.038843 s      69.11 GB/s

                                 

                                The command line options for the output were just: -nb 134217728

                                However, I see similar effects with the default options.

                                 

Do you mean a discrete GPU by DGPU? I think my question is quite specific to a Trinity APU, or rather to the integrated Radeon HD 7660D GPU (Devastator). Would it be possible for you to test on a Trinity APU to check whether you see the same effects?

                                 

                                Thanks a lot.

                                  • Re: Max Read Bandwidth of Trinity APU
                                    himanshu.gautam

Yes, I meant a discrete GPU by DGPU.

                                     

                                    I just tested it on the first test machine I could get my hands on (for a quick repro).

                                    I do understand that you are looking at an APU.

I will try it and let you know soon. Thanks for your patience.

                                      • Re: Max Read Bandwidth of Trinity APU
                                        dertomas

BTW: I just tested it with Ubuntu Linux (on the same APU) and I also see 69 GB/s read bandwidth.

                                          • Re: Max Read Bandwidth of Trinity APU
                                            dertomas

                                            While we wait:

Is it at least theoretically feasible to reach such high bandwidths with integrated GPUs (with the Radeon Memory Bus and dual-channel 1866 MHz RAM), or is it far from anything possible (so we should be trying to find a bug in the implementation)?

                                              • Re: Max Read Bandwidth of Trinity APU
                                                himanshu.gautam

                                                My personal answer:

I don't think you can reach such high bandwidth on APUs, as long as the APU's memory is carved out of system memory (i.e., RAM). That's what the attached PDF also says.

                                                 

Maybe this will change in the future; I have no idea. But chances are that what you are seeing is incorrect reporting from the sample.

I missed this thread for a bit. I will run this on an APU and see why this is happening.

                                                  • Re: Max Read Bandwidth of Trinity APU
                                                    himanshu.gautam

Just profile the two kernels using CodeXL: in the default case nThreads = 8192, and I get a 0% cache hit rate. When I set nThreads = maxThreads (2197152), I see a 94% cache hit rate, which explains the high apparent read bandwidth. Sorry for the late reply here.

You can also run the GlobalMemoryBandwidth sample to check the uncached bandwidth of your device.

                                                      • Re: Max Read Bandwidth of Trinity APU
                                                        dertomas

Thanks a lot, caching makes perfect sense.
I didn't realize that the GPU caches automatically; I always thought the programmer had to do that explicitly.

                                                        I ran the GlobalMemoryBandwidth sample and got these results:

                                                        Device 0 : Devastator Device ID is 0000000002FB0090

                                                        1. Global Memory Read (single) = 303.891 GB/s
                                                        2. Global Memory Read (linear) = 281.221 GB/s
                                                        3. Global Memory Read (linear, uncached) =  16.4991 GB/s
                                                        4. Global Memory Write (linear) =  41.3259 GB/s
                                                        5. Global Memory Read (random) =  27.6856 GB/s
                                                        6. Global Memory Read (unCombine_unCache) =  19.2543 GB/s

So I guess the 3rd and 4th test cases should be similar to the BufferBandwidth test with a 0% cache hit rate (the unmodified BufferBandwidth)?
For the original BufferBandwidth test I get 24.62 GB/s for reading and 21.10 GB/s for writing. That is quite different from cases 3 and 4.
Is this caused by different memory access patterns and different float/int computing requirements, or am I missing something? (All tests show a 0% cache hit rate in CodeXL.)

                                                        I have got the full results of GlobalMemoryBandwidth.exe and BufferBandwidth.exe attached.

                                                          • Re: Max Read Bandwidth of Trinity APU
                                                            himanshu.gautam

                                                            Thanks for the tests.

                                                             

I am puzzled that you say no cache hits are listed in CodeXL and you still get ~300 GB/s for certain tests.

I think CodeXL is reporting only L2 hits; the accesses must still be hitting the L1.

That is understandable, as not all GlobalMemoryBandwidth tests measure the raw bandwidth; some of them explicitly test the cache.

The linear-uncached test gives you the correct numbers: it measures the raw bandwidth between the compute device and global memory.

For more details, check the kernels used by the GlobalMemoryBandwidth tests.

 

I am open to a discussion of the individual kernels used in the GlobalMemoryBandwidth tests here.

                                                            • Re: Max Read Bandwidth of Trinity APU
                                                              himanshu.gautam

The kernels used in the two samples are quite different and were written by different people at quite different times. I will raise a bug against these samples, as both should indeed return very similar values for uncached reads/writes.