8 Replies Latest reply on Apr 27, 2012 5:36 PM by ibird

    Global Memory BandWidth

    ibird

      I have a memory-bound kernel that reaches about 70 GB/s on the ATI 5870, very far from the 153 GB/s peak. Every optimization I know of regarding coalescing, occupancy and workgroup size has been done. (On a C2050 I reach ~120-130 GB/s.) The kernel uses 57 vGPRs, so 256 active threads, i.e. 4 wavefronts (enough to hide latency?).

       

      On the other hand, I have tested my 5870 with the AMD benchmark GlobalMemoryBandwidth, where I get 77 GB/s for uncached reads. Reading the benchmark kernel for uncached (but coalesced) reads, I can say that its read method is similar to the one in my kernel.

      As far as I know, the uncached read test should measure the bandwidth of the GDDR5, so I should get something closer to 153 GB/s. So the question is: can someone with a 5870 confirm this number, or is there something I am missing?
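For readers following the numbers, the 153 GB/s peak and the "4 wavefronts" figure can be reproduced with a little arithmetic. This is only a sketch under assumptions that are not stated in the thread: the HD 5870's published memory configuration (256-bit bus, GDDR5 at 4.8 Gbit/s per pin), 64 work-items per wavefront, and the rule of thumb that a SIMD can hold roughly 256 / VGPRs wavefronts. The function names are illustrative.

```python
def peak_bandwidth_gb_s(bus_width_bits, gbit_per_pin):
    """Theoretical peak in GB/s: bytes per transfer times transfers per second."""
    return bus_width_bits / 8 * gbit_per_pin


def wavefronts_per_simd(vgprs, register_budget=256):
    """Assumed rule of thumb: wavefronts limited by vector register usage only."""
    return register_budget // vgprs


# Assumed HD 5870 figures (not taken from the thread): 256-bit bus, 4.8 Gbit/s per pin.
peak = peak_bandwidth_gb_s(256, 4.8)

# 57 VGPRs is the value reported in the post; 64 work-items per wavefront is assumed.
waves = wavefronts_per_simd(57)
active_threads = waves * 64

# Fraction of the theoretical peak reached by the 77 GB/s uncached read result.
fraction_of_peak = 77.0 / peak

print(f"peak = {peak:.1f} GB/s, wavefronts = {waves}, "
      f"threads = {active_threads}, fraction = {fraction_of_peak:.2f}")
```

Under these assumptions the measured uncached read rate is about half of the theoretical peak, which is the gap the question is about.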

        • Re: Global Memory BandWidth
          ibird

          LDS usage is only 3 KB and the workgroup size is 128.

          • Re: Global Memory BandWidth
            gautam.himanshu

            It would be good if you could share your approach. I guess all the channels must be busy to reach 153 GB/s. Profiling may also help: check whether you are getting any channel or bank conflicts.

              • Re: Global Memory BandWidth
                ibird

                Sure

                 

                Accesses are of type float2 or float4, which means that each wavefront accesses all adjacent bytes of 2 or 4 (different) channels.

                (The attached code is not simple to understand and is hard to isolate because of its structure, so you need to have faith: accesses are adjacent because of the geometric properties of the tables and the storage used.)
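The "2 or 4 channels" statement can be checked with a small model of the address-to-channel mapping. The sketch below assumes 64 work-items per wavefront and 8 memory channels interleaved every 256 bytes (channel chosen by address bits 10:8), which is how AMD's programming guide describes the HD 5870. These constants and the helper name are assumptions, not something given in this thread.

```python
WAVEFRONT_SIZE = 64             # assumed work-items per wavefront
CHANNEL_INTERLEAVE_BYTES = 256  # assumed channel interleave granularity
NUM_CHANNELS = 8                # assumed number of memory channels


def channels_touched(base_address, bytes_per_work_item):
    """Channels hit when consecutive work-items read consecutive elements."""
    channels = set()
    for lane in range(WAVEFRONT_SIZE):
        first = base_address + lane * bytes_per_work_item
        last = first + bytes_per_work_item - 1
        # An element is far smaller than the interleave, so checking its
        # first and last byte covers every channel it can fall into.
        for address in (first, last):
            channels.add((address // CHANNEL_INTERLEAVE_BYTES) % NUM_CHANNELS)
    return channels


float2_channels = channels_touched(0, 8)    # float2 = 8 bytes per work-item
float4_channels = channels_touched(0, 16)   # float4 = 16 bytes per work-item

print(sorted(float2_channels), sorted(float4_channels))
```

In this model a single wavefront covers only part of the 8 channels, so keeping every channel busy would depend on several wavefronts in flight landing on different channels at the same time.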

                 

                On the other hand, from other observations the accesses can be assumed to be uncached and linear from the wavefront's point of view, just like the linear uncached test in the AMD GlobalMemoryBandwidth sample code.

                 

                Now, the AMD GlobalMemoryBandwidth linear uncached test (DATATYPE = float4) achieves only 77 GB/s (similar to my code), which does not sound right to me.

                I am expecting 130 GB/s or something similar.

                 

                If I understand why this test performs so poorly, I can fix the problem in my code.

                So the question is:

                Why does the AMD GlobalMemoryBandwidth linear uncached test on the AMD 5870 not reach 130 GB/s?

                  • Re: Global Memory BandWidth
                    jeff_golds

                    My own tests do much better.  On HD5870, I have hit 143 GB/s uncached read speed.  Even on HD6870, I can hit 121 GB/s out of a peak of 134 GB/s.

                     

                    What does the profiler tell you?  Maybe you are getting bank and/or channel collisions.

                      • Re: Global Memory BandWidth
                        ibird

                        Global Memory Read
                        AccessType      : single
                        VectorElements  : 4
                        Bandwidth       : 1061.48 GB/s

                        Global Memory Read
                        AccessType      : linear
                        VectorElements  : 4
                        Bandwidth       : 618.043 GB/s

                        Global Memory Read
                        AccessType      : linear(uncached)
                        VectorElements  : 4
                        Bandwidth       : 77.0595 GB/s

                        Global Memory Write
                        AccessType      : linear
                        VectorElements  : 4
                        Bandwidth       : 153.615 GB/s

                         

                         

                        # ProfilerVersion=2.4.1314

                        # Application=/opt/AMDAPP/samples/opencl/bin/x86_64/GlobalMemoryBandwidth

                        # ApplicationArgs=

                        # Device Cypress PlatformVendor=Advanced Micro Devices, Inc.

                        # Device Cypress PlatformName=AMD Accelerated Parallel Processing

                        # Device Cypress PlatformVersion=OpenCL 1.1 AMD-APP (831.4)

                        # Device Cypress CLDriverVersion=CAL 1.4.1646

                        # Device Cypress CLRuntimeVersion=OpenCL 1.1 AMD-APP (831.4)

                        # Device Cypress NumberAppAddressBits=32

                        # Device Intel(R) Core(TM)2 Quad  CPU   Q8200  @ 2.33GHz PlatformVendor=Advanced Micro Devices, Inc.

                        # Device Intel(R) Core(TM)2 Quad  CPU   Q8200  @ 2.33GHz PlatformName=AMD Accelerated Parallel Processing

                        # Device Intel(R) Core(TM)2 Quad  CPU   Q8200  @ 2.33GHz PlatformVersion=OpenCL 1.1 AMD-APP (831.4)

                        # Device Intel(R) Core(TM)2 Quad  CPU   Q8200  @ 2.33GHz CLDriverVersion=2.0

                        # Device Intel(R) Core(TM)2 Quad  CPU   Q8200  @ 2.33GHz CLRuntimeVersion=OpenCL 1.1 AMD-APP (831.4)

                        # Device Intel(R) Core(TM)2 Quad  CPU   Q8200  @ 2.33GHz NumberAppAddressBits=64

                        # OS=Ubuntu 11.04 \n \l

                        Method , ExecutionOrder , ThreadID , CallIndex , GlobalWorkSize , WorkGroupSize , Time , LocalMemSize , VGPRs , SGPRs , ScratchRegs , FCStacks , Wavefronts , ALUInsts , FetchInsts , WriteInsts , LDSFetchInsts , LDSWriteInsts , ALUBusy , ALUFetchRatio , ALUPacking , FetchSize , CacheHit , FetchUnitBusy , FetchUnitStalled , WriteUnitStalled , FastPath , CompletePath , PathUtilization , LDSBankConflict

                         

                         

                         

                         

                        read_linear_uncached__k3_Cypress1 ,  2638 , 2853 , 13297 , {1048576       1       1} , {  256     1     1} ,         6.97622 ,           0 ,     8 , NA ,     0 ,     0 ,     16384.00 ,        46.00 ,        32.00 ,         1.00 ,         0.00 ,         0.00 ,         2.54 ,         1.44 ,        86.52 ,    524288.00 ,         0.00 ,        23.04 ,        17.30 ,        54.48 ,     16385.00 ,         0.00 ,       100.00 ,         0.00

                         

                         

                        This is from my kernel (cache hit only about 9%, so mostly uncached), with similar percentages for FetchUnitBusy, FetchUnitStalled and WriteUnitStalled (3 KB local memory size). The profiler inside the APP SDK reports 19 (????) VGPRs; I remember 57 from the other profiler (the one downloaded separately). 57 is more realistic given the number of active wavefronts.

                         

                        DslashKernelEO__k5_Cypress1 ,   436 , 3514 , 14644 , {  65536       1       1} , {  128     1     1} , 0.49078 ,        3584 ,    19 , NA ,     0 ,     0 ,      1024.00 ,       235.00 ,        65.00 ,         6.00 ,         3.00 , 7.00 ,        10.93 ,         3.62 ,        80.09 ,     37885.19 , 9.69 ,        27.73 ,        16.99 ,        49.95 ,      3072.25 ,         0.00 ,       100.00 ,         0.00

                         

                         

                        This does not look good.
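The two profiler rows above can be turned into read bandwidth figures directly. The sketch below assumes that the Time column is in milliseconds and FetchSize is in kilobytes of 1024 bytes, which is how the AMD APP Profiler documents those counters, and that each of the 32 fetch instructions in the benchmark reads one 16-byte float4 per work-item. These unit assumptions and the function name are not confirmed anywhere in this thread.

```python
def read_bandwidth_gb_s(fetch_size_kb, time_ms):
    """Read bandwidth in GB/s from the FetchSize (KB) and Time (ms) columns."""
    bytes_fetched = fetch_size_kb * 1024
    seconds = time_ms / 1000
    return bytes_fetched / seconds / 1e9


# Cross-check of the benchmark row: work-items * fetch instructions * 16 bytes,
# expressed in KB, should equal the reported FetchSize of 524288.00.
expected_fetch_kb = 1048576 * 32 * 16 / 1024

# read_linear_uncached row: FetchSize = 524288.00, Time = 6.97622
benchmark_gb_s = read_bandwidth_gb_s(524288.00, 6.97622)

# DslashKernelEO row: FetchSize = 37885.19, Time = 0.49078
dslash_gb_s = read_bandwidth_gb_s(37885.19, 0.49078)

print(f"benchmark: {benchmark_gb_s:.1f} GB/s, Dslash reads: {dslash_gb_s:.1f} GB/s")
```

Under these assumptions the benchmark row works out to roughly 77 GB/s, in line with the figure the sample prints, and the Dslash kernel's reads alone to roughly 79 GB/s. Written data is not included in either number.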

                          • Re: Global Memory BandWidth
                            nyanthiss

                            Possibly related: I recently noticed that running MemoryOptimizations (from the AMD APP samples) on my GPU produces entirely different results under Windows 7 than under Linux (RHEL 6.2); see attachments.

                            @ibird: perhaps you could try compiling and running GlobalMemoryBandwidth under Windows to see if it makes any difference?

                              • Re: Global Memory BandWidth
                                ibird

                                This is the result on Windows. The linear uncached read is a little better at 80 GB/s, but still far from 140 GB/s.

                                 

                                 

                                 

                                Platform 0 : Advanced Micro Devices, Inc.
                                Platform found : Advanced Micro Devices, Inc.

                                Selected Platform Vendor : Advanced Micro Devices, Inc.
                                Device 0 : Cypress Device ID is 0000000002054400
                                Build Options are : -D DATATYPE=float4 -D OFFSET=16384

                                Global Memory Read
                                AccessType      : single
                                VectorElements  : 4
                                Bandwidth       : 1061.29 GB/s

                                Global Memory Read
                                AccessType      : linear
                                VectorElements  : 4
                                Bandwidth       : 617.82 GB/s

                                Global Memory Read
                                AccessType      : linear(uncached)
                                VectorElements  : 4
                                Bandwidth       : 80.9219 GB/s

                                Global Memory Write
                                AccessType      : linear
                                VectorElements  : 4
                                Bandwidth       : 151.803 GB/s

                      • Re: Global Memory BandWidth
                        elizabethswell

                        I like your approach because you are working well now a days.....