4 Replies Latest reply on Jan 9, 2010 6:08 PM by BarnacleJunior

    42 GB/s memory bandwidth on HD5850 DX11

    BarnacleJunior
      theoretical is 128GB/s

      Is 42GB/s really correct for memory bandwidth on the HD5850 - what happened to 128GB/s?  I tested ping-ponging between 8MB arrays 20,000 times under cs_5_0 with a variety of parameters (type - uint, uint2, uint4; number of threads; number of values per thread), and 42GB/s is the peak no matter the parameters, as long as each thread streams 32 bytes or fewer.  Above that it craters.  This is consistent with the performance of some fairly ALU-dense routines I've written - they top out at 40GB/s.  Will driver updates bring this figure up, or is it a DX problem, or a structured buffer problem?

       

      Compiling... Benchmarking...

      (c=1, t=64, v=1): 33.753GB/s 33.922GB/s 33.918GB/s
      (c=1, t=64, v=2): 41.767GB/s 41.791GB/s 41.790GB/s
      (c=1, t=64, v=4): 41.559GB/s 41.581GB/s 41.581GB/s
      (c=1, t=64, v=8): 41.515GB/s 41.550GB/s 41.521GB/s
      (c=1, t=64, v=16): 15.478GB/s 15.486GB/s 15.480GB/s
      (c=1, t=64, v=32): 13.527GB/s 13.539GB/s 13.536GB/s
      (c=1, t=128, v=1): 41.777GB/s 41.799GB/s 41.797GB/s
      (c=1, t=128, v=2): 41.730GB/s 41.760GB/s 41.760GB/s
      (c=1, t=128, v=4): 41.536GB/s 41.581GB/s 41.592GB/s
      (c=1, t=128, v=8): 41.192GB/s 41.223GB/s 41.176GB/s
      (c=1, t=128, v=16): 15.486GB/s 15.501GB/s 15.495GB/s
      (c=1, t=128, v=32): 13.177GB/s 13.186GB/s 13.183GB/s
      (c=1, t=256, v=1): 41.776GB/s 41.799GB/s 41.796GB/s
      (c=1, t=256, v=2): 41.719GB/s 41.745GB/s 41.748GB/s
      (c=1, t=256, v=4): 41.229GB/s 41.253GB/s 41.255GB/s
      (c=1, t=256, v=8): 41.276GB/s 41.349GB/s 41.321GB/s
      (c=1, t=256, v=16): 14.887GB/s 14.894GB/s 14.891GB/s
      (c=1, t=256, v=32): 13.069GB/s 13.078GB/s 13.079GB/s
      (c=2, t=64, v=1): 41.766GB/s 41.761GB/s 41.790GB/s
      (c=2, t=64, v=2): 41.550GB/s 40.516GB/s 41.575GB/s
      (c=2, t=64, v=4): 41.475GB/s 41.560GB/s 41.560GB/s
      (c=2, t=64, v=8): 15.586GB/s 15.592GB/s 15.596GB/s
      (c=2, t=64, v=16): 13.887GB/s 13.892GB/s 13.892GB/s
      (c=2, t=64, v=32): 10.235GB/s 10.245GB/s 10.245GB/s
      (c=2, t=128, v=1): 41.695GB/s 41.751GB/s 41.672GB/s
      (c=2, t=128, v=2): 41.539GB/s 41.547GB/s 41.576GB/s
      (c=2, t=128, v=4): 41.190GB/s 41.218GB/s 41.219GB/s
      (c=2, t=128, v=8): 15.583GB/s 15.593GB/s 15.593GB/s
      (c=2, t=128, v=16): 13.365GB/s 13.369GB/s 13.369GB/s
      (c=2, t=128, v=32): 10.186GB/s 10.196GB/s 10.196GB/s
      (c=2, t=256, v=1): 41.689GB/s 41.708GB/s 41.708GB/s
      (c=2, t=256, v=2): 41.220GB/s 41.244GB/s 41.247GB/s
      (c=2, t=256, v=4): 41.314GB/s 41.311GB/s 41.314GB/s
      (c=2, t=256, v=8): 15.189GB/s 15.200GB/s 15.196GB/s
      (c=2, t=256, v=16): 13.298GB/s 13.311GB/s 13.310GB/s
      (c=2, t=256, v=32): 9.851GB/s 9.857GB/s 9.860GB/s
      (c=4, t=64, v=1): 41.520GB/s 41.576GB/s 41.542GB/s
      (c=4, t=64, v=2): 41.518GB/s 41.525GB/s 41.528GB/s
      (c=4, t=64, v=4): 16.848GB/s 16.860GB/s 16.852GB/s
      (c=4, t=64, v=8): 13.633GB/s 13.639GB/s 13.642GB/s
      (c=4, t=64, v=16): 10.281GB/s 10.292GB/s 10.292GB/s
      (c=4, t=64, v=32): 6.843GB/s 6.850GB/s 6.850GB/s
      (c=4, t=128, v=1): 41.558GB/s 41.550GB/s 41.548GB/s
      (c=4, t=128, v=2): 41.184GB/s 41.217GB/s 41.224GB/s
      (c=4, t=128, v=4): 16.510GB/s 16.526GB/s 16.520GB/s
      (c=4, t=128, v=8): 13.223GB/s 13.229GB/s 13.227GB/s
      (c=4, t=128, v=16): 10.184GB/s 10.194GB/s 10.192GB/s
      (c=4, t=128, v=32): 6.738GB/s 6.749GB/s 6.749GB/s
      (c=4, t=256, v=1): 41.191GB/s 41.216GB/s 41.222GB/s
      (c=4, t=256, v=2): 41.280GB/s 41.358GB/s 41.356GB/s
      (c=4, t=256, v=4): 16.066GB/s 16.090GB/s 16.061GB/s
      (c=4, t=256, v=8): 13.078GB/s 13.099GB/s 13.095GB/s
      (c=4, t=256, v=16): 9.804GB/s 9.819GB/s 9.822GB/s
      (c=4, t=256, v=32): 6.613GB/s 6.619GB/s 6.620GB/s

      #if COMPONENTS==1
      #define T uint
      #elif COMPONENTS==2
      #define T uint2
      #elif COMPONENTS==4
      #define T uint4
      #endif

      StructuredBuffer<T> source_srv : register(t0);
      RWStructuredBuffer<T> target_uav : register(u0);

      [numthreads(NUM_THREADS, 1, 1)]
      void CopyBuffers(uint tid : SV_GroupIndex, uint3 groupID : SV_GroupID)
      {
          uint gid = groupID.x;
          uint target = VALUES_PER_THREAD * (NUM_THREADS * gid + tid);

          T values[VALUES_PER_THREAD];
          [unroll]
          for(uint i = 0; i < VALUES_PER_THREAD; ++i)
              values[i] = source_srv[target + i] + i + 1;

          [unroll]
          for(i = 0; i < VALUES_PER_THREAD; ++i)
              target_uav[target + i] = values[i];
      }

        • 42 GB/s memory bandwidth on HD5850 DX11
          eduardoschardong

          In your code you are doing the same number of reads and writes, so the maximum is 64GB/s for reading and 64GB/s for writing (are you taking that into account?).  For some reason, writing is slower than reading.

           

            • 42 GB/s memory bandwidth on HD5850 DX11
              BarnacleJunior

              Oh.. I thought it was 128GB/s each way.  Well, I'm still 50% off.  I ran the SiSoft Sandra test and it reported 100GB/s.  What did they do in their shader to get that extra 25%?

                • 42 GB/s memory bandwidth on HD5850 DX11
                  eduardoschardong

                  Don't know - maybe they only perform reads?  Writes are slower; I don't know why, but I only got about half the read bandwidth with writes.

                   

                    • 42 GB/s memory bandwidth on HD5850 DX11
                      BarnacleJunior

                      Apparently so.  I ran my own read/write benchmarks (for the read test it's tricky to keep the compiler from optimizing everything away) and I'm seeing that write peaks at about 41.7GB/s and read at about 99.9GB/s.  I also used strided access between threads (as in http://developer.amd.com/gpu/ATIStreamSDK/assets/ATI_Stream_SDK_Performance_Notes.pdf ).  It doesn't improve the peak bandwidth, but it makes the worst cases not as bad.

                      How the 128GB/s figure is arrived at I don't know.  It's clear that to get a meaningful measure of bandwidth you can't add up the read and write values; rather you have to take the min of them.  If I'm writing a DWORD for every DWORD I read, the bandwidth is the lesser of the two (so 41.7GB/s).

                      I'm attaching the shader code for my read bandwidth test.  In this, v = number of DWORDs read per thread.  I'm writing one DWORD per threadgroup and doing a barrier.  I had to come up with a kludgy way to prevent the compiler from eliminating the reads, and this seems to do it.  It's very surprising how variable the read values are; the write values are very consistent at 41.7GB/s.

                      Compiling... Benchmarking...

                      (t=64, v=8): 80.756GB/s 81.704GB/s 81.703GB/s
                      (t=64, v=16): 81.036GB/s 81.291GB/s 81.344GB/s
                      (t=64, v=32): 74.778GB/s 74.969GB/s 74.939GB/s
                      (t=128, v=8): 96.150GB/s 96.408GB/s 96.442GB/s
                      (t=128, v=16): 99.751GB/s 99.936GB/s 99.972GB/s
                      (t=128, v=32): 84.152GB/s 84.359GB/s 84.261GB/s
                      (t=256, v=8): 98.641GB/s 98.910GB/s 98.887GB/s
                      (t=256, v=16): 95.946GB/s 96.147GB/s 96.047GB/s
                      (t=256, v=32): 92.680GB/s 93.004GB/s 92.854GB/s
                      (t=512, v=8): 96.164GB/s 96.625GB/s 96.577GB/s
                      (t=512, v=16): 72.789GB/s 73.001GB/s 73.000GB/s
                      (t=512, v=32): 91.743GB/s 92.042GB/s 92.040GB/s
                      (t=1024, v=8): 82.150GB/s 82.265GB/s 82.163GB/s
                      (t=1024, v=16): 84.664GB/s 84.992GB/s 84.972GB/s
                      (t=1024, v=32): 79.538GB/s 79.714GB/s 79.645GB/s

                      #define WAVEFRONT 64
                      #include "scancommon.hlsl"

                      StructuredBuffer<uint> source_srv : register(t0);
                      RWStructuredBuffer<uint> target_uav : register(u0);

                      groupshared uint common[NUM_THREADS];

                      [numthreads(NUM_THREADS, 1, 1)]
                      void CopyBuffers(uint tid : SV_GroupIndex, uint3 groupID : SV_GroupID)
                      {
                          uint gid = groupID.x;
                          uint target = VALUES_PER_THREAD * NUM_THREADS * gid + tid;

                          PrepareThreadSum(tid);

                          Counter word;
                          ReadCounterFromSRVStride(source_srv, target, word);
                          common[tid] = HorizontalSum(word);
                          barrier();

                          if((NUM_THREADS - 1) == tid)
                              target_uav[gid] = common[0];
                      }