BarnacleJunior

42 GB/s memory bandwidth on HD5850 DX11

Discussion created by BarnacleJunior on Jan 9, 2010
Latest reply on Jan 9, 2010 by BarnacleJunior
theoretical is 128GB/s

Is 42GB/s really the correct memory bandwidth for the HD5850? What happened to 128GB/s? I tested ping-ponging between 8MB arrays 20,000 times under cs_5_0 with a variety of parameters (element type: uint, uint2, or uint4; number of threads; number of values per thread), and 42GB/s is the peak no matter the parameters, as long as each thread streams 32 bytes or fewer. Above that, throughput craters. This is consistent with the performance of some fairly ALU-dense routines I've written, which top out at 40GB/s. Will driver updates bring this number up, or is it a DX problem, or a structured buffer problem?
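For reference, here is a rough sketch of the arithmetic behind the two figures being compared. It is not the benchmark's actual host code: the function names are made up for illustration, the 8.0 s elapsed time is a hypothetical value, and it assumes the reported number counts both the read and the write of every pass with 1 GB taken as 1e9 bytes. The post does not say which convention the benchmark uses, and counting one direction only would halve the result.

```python
def peak_bandwidth_gbs(bus_width_bits, data_rate_gtps):
    """Theoretical peak in GB/s: bytes per transfer times transfers per second."""
    return bus_width_bits / 8 * data_rate_gtps


def copy_bandwidth_gbs(buffer_bytes, passes, seconds, count_both_directions=True):
    """Effective GB/s of a copy benchmark (assumes 1 GB = 1e9 bytes)."""
    # Each pass reads one buffer and writes the other, so a full pass
    # moves 2x the buffer size if both directions are counted.
    bytes_per_pass = buffer_bytes * (2 if count_both_directions else 1)
    return bytes_per_pass * passes / seconds / 1e9


if __name__ == "__main__":
    # HD 5850 memory: 256-bit bus, GDDR5 at 4.0 GT/s effective per pin.
    print(peak_bandwidth_gbs(256, 4.0))
    # 8 MB buffers and 20,000 passes are from the post;
    # the 8.0 s elapsed time is a hypothetical value for illustration.
    print(copy_bandwidth_gbs(8 * 2**20, 20_000, 8.0))
```

When comparing a measured figure against the theoretical one, it is worth checking which of the two counting conventions the timer code uses.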

 

Compiling...
Benchmarking...
(c=1, t=64, v=1): 33.753GB/s 33.922GB/s 33.918GB/s
(c=1, t=64, v=2): 41.767GB/s 41.791GB/s 41.790GB/s
(c=1, t=64, v=4): 41.559GB/s 41.581GB/s 41.581GB/s
(c=1, t=64, v=8): 41.515GB/s 41.550GB/s 41.521GB/s
(c=1, t=64, v=16): 15.478GB/s 15.486GB/s 15.480GB/s
(c=1, t=64, v=32): 13.527GB/s 13.539GB/s 13.536GB/s
(c=1, t=128, v=1): 41.777GB/s 41.799GB/s 41.797GB/s
(c=1, t=128, v=2): 41.730GB/s 41.760GB/s 41.760GB/s
(c=1, t=128, v=4): 41.536GB/s 41.581GB/s 41.592GB/s
(c=1, t=128, v=8): 41.192GB/s 41.223GB/s 41.176GB/s
(c=1, t=128, v=16): 15.486GB/s 15.501GB/s 15.495GB/s
(c=1, t=128, v=32): 13.177GB/s 13.186GB/s 13.183GB/s
(c=1, t=256, v=1): 41.776GB/s 41.799GB/s 41.796GB/s
(c=1, t=256, v=2): 41.719GB/s 41.745GB/s 41.748GB/s
(c=1, t=256, v=4): 41.229GB/s 41.253GB/s 41.255GB/s
(c=1, t=256, v=8): 41.276GB/s 41.349GB/s 41.321GB/s
(c=1, t=256, v=16): 14.887GB/s 14.894GB/s 14.891GB/s
(c=1, t=256, v=32): 13.069GB/s 13.078GB/s 13.079GB/s
(c=2, t=64, v=1): 41.766GB/s 41.761GB/s 41.790GB/s
(c=2, t=64, v=2): 41.550GB/s 40.516GB/s 41.575GB/s
(c=2, t=64, v=4): 41.475GB/s 41.560GB/s 41.560GB/s
(c=2, t=64, v=8): 15.586GB/s 15.592GB/s 15.596GB/s
(c=2, t=64, v=16): 13.887GB/s 13.892GB/s 13.892GB/s
(c=2, t=64, v=32): 10.235GB/s 10.245GB/s 10.245GB/s
(c=2, t=128, v=1): 41.695GB/s 41.751GB/s 41.672GB/s
(c=2, t=128, v=2): 41.539GB/s 41.547GB/s 41.576GB/s
(c=2, t=128, v=4): 41.190GB/s 41.218GB/s 41.219GB/s
(c=2, t=128, v=8): 15.583GB/s 15.593GB/s 15.593GB/s
(c=2, t=128, v=16): 13.365GB/s 13.369GB/s 13.369GB/s
(c=2, t=128, v=32): 10.186GB/s 10.196GB/s 10.196GB/s
(c=2, t=256, v=1): 41.689GB/s 41.708GB/s 41.708GB/s
(c=2, t=256, v=2): 41.220GB/s 41.244GB/s 41.247GB/s
(c=2, t=256, v=4): 41.314GB/s 41.311GB/s 41.314GB/s
(c=2, t=256, v=8): 15.189GB/s 15.200GB/s 15.196GB/s
(c=2, t=256, v=16): 13.298GB/s 13.311GB/s 13.310GB/s
(c=2, t=256, v=32): 9.851GB/s 9.857GB/s 9.860GB/s
(c=4, t=64, v=1): 41.520GB/s 41.576GB/s 41.542GB/s
(c=4, t=64, v=2): 41.518GB/s 41.525GB/s 41.528GB/s
(c=4, t=64, v=4): 16.848GB/s 16.860GB/s 16.852GB/s
(c=4, t=64, v=8): 13.633GB/s 13.639GB/s 13.642GB/s
(c=4, t=64, v=16): 10.281GB/s 10.292GB/s 10.292GB/s
(c=4, t=64, v=32): 6.843GB/s 6.850GB/s 6.850GB/s
(c=4, t=128, v=1): 41.558GB/s 41.550GB/s 41.548GB/s
(c=4, t=128, v=2): 41.184GB/s 41.217GB/s 41.224GB/s
(c=4, t=128, v=4): 16.510GB/s 16.526GB/s 16.520GB/s
(c=4, t=128, v=8): 13.223GB/s 13.229GB/s 13.227GB/s
(c=4, t=128, v=16): 10.184GB/s 10.194GB/s 10.192GB/s
(c=4, t=128, v=32): 6.738GB/s 6.749GB/s 6.749GB/s
(c=4, t=256, v=1): 41.191GB/s 41.216GB/s 41.222GB/s
(c=4, t=256, v=2): 41.280GB/s 41.358GB/s 41.356GB/s
(c=4, t=256, v=4): 16.066GB/s 16.090GB/s 16.061GB/s
(c=4, t=256, v=8): 13.078GB/s 13.099GB/s 13.095GB/s
(c=4, t=256, v=16): 9.804GB/s 9.819GB/s 9.822GB/s
(c=4, t=256, v=32): 6.613GB/s 6.619GB/s 6.620GB/s

#if COMPONENTS==1
    #define T uint
#elif COMPONENTS==2
    #define T uint2
#elif COMPONENTS==4
    #define T uint4
#endif

StructuredBuffer<T> source_srv : register(t0);
RWStructuredBuffer<T> target_uav : register(u0);

[numthreads(NUM_THREADS, 1, 1)]
void CopyBuffers(uint tid : SV_GroupIndex, uint3 groupID : SV_GroupID) {
    uint gid = groupID.x;
    uint target = VALUES_PER_THREAD * (NUM_THREADS * gid + tid);

    T values[VALUES_PER_THREAD];

    [unroll]
    for(uint i = 0; i < VALUES_PER_THREAD; ++i)
        values[i] = source_srv[target + i] + i + 1;

    [unroll]
    for(i = 0; i < VALUES_PER_THREAD; ++i)
        target_uav[target + i] = values[i];
}
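To make the 32-byte threshold easier to see, here is a small sketch that groups the t=128 results by the number of bytes each thread streams. The numbers are the first-run figures copied from the log above. The helper names are made up for illustration, and the sketch assumes that c and v in the log correspond to COMPONENTS and VALUES_PER_THREAD in the shader, with each uint component taking 4 bytes.

```python
from collections import defaultdict

COMPONENT_BYTES = 4  # one uint component is 32 bits

# First-run GB/s at t=128, copied from the benchmark log: (c, v) -> GB/s
T128_FIRST_RUN = {
    (1, 1): 41.777, (1, 2): 41.730, (1, 4): 41.536,
    (1, 8): 41.192, (1, 16): 15.486, (1, 32): 13.177,
    (2, 1): 41.695, (2, 2): 41.539, (2, 4): 41.190,
    (2, 8): 15.583, (2, 16): 13.365, (2, 32): 10.186,
    (4, 1): 41.558, (4, 2): 41.184, (4, 4): 16.510,
    (4, 8): 13.223, (4, 16): 10.184, (4, 32): 6.738,
}


def bytes_per_thread(components, values_per_thread):
    """Bytes one thread reads (and writes) in a single dispatch."""
    return components * COMPONENT_BYTES * values_per_thread


def group_by_bytes(results):
    """Map bytes-per-thread -> list of measured GB/s, sorted by byte count."""
    groups = defaultdict(list)
    for (c, v), gbs in results.items():
        groups[bytes_per_thread(c, v)].append(gbs)
    return dict(sorted(groups.items()))


if __name__ == "__main__":
    for nbytes, rates in group_by_bytes(T128_FIRST_RUN).items():
        print(f"{nbytes:4d} bytes/thread: " + ", ".join(f"{r:.3f}" for r in rates))
```

Grouped this way, every configuration at or below 32 bytes per thread sits above 41GB/s and every one above 32 bytes falls below 17GB/s, whichever element type is used.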
