Global memory bandwidth

Discussion created by j_vd_sanden on Jul 21, 2011
Latest reply on Jul 29, 2011 by rick.weber


I am having trouble understanding some performance numbers from a simple copy kernel on an AMD HD5850 GPU. I use two versions, one operating on uint arrays and one on uint4 arrays; each simply copies a block of data from global memory to global memory. The kernels and their measured performance for different input sizes are shown below. For the Copy4 kernel I adjust the work size accordingly, and the output is verified to be correct.

__kernel void Copy(__global uint* input, __global uint* output)
{
    // One work-item copies one uint (4 bytes).
    uint p = get_global_id(0);
    output[p] = input[p];
}

input size    performance
512x512       0.020 ms (102.056 GB/s)
1024x1024     0.069 ms (116.134 GB/s)
2048x2048     0.266 ms (120.090 GB/s)
4096x4096     1.102 ms (116.202 GB/s)

__kernel void Copy4(__global uint4* input, __global uint4* output)
{
    // One work-item copies one uint4 (16 bytes).
    uint p = get_global_id(0);
    output[p] = input[p];
}
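
To make "adjust the work size accordingly" concrete, here is a minimal host-side sketch of how the number of work-items could be derived for both kernels. The helper name copy_global_size and its parameters are hypothetical and not taken from my actual host code; it assumes a 1D launch where every work-item moves exactly one element.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of work-items for a 1D copy of a
 * width x height array of uints, when each work-item moves
 * vec_width uints (1 for Copy, 4 for Copy4). */
static size_t copy_global_size(size_t width, size_t height, size_t vec_width)
{
    size_t n_uints = width * height;

    /* The element count must divide evenly, otherwise the last
     * work-item would read and write past the end of the buffers. */
    assert(vec_width > 0 && n_uints % vec_width == 0);

    return n_uints / vec_width;
}
```

For a 512x512 input this gives 262144 work-items for Copy and 65536 for Copy4. The result would be passed as the global work size to clEnqueueNDRangeKernel.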

input size    performance
512x512       0.012 ms (160.244 GB/s)
1024x1024     0.041 ms (194.813 GB/s)
2048x2048     0.194 ms (164.977 GB/s)
4096x4096     0.914 ms (140.045 GB/s)
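
For reference, this is roughly how an effective bandwidth figure for a copy can be computed from the kernel time. It is a sketch under stated assumptions, not the exact code behind the tables: the function names are hypothetical, every element is counted once as a read and once as a write, and 1 GB is taken as 1e9 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes moved by a copy of n_elems elements of elem_size bytes:
 * each element is read once and written once. */
static double copy_bytes_moved(size_t n_elems, size_t elem_size)
{
    return 2.0 * (double)n_elems * (double)elem_size;
}

/* Effective bandwidth in decimal GB/s (assumption: 1 GB = 1e9 bytes)
 * for a kernel that took elapsed_ms milliseconds. */
static double effective_gb_per_s(double bytes_moved, double elapsed_ms)
{
    assert(elapsed_ms > 0.0);
    return bytes_moved / (elapsed_ms * 1e-3) / 1e9;
}
```

With the rounded 0.020 ms from the first row of the Copy table this gives about 104.9 GB/s rather than the listed 102.056 GB/s. The gap may come from rounding of the printed time or from a different definition of GB.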

When I use uint4 arrays I get bandwidths far beyond the theoretical peak bandwidth of this device (128 GB/s). I thought caches might be the reason, but all data still has to pass through the global memory channel, which has a maximum throughput of 128 GB/s. What could explain these numbers?