Hi,
I have trouble understanding some performance numbers when running a simple copy kernel on an AMD HD5850 GPU. I use two versions, one with uint arrays and one with uint4 arrays; each simply copies a block of data from global to global memory. Below are the kernels and the performance for different input sizes. For the Copy4 kernel I adjust the work size accordingly (see the sketch after the second table); the output is verified and correct.
__kernel void Copy(__global uint* input, __global uint* output)
{
    uint p = get_global_id(0);
    output[p] = input[p];
}
input size    time (bandwidth)
512x512 0.020 ms (102.056 GB/s)
1024x1024 0.069 ms (116.134 GB/s)
2048x2048 0.266 ms (120.090 GB/s)
4096x4096 1.102 ms (116.202 GB/s)
__kernel void Copy4(__global uint4* input, __global uint4* output)
{
    uint p = get_global_id(0);
    output[p] = input[p];
}
input size    time (bandwidth)
512x512 0.012 ms (160.244 GB/s)
1024x1024 0.041 ms (194.813 GB/s)
2048x2048 0.194 ms (164.977 GB/s)
4096x4096 0.914 ms (140.045 GB/s)
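For reference, the only host-side difference between the two runs is the global work size; a sketch of what I mean by adjusting it (variable names are just illustrative):

size_t n    = 4096 * 4096;  /* total number of uint elements            */
size_t gws1 = n;            /* global work size for the Copy kernel     */
size_t gws4 = n / 4;        /* Copy4 moves one uint4 (4 elements) per
                               work-item, so a quarter of the items     */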
When I use uint4 arrays I get bandwidths far beyond the theoretical bandwidth of this device (128 GB/s). I thought about caches as the reason, but all data still has to go through the global memory channel, which has a maximum throughput of 128 GB/s... What could explain these numbers?
Hi,
Could it have to do with how you do the timing? Do you ensure that the memory transfer has finished, e.g. with clFinish()? Which functions do you use for the timing?
Cheers,
Dominic
I'm using clWaitForEvents() and then clGetEventProfilingInfo() to query the start and end times of the kernel. I also tried clFinish(), and that gave exactly the same results.
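For clarity, the pattern looks roughly like this (a sketch; queue, kernel and gws stand in for my real host code, and the queue is created with CL_QUEUE_PROFILING_ENABLE):

cl_event ev;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, &ev);
clWaitForEvents(1, &ev);

cl_ulong t_start, t_end;
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                        sizeof(t_start), &t_start, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                        sizeof(t_end), &t_end, NULL);

double ms = (t_end - t_start) * 1e-6;  /* profiling timestamps are in ns */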
Micah, could you please explain how this can be caused by caching? From my point of view, before a kernel starts all input data is in global memory, and after it finishes all output data is in global memory. So during kernel execution all data crosses the global memory channel twice, and that channel has a maximum throughput of 128 GB/s.
Thanks,
Jarno
OK, thanks for the explanation. So if I understand correctly, fetching data from global memory into the last-level cache has a higher throughput (at least higher than 128 GB/s on the 5850)? Otherwise it would still not be possible to exceed the 128 GB/s limit.
Originally posted by: MicahVillmow No, global memory on your chip has a 128 GB/s limit, but you're not testing just global memory. You're testing global memory + L1 and L2 cache speeds. Because the L1 and L2 speeds are dramatically faster, your overall numbers will be faster than the global memory bandwidth.
Does that explain why the throughput goes down for larger buffer sizes in the Copy4 case (more cache misses)?
Originally posted by: MicahVillmow No, global memory on your chip has a 128 GB/s limit, but you're not testing just global memory. You're testing global memory + L1 and L2 cache speeds. Because the L1 and L2 speeds are dramatically faster, your overall numbers will be faster than the global memory bandwidth.
But the data in L1 and L2 is also loaded from global device memory. In the above example, all the data is loaded exactly once from global device memory, so in my opinion the bandwidth can never exceed 128 GB/s in any case. Could you explain further?
This situation has me a bit confused. Your formula is 2 * sizeof(uint) * n, correct?
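Just to make sure we are computing the same thing, this is the accounting I have in mind (a sketch; each element is read once and written once):

size_t n     = 4096 * 4096;               /* elements                         */
double bytes = 2.0 * sizeof(cl_uint) * n; /* one read + one write per element */
double ms    = 1.102;                     /* measured kernel time             */
double gbps  = bytes / (ms * 1e-3) / 1e9; /* using 10^9 B/GB; dividing by
                                             2^30 instead gives slightly
                                             lower numbers                    */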
Firstly, since he's not using const restrict, I didn't think he would get cached reads, just coalescing.
How can caching give you more than global memory bandwidth if you never reuse data beyond hitting adjacent elements in the same cache line? With a 64-byte cache line and 4-byte uints, a line holds 16 elements; even if you hit each element in the line exactly once, the full 64 bytes of that line still have to be loaded from global memory once, so you should still be global-memory bound.
Unfortunately, I don't understand how one can get more than 140 GB/s throughput. My logic is the following: there is no data reuse, so all the data has to be read from global memory. Whether some of this data ends up in the L1/L2 caches or not is irrelevant.
P.S. Micah, when can we expect the SDK 2.5 release, please? I hope the issue(s) with #pragma unroll are fixed in that release.
Oh, I have an idea: it is MEM_RAT_CACHELESS_STORE. Even when all work-items have finished, the hardware may still be streaming data out to global memory.
Originally posted by: MicahVillmow Also, one thing to keep in mind: the architecture has always cached data within a texture clause, but the cache is invalidated after the texture clause is over. The caching introduced in SDK 2.4 allowed caching between texture clauses.
I think this reply can also explain the performance numbers in the post (VF=4). Right?
Yes. Even without caching, the GPU is capable of data reuse within a fetch clause. For example,
kernel void foo(global float4 *out, global float4 *in, const uint zero)
{
    uint gid = get_global_id(0);
    out[gid] = in[gid] + in[gid+zero] + in[gid+zero] + in[gid+zero];
}
If zero == 0, then all the fetches are from the same location, so you will hit the cache *if all the fetches are scheduled to the same fetch clause*.
Jeff
Hey guys, I did another experiment (the code is attached). The difference is that this test case only loads data from device global memory; there are no store operations:
VF can be seen as a vectorization factor; normally VF=4, as in float4 or int4. Here VF = 1, 2, 4, 8, ..., 128.
Testbed: HD6970, AMD APP v2.4, and Ubuntu 11.04.
Performance numbers (global memory bandwidth, in GB/s) are listed below for three problem sizes:

         | 1024x1024 | 2048x2048 | 4096x4096
VF=1     |     23.42 |     41.68 |     48.47
VF=2     |     33.37 |     64.92 |     90.06
VF=4     |     40.75 |     97.91 |    167.35
VF=8     |     45.92 |    133.43 |    267.21
VF=16    |     59.05 |    164.67 |    389.95
VF=32    |     55.49 |    182.12 |    531.92
VF=64    |     55.42 |    213.87 |    644.10
VF=128   |     68.04 |    221.97 |    723.39
As can be seen from the table, with VF=128 and a problem size of 4096x4096 we reach 723 GB/s, while the theoretical bandwidth of this device is around 170 GB/s. Is this also due to the L1/L2 caches?
__kernel void load_memory_VF(const __global float *i_matrix)
{
    int idx = get_global_id(0) * VF;
    float res_0;
    float res_1;
    ...
    float res_VF-1;
    res_0 = i_matrix[idx + 0];
    res_1 = i_matrix[idx + 1];
    ...
    res_VF-1 = i_matrix[idx + VF - 1];
}
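For concreteness, this is what the generated source looks like for VF=4 (a sketch; the real code is produced by expanding the pattern above, so the exact names may differ):

/* hypothetical instantiation of the pattern above for VF=4 */
__kernel void load_memory_4(const __global float *i_matrix)
{
    int idx = get_global_id(0) * 4;
    float res_0, res_1, res_2, res_3;
    res_0 = i_matrix[idx + 0];
    res_1 = i_matrix[idx + 1];
    res_2 = i_matrix[idx + 2];
    res_3 = i_matrix[idx + 3];
    /* note: the results are never stored anywhere */
}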
I suspect that the memory loads could be optimized away if there are no stores to global memory. The entire kernel would then be reduced to a no-op.
Originally posted by: dmeiser I suspect that the memory loads could be optimized away if there are no stores to global memory. The entire kernel would then be reduced to a no-op.
Yes, I think so. It is because of the aggressive compiler. :-)
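A sketch of how the loads could be kept alive (untested here; o_dummy is a hypothetical one-element output buffer): fold the loads into a value whose store the compiler cannot prove dead. The condition is never true in practice, so the kernel stays effectively load-only:

__kernel void load_memory_4(const __global float *i_matrix,
                            __global float *o_dummy)
{
    int idx = get_global_id(0) * 4;
    float sum = i_matrix[idx + 0] + i_matrix[idx + 1]
              + i_matrix[idx + 2] + i_matrix[idx + 3];
    /* the store depends on the loaded values, so the compiler must
       keep the loads even though the branch is never taken */
    if (sum == -1.0f)
        o_dummy[0] = sum;
}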
Did someone verify the performance numbers by executing the streaming copy kernel? Do you also get bandwidths above the theoretical global memory bandwidth?
Is there a possibility to turn off or bypass the caches to see whether the effect is cache-related?
About the timing: when does a kernel actually finish? Is it when the last byte is written to global memory, or when the last work-item completes its final instruction?
Actually, you're calling clFinish() right after you issue the kernel, right? If not, you're just timing how long it takes to put the kernel into the command queue, not its execution time.
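In other words, the host code should look roughly like this (a sketch using gettimeofday(); any host timer works):

#include <sys/time.h>

struct timeval t0, t1;
gettimeofday(&t0, NULL);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
clFinish(queue);  /* without this, only the enqueue is timed */
gettimeofday(&t1, NULL);
double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_usec - t0.tv_usec) * 1e-3;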
On the GPU, streaming copies cannot exceed global memory bandwidth, even with caching, as the bottleneck is how fast you can fill L1/L2. In this case, L1/L2 help prevent you from going to global memory more than once per cacheline, nothing more.
Make sure your timing is correct and that you are waiting for kernel execution to finish.
Jeff
Micah, your explanation is clear, but it doesn't explain why there is a peak at a 4 MB transfer. The algorithm you described would give a steady transfer speed starting from rather small buffers.
Honestly, I believe you just described how 140 GB/s is achieved, no less and no more.