Hi,
I have trouble understanding some performance numbers when running a simple copy kernel on an AMD HD5850 GPU. I use two versions, one with uint arrays and one with uint4 arrays; each simply copies a block of data from global to global memory. Below are the kernels and the performance for different input sizes. For the Copy4 kernel I adjust the work size accordingly (see the sketch after the second table); the output is verified and correct.
__kernel void Copy(__global uint* input, __global uint* output)
{
    uint p = get_global_id(0);
    output[p] = input[p];
}
input size    time (bandwidth)
512x512 0.020 ms (102.056 GB/s)
1024x1024 0.069 ms (116.134 GB/s)
2048x2048 0.266 ms (120.090 GB/s)
4096x4096 1.102 ms (116.202 GB/s)
__kernel void Copy4(__global uint4* input, __global uint4* output)
{
    uint p = get_global_id(0);
    output[p] = input[p];
}
input size    time (bandwidth)
512x512 0.012 ms (160.244 GB/s)
1024x1024 0.041 ms (194.813 GB/s)
2048x2048 0.194 ms (164.977 GB/s)
4096x4096 0.914 ms (140.045 GB/s)
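For reference, the only host-side difference between the two runs is the global work size; a sketch of what I mean by adjusting it (variable names are just illustrative):

size_t n    = 4096 * 4096;  /* total number of uint elements            */
size_t gws1 = n;            /* global work size for the Copy kernel     */
size_t gws4 = n / 4;        /* Copy4 moves one uint4 (4 elements) per
                               work-item, so a quarter of the items     */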
When I use uint4 arrays I get bandwidths far beyond the theoretical bandwidth of this device (128 GB/s). I thought about caches as the reason, but all data still has to go through the global memory channel, which has a maximum throughput of 128 GB/s... What could explain these numbers?
Hi,
Could it have to do with how you do the timing? Do you ensure that the memory transfer has finished, e.g. with clFinish()? Which functions do you use for the timing?
Cheers,
Dominic
I'm using clWaitForEvents() and then clGetEventProfilingInfo() to query the start and end times of the kernel. I also tried clFinish(), and that gave exactly the same results.
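For clarity, the pattern looks roughly like this (a sketch; queue, kernel and gws stand in for my real host code, and the queue is created with CL_QUEUE_PROFILING_ENABLE):

cl_event ev;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, &ev);
clWaitForEvents(1, &ev);

cl_ulong t_start, t_end;
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                        sizeof(t_start), &t_start, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                        sizeof(t_end), &t_end, NULL);

double ms = (t_end - t_start) * 1e-6;  /* profiling timestamps are in ns */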
Micah, could you please explain how this can be caused by caching? From my point of view, before a kernel starts all input data is in global memory, and after it finishes all output data is in global memory. So during kernel execution all data crosses the global memory channel twice, and that channel has a maximum throughput of 128 GB/s.
Thanks,
Jarno
OK, thanks for the explanation. So if I understand correctly, fetching data from global memory into the last-level cache has a higher throughput (at least higher than 128 GB/s on the 5850)? Otherwise it would still not be possible to exceed the 128 GB/s limit.
Originally posted by: MicahVillmow No, global memory on your chip has a 128 GB/s limit, but you're not testing just global memory. You're testing global memory + L1 and L2 cache speeds. Because the L1 and L2 speeds are dramatically faster, your overall numbers will be faster than the global memory bandwidth.
Does that explain why the throughput goes down for larger buffer sizes in the Copy4 case (more cache misses)?
Originally posted by: MicahVillmow No, global memory on your chip has a 128 GB/s limit, but you're not testing just global memory. You're testing global memory + L1 and L2 cache speeds. Because the L1 and L2 speeds are dramatically faster, your overall numbers will be faster than the global memory bandwidth.
But the data in L1 and L2 is also loaded from global device memory. In the above example, all the data is loaded exactly once from global device memory, so in my opinion the bandwidth can never exceed 128 GB/s in any case. Could you explain further?
This situation has me a bit confused. Your formula is 2 * sizeof(uint) * n, correct?
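Just to make sure we are computing the same thing, this is the accounting I have in mind (a sketch; each element is read once and written once):

size_t n     = 4096 * 4096;               /* elements                         */
double bytes = 2.0 * sizeof(cl_uint) * n; /* one read + one write per element */
double ms    = 1.102;                     /* measured kernel time             */
double gbps  = bytes / (ms * 1e-3) / 1e9; /* using 10^9 B/GB; dividing by
                                             2^30 instead gives slightly
                                             lower numbers                    */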
Firstly, since he's not using const restrict, I didn't think he would get cached reads, just coalescing.
How can caching give you more than global memory bandwidth if you never reuse data beyond hitting adjacent elements in the same cache line? With a 64-byte cache line and 4-byte uints, a line holds 16 elements; even if you hit each element in the line exactly once, the full 64 bytes of that line still have to be loaded from global memory once, so you should still be global-memory bound.
Unfortunately, I don't understand how one can get more than 140 GB/s throughput. My logic is the following: there is no data reuse, so all the data has to be read from global memory. Whether some of this data ends up in the L1/L2 caches or not is irrelevant.
P.S. Micah, when can we expect the SDK 2.5 release, please? I hope the issue(s) with #pragma unroll are fixed in that release.
Oh, I have an idea: it is MEM_RAT_CACHELESS_STORE. Even when all work-items have finished, the hardware may still be streaming data out to global memory.
Originally posted by: MicahVillmow Also, one thing to keep in mind: the architecture has always cached data within a texture clause, but the cache is invalidated after the texture clause is over. The caching introduced in SDK 2.4 allowed caching between texture clauses.
I think this reply can also explain the performance numbers in the post (VF=4). Right?
Yes. Even without caching, the GPU is capable of data reuse within a fetch clause. For example,
kernel void foo(global float4 *out, global float4 *in, const uint zero)
{
    uint gid = get_global_id(0);
    out[gid] = in[gid] + in[gid+zero] + in[gid+zero] + in[gid+zero];
}
If zero == 0, then all the fetches are from the same location, so you will hit the cache *if all the fetches are scheduled to the same fetch clause*.
Jeff
Hey guys, I did another experiment (the code is attached). The difference is that this test case only loads data from device global memory; there are no store operations:
VF can be seen as a vectorization factor; normally VF=4, as in float4 or int4. Here VF = 1, 2, 4, 8, ..., 128.
Testbed: HD6970, AMD APP v2.4, and Ubuntu 11.04.
Performance numbers (global memory bandwidth, in GB/s) are listed below for three problem sizes:

         | 1024x1024 | 2048x2048 | 4096x4096
VF=1     |     23.42 |     41.68 |     48.47
VF=2     |     33.37 |     64.92 |     90.06
VF=4     |     40.75 |     97.91 |    167.35
VF=8     |     45.92 |    133.43 |    267.21
VF=16    |     59.05 |    164.67 |    389.95
VF=32    |     55.49 |    182.12 |    531.92
VF=64    |     55.42 |    213.87 |    644.10
VF=128   |     68.04 |    221.97 |    723.39
As can be seen from the table, with VF=128 and a problem size of 4096x4096 we reach 723 GB/s, while the theoretical bandwidth of this device is around 170 GB/s. Is this also due to the L1/L2 caches?
__kernel void load_memory_VF(const __global float *i_matrix)
{
    int idx = get_global_id(0) * VF;
    float res_0;
    float res_1;
    ...
    float res_VF-1;
    res_0 = i_matrix[idx + 0];
    res_1 = i_matrix[idx + 1];
    ...
    res_VF-1 = i_matrix[idx + VF - 1];
}
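For concreteness, this is what the generated source looks like for VF=4 (a sketch; the real code is produced by expanding the pattern above, so the exact names may differ):

/* hypothetical instantiation of the pattern above for VF=4 */
__kernel void load_memory_4(const __global float *i_matrix)
{
    int idx = get_global_id(0) * 4;
    float res_0, res_1, res_2, res_3;
    res_0 = i_matrix[idx + 0];
    res_1 = i_matrix[idx + 1];
    res_2 = i_matrix[idx + 2];
    res_3 = i_matrix[idx + 3];
    /* note: the results are never stored anywhere */
}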
I suspect that the memory loads could be optimized away if there are no stores to global memory. The entire kernel would then be reduced to a no-op.
Originally posted by: dmeiser I suspect that the memory loads could be optimized away if there are no stores to global memory. The entire kernel would then be reduced to a no-op.
Yes, I think so. It is because of the aggressive compiler. :-)
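A sketch of how the loads could be kept alive (untested here; o_dummy is a hypothetical one-element output buffer): fold the loads into a value whose store the compiler cannot prove dead. The condition is never true in practice, so the kernel stays effectively load-only:

__kernel void load_memory_4(const __global float *i_matrix,
                            __global float *o_dummy)
{
    int idx = get_global_id(0) * 4;
    float sum = i_matrix[idx + 0] + i_matrix[idx + 1]
              + i_matrix[idx + 2] + i_matrix[idx + 3];
    /* the store depends on the loaded values, so the compiler must
       keep the loads even though the branch is never taken */
    if (sum == -1.0f)
        o_dummy[0] = sum;
}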
Did someone verify the performance numbers by executing the streaming copy kernel? Do you also get bandwidths above the theoretical global memory bandwidth?
Is there a possibility to turn off or bypass the caches to see whether the effect is cache-related?
About the timing: when does a kernel actually finish? Is it when the last byte is written to global memory, or when the last work-item completes its final instruction?
Actually, you're calling clFinish() right after you issue the kernel, right? If not, you're just timing how long it takes to put the kernel into the command queue, not its execution time.
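In other words, the host code should look roughly like this (a sketch using gettimeofday(); any host timer works):

#include <sys/time.h>

struct timeval t0, t1;
gettimeofday(&t0, NULL);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
clFinish(queue);  /* without this, only the enqueue is timed */
gettimeofday(&t1, NULL);
double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_usec - t0.tv_usec) * 1e-3;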
On the GPU, streaming copies cannot exceed global memory bandwidth, even with caching, as the bottleneck is how fast you can fill L1/L2. In this case, L1/L2 help prevent you from going to global memory more than once per cacheline, nothing more.
Make sure your timing is correct and that you are waiting for kernel execution to finish.
Jeff
Micah, your explanation is clear, but it doesn't explain why there is a peak at a 4 MB transfer. The algorithm you described would give a steady transfer speed starting from rather small buffers.
Honestly, I believe you just described how 140 GB/s is achieved, no less and no more.