Hi everyone, I have written a micro-benchmark to test the bandwidth of reading data from global device memory in two ways. Their kernel functions are listed as follows (these two kernel read the same amout of data from memory):
kernel----1
@each work item read one element from the 2-d matrix
---------------------------------------------
__kernel void load_memory_scalar(const __global datatype * i_matrix, const int w, const int h){
int col = get_global_id(0);
int row = get_global_id(1);
datatype res = i_matrix(row, col, w);
}
kernel----2
@each work item read four elements from the 2-d matrix
---------------------------------------------
__kernel void load_memory_vector_row(const __global datatype * i_matrix, const int w, const int h){
int col = get_global_id(0)*VF;
int row = get_global_id(1);
datatype res_1 = i_matrix(row, (col+0), w);
datatype res_2 = i_matrix(row, (col+1), w);
datatype res_3 = i_matrix(row, (col+2), w);
datatype res_4 = i_matrix(row, (col+3), w);
}
The kernel program runs on HD5870, and APP v2.4.
The final results/bandwidth (GB/s) are listed as follows:
#wxh #kernel-1 #kernel-2
512x512 4.130009 7.237200
512x1024 6.225119 11.587529
1024x1024 4.259094 17.757329
1024x2048 9.543336 12.656704
2048x2048 30.121027 77.387819
2048x4096 32.307398 127.237581
4096x4096 48.562823 160.348056
4096x8192 51.715581 178.052931
8192x8192 52.673680 194.422363
The question is why the bandwidth changes so significantly, and even overtake the theoretical bandwidth (reading data from global memory). Can anybody tell me the reasons?
Thanks in advance.
Please refer to global memory bandwidth benchmark present in SDK samples.
Generally we get more than theoritical bandwidth if we are using cached reads. Are you using restrict pointers? or using the flag -fno-alias.
Thanks for your reminding, and I will take a careful look at the SDK code.
Also, what does flag '-fno-alias' mean?
1) for linear address, hardware will do coalesce read ,this will add read bandwidth, for example, for 16 continuous threads in a wave, system will use only one baseaddress, and return 16 datas in a batch.
2) as himanshu.gautam said, I think, cache can add read bandwidth, every cu has some L1 cache, about 2k, and every channel has about 64 L2 cache.
mikewolf_gfd, I think if we want to achive the theoretical bandwidth of reading data from global memory, we have to do as you said in 1).
-fno-alias is a compiler flag for Opencl compiler which specifies that there is no aliasing between the the global pointers. Although a preferred approach is to use const restrict with the read only global pointers.
You will be able to find many scenarios in the SDK sample I mentioned. Ofcourse best results are obtained with caching for read only buffers and
with co-alesced reads/writes with non-read-only buffers.
Thanks for your reminding, and I will take a careful look at the SDK code.
Also, what does flag '-fno-alias' mean?