

haibo031031
Journeyman III

bandwidth of reading data from device memory

Hi everyone, I have written a micro-benchmark to test the bandwidth of reading data from global device memory in two ways. The kernel functions are listed below (both kernels read the same amount of data from memory):

kernel----1

@each work-item reads one element from the 2-D matrix

---------------------------------------------

__kernel void load_memory_scalar(const __global datatype * i_matrix,
                                 const int w, const int h)
{
    int col = get_global_id(0);
    int row = get_global_id(1);
    /* row-major load of element (row, col); the value is deliberately not stored */
    datatype res = i_matrix[row * w + col];
}

 

kernel----2

@each work-item reads four elements from the 2-D matrix

---------------------------------------------

__kernel void load_memory_vector_row(const __global datatype * i_matrix,
                                     const int w, const int h)
{
    int col = get_global_id(0) * VF;   /* VF: elements read per work-item (4 here) */
    int row = get_global_id(1);
    /* four consecutive row-major elements of the same row; the values are not stored */
    datatype res_1 = i_matrix[row * w + (col + 0)];
    datatype res_2 = i_matrix[row * w + (col + 1)];
    datatype res_3 = i_matrix[row * w + (col + 2)];
    datatype res_4 = i_matrix[row * w + (col + 3)];
}

The kernels run on an HD 5870 with APP SDK v2.4.

The final results (bandwidth in GB/s) are listed as follows:

w x h          kernel-1 (GB/s)   kernel-2 (GB/s)
512 x 512          4.130009          7.237200
512 x 1024         6.225119         11.587529
1024 x 1024        4.259094         17.757329
1024 x 2048        9.543336         12.656704
2048 x 2048       30.121027         77.387819
2048 x 4096       32.307398        127.237581
4096 x 4096       48.562823        160.348056
4096 x 8192       51.715581        178.052931
8192 x 8192       52.673680        194.422363
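
For reference, bandwidth here means bytes read divided by kernel execution time. A minimal sketch of that kind of measurement using OpenCL event profiling (illustrative only, not my exact host code; it assumes the queue was created with CL_QUEUE_PROFILING_ENABLE, that queue, kernel, global_size and local_size already exist, and that datatype is float):

/* needs <CL/cl.h> and <stdio.h> */
cl_event evt;
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_size, local_size, 0, NULL, &evt);
clWaitForEvents(1, &evt);

cl_ulong t_start, t_end;   /* profiling counters are reported in nanoseconds */
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t_start), &t_start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,   sizeof(t_end),   &t_end,   NULL);

double seconds = (double)(t_end - t_start) * 1e-9;
double bytes   = (double)w * (double)h * sizeof(float);  /* every element is read once */
printf("bandwidth = %f GB/s\n", bytes / seconds / 1e9);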

The question is why the bandwidth varies so much, and why it even exceeds the theoretical bandwidth for reading data from global memory. Can anybody tell me the reasons?

 

Thanks in advance.

0 Likes
6 Replies
himanshu_gautam
Grandmaster

Please refer to the global memory bandwidth benchmark present in the SDK samples.

Generally we get more than the theoretical bandwidth when the reads are cached. Are you using restrict pointers, or the flag -fno-alias?
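
For example (a sketch against your first kernel; float stands in here for your datatype):

__kernel void load_memory_scalar(const __global float * restrict i_matrix,
                                 const int w, const int h)
{
    int col = get_global_id(0);
    int row = get_global_id(1);
    /* const + restrict tells the compiler the buffer is read-only and not
       aliased, so the loads can go through the cached read path */
    float res = i_matrix[row * w + col];
}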

0 Likes

Thanks for the reminder; I will take a careful look at the SDK code.

Also, what does the flag '-fno-alias' mean?

0 Likes
mikewolf_gkd
Journeyman III

1) For linear addresses the hardware coalesces the reads, which increases read bandwidth. For example, for 16 consecutive work-items in a wavefront, the memory system uses only one base address and returns the 16 values in a single batch (see the sketch below).

2) As himanshu.gautam said, I think caching can also increase read bandwidth: every CU has some L1 cache, about 2 KB, and every channel has about 64 KB of L2 cache.
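
A small illustration of 1) (my own example, not taken from the SDK sample):

/* coalesced: work-item i reads element i, so consecutive work-items in a
   wavefront touch consecutive addresses and the reads are combined */
__kernel void read_coalesced(const __global float *in, __global float *out)
{
    int gid = get_global_id(0);
    out[gid] = in[gid];
}

/* strided: consecutive work-items read addresses `stride` elements apart,
   so the hardware cannot combine them into one batch */
__kernel void read_strided(const __global float *in, __global float *out, const int stride)
{
    int gid = get_global_id(0);
    out[gid] = in[gid * stride];
}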

0 Likes

mikewolf_gkd, I think that if we want to achieve the theoretical bandwidth of reading data from global memory, we have to do as you said in 1).

0 Likes

-fno-alias is a flag for the OpenCL compiler which specifies that there is no aliasing between the global pointers. A preferred approach, though, is to use const restrict on the read-only global pointers.

You will find many such scenarios in the SDK sample I mentioned. Of course, the best results are obtained with caching for read-only buffers and with coalesced reads/writes for buffers that are not read-only.
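
On the host side the flag is just passed as a build option (sketch; program and device are assumed to be already created):

/* let the OpenCL compiler assume the __global pointers in this program do not alias */
cl_int err = clBuildProgram(program, 1, &device, "-fno-alias", NULL, NULL);

The kernel-side equivalent is the const ... restrict qualification shown earlier in the thread.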

0 Likes