bandwidth of reading data from device memory

Discussion created by haibo031031 on Jun 27, 2011
Latest reply on Jun 29, 2011 by himanshu.gautam

Hi everyone, I have written a micro-benchmark to measure the bandwidth of reading data from global device memory in two ways. The two kernels are listed below (both kernels read the same amount of data from memory):


@kernel-1: each work item reads one element from the 2-D matrix


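// datatype and i_matrix(row, col, w) are defined elsewhere (not shown)
__kernel void load_memory_scalar(const __global datatype * i_matrix, const int w, const int h){
    int col = get_global_id(0);              // column handled by this work item
    int row = get_global_id(1);              // row handled by this work item
    datatype res = i_matrix(row, col, w);    // read a single element
}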



@kernel-2: each work item reads four consecutive elements from one row of the 2-D matrix


// VF is the number of elements read per work item (four here)
__kernel void load_memory_vector_row(const __global datatype * i_matrix, const int w, const int h){
    int col = get_global_id(0)*VF;           // first of the VF columns handled by this work item
    int row = get_global_id(1);              // row handled by this work item
    datatype res_1 = i_matrix(row, (col+0), w);
    datatype res_2 = i_matrix(row, (col+1), w);
    datatype res_3 = i_matrix(row, (col+2), w);
    datatype res_4 = i_matrix(row, (col+3), w);
}

The kernels run on a Radeon HD 5870 with APP SDK v2.4.

The measured bandwidths (GB/s) are as follows:

#wxh          #kernel-1     #kernel-2
512x512        4.130009      7.237200
512x1024       6.225119     11.587529
1024x1024      4.259094     17.757329
1024x2048      9.543336     12.656704
2048x2048     30.121027     77.387819
2048x4096     32.307398    127.237581
4096x4096     48.562823    160.348056
4096x8192     51.715581    178.052931
8192x8192     52.673680    194.422363
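
For reference, figures like these are presumably derived as bytes read divided by kernel execution time. The post does not show the host-side calculation, so the following is only a minimal C sketch of that arithmetic; the function name, the `elem_size` parameter, and the choice of 1 GB = 1e9 bytes are assumptions, not part of the original benchmark.

```c
#include <assert.h>
#include <stddef.h>

/* Effective read bandwidth in GB/s, taking 1 GB = 1e9 bytes (assumption).
 * w, h      : matrix dimensions in elements
 * elem_size : bytes per element (hypothetical; the post does not say
 *             what `datatype` is)
 * seconds   : measured kernel execution time */
double effective_bandwidth_gbs(size_t w, size_t h, size_t elem_size, double seconds)
{
    assert(seconds > 0.0);  /* a zero or negative time is a measurement error */
    double bytes = (double)w * (double)h * (double)elem_size;
    return bytes / seconds / 1e9;
}
```

For instance, reading a 4096x4096 matrix of 4-byte elements (about 67.1 MB) in 1 ms would correspond to roughly 67.1 GB/s.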

My question is why the bandwidth varies so significantly, and why it even exceeds the theoretical bandwidth of reading data from global memory. Can anybody explain the reasons?
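
To make the comparison with the theoretical bandwidth concrete, here is a small C sketch. The peak value is an assumption taken from the commonly quoted HD 5870 memory specification (256-bit bus at 4.8 Gbit/s per pin, i.e. 153.6 GB/s); it is not stated in the post, and the helper names are hypothetical.

```c
#include <assert.h>

/* Assumed peak global-memory bandwidth of the HD 5870 in GB/s:
 * 256-bit bus * 4.8 Gbit/s per pin / 8 bits per byte = 153.6 GB/s.
 * This figure is not from the post. */
#define HD5870_PEAK_GBS 153.6

/* Returns 1 if a measured figure is above the given peak, else 0. */
int exceeds_peak(double measured_gbs, double peak_gbs)
{
    return measured_gbs > peak_gbs;
}

/* Ratio of the kernel-2 figure to the kernel-1 figure for one matrix size. */
double speedup(double kernel1_gbs, double kernel2_gbs)
{
    assert(kernel1_gbs > 0.0);
    return kernel2_gbs / kernel1_gbs;
}
```

Under that assumed peak, the kernel-2 figures for 4096x4096, 4096x8192 and 8192x8192 (160.3, 178.1 and 194.4 GB/s) are above it, while all kernel-1 figures stay well below. For 8192x8192, kernel-2 reports about 3.7 times the bandwidth of kernel-1.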


Thanks in advance.