6 Replies Latest reply on Jun 29, 2011 7:26 AM by himanshu.gautam

    Bandwidth of reading data from device memory


      Hi everyone, I have written a micro-benchmark that measures the bandwidth of reading data from global device memory in two ways. The two kernel functions are listed below (both kernels read the same amount of data from memory):


      @each work item reads one element from the 2-d matrix


      __kernel void load_memory_scalar(const __global datatype * i_matrix, const int w, const int h){
          int col = get_global_id(0);
          int row = get_global_id(1);
          datatype res = i_matrix(row, col, w);     



      @each work item reads four elements from the 2-d matrix


      __kernel void load_memory_vector_row(const __global datatype * i_matrix, const int w, const int h){
          int col = get_global_id(0)*VF;
          int row = get_global_id(1);
          datatype res_1 = i_matrix(row, (col+0), w);     
          datatype res_2 = i_matrix(row, (col+1), w);
          datatype res_3 = i_matrix(row, (col+2), w);
          datatype res_4 = i_matrix(row, (col+3), w);

      The kernels run on a Radeon HD 5870 with AMD APP SDK v2.4.

      The measured bandwidths (GB/s) are listed below:

      #wxh         #kernel-1     #kernel-2
      512x512       4.130009      7.237200
      512x1024      6.225119     11.587529
      1024x1024     4.259094     17.757329
      1024x2048     9.543336     12.656704
      2048x2048    30.121027     77.387819
      2048x4096    32.307398    127.237581
      4096x4096    48.562823    160.348056
      4096x8192    51.715581    178.052931
      8192x8192    52.673680    194.422363
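
      To make clear what a figure in GB/s means here, this is a minimal sketch in C of how such a number is usually derived: bytes read divided by kernel execution time. The function name, the element size, and the timing source are assumptions for illustration; the post does not say what `datatype` is or how the time was measured.

```c
#include <assert.h>
#include <stddef.h>

/* Effective read bandwidth in GB/s, with 1 GB taken as 1e9 bytes.
 * w, h      : matrix dimensions in elements
 * elem_size : size of one element in bytes (assumption: the post does
 *             not state what `datatype` is, e.g. 4 for float)
 * seconds   : kernel execution time (assumption: taken from some timer,
 *             e.g. OpenCL event profiling) */
double effective_bandwidth_gbs(size_t w, size_t h, size_t elem_size,
                               double seconds)
{
    assert(seconds > 0.0);  /* a zero or negative time is a measurement bug */
    double bytes = (double)w * (double)h * (double)elem_size;
    return bytes / seconds / 1e9;
}
```

      For example, a 1024x1024 matrix of 4-byte elements read in 1 ms would give roughly 4.19 GB/s. Very short kernel times make the result sensitive to launch overhead and timer resolution, which is one thing to rule out when small sizes show erratic numbers.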

      My question is why the bandwidth varies so significantly with the matrix size, and why it even exceeds the theoretical peak bandwidth of reading from global memory. Can anybody explain the reasons?
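
      For reference, the theoretical peak can be estimated from the memory data rate and the bus width. The sketch below is in C with hypothetical function names; the inputs used in the note after it (4.8 Gbit/s effective per pin, 256-bit bus) are the commonly quoted HD 5870 specifications and are an assumption here, not something measured by the benchmark.

```c
#include <assert.h>

/* Theoretical peak memory bandwidth in GB/s.
 * rate_gbps_per_pin : effective data rate per pin in Gbit/s
 *                     (assumption: 4.8 for the HD 5870's GDDR5)
 * bus_width_bits    : memory bus width in bits (assumption: 256) */
double peak_bandwidth_gbs(double rate_gbps_per_pin, unsigned bus_width_bits)
{
    assert(bus_width_bits > 0);
    /* bits per second across the whole bus, divided by 8 to get bytes */
    return rate_gbps_per_pin * (double)bus_width_bits / 8.0;
}

/* Returns 1 if a measured figure is physically implausible, i.e. above
 * the theoretical peak, and 0 otherwise. */
int exceeds_peak(double measured_gbs, double peak_gbs)
{
    return measured_gbs > peak_gbs;
}
```

      With those assumed inputs the estimate is about 153.6 GB/s, so the kernel-2 results from 4096x4096 upward (160 to 194 GB/s) lie above it, while every kernel-1 result stays below. A measurement above the peak suggests that not all of the counted bytes were actually fetched from global memory.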


      Thanks in advance.