Archives Discussions

haibo031031 · ‎06-27-2011

Hi everyone, I have written a micro-benchmark to test the bandwidth of reading data from global device memory in two ways. Their kernel functions are listed as follows (these two kernel read the same amout of data from memory):

kernel----1

@each work item read one element from the 2-d matrix

---------------------------------------------

__kernel void load_memory_scalar(const __global datatype * i_matrix, const int w, const int h){

    int col = get_global_id(0);
    int row = get_global_id(1);
    datatype res = i_matrix(row, col, w);
}

kernel----2

@each work item read four elements from the 2-d matrix

---------------------------------------------

__kernel void load_memory_vector_row(const __global datatype * i_matrix, const int w, const int h){

    int col = get_global_id(0)*VF;
    int row = get_global_id(1);
    datatype res_1 = i_matrix(row, (col+0), w);
    datatype res_2 = i_matrix(row, (col+1), w);
    datatype res_3 = i_matrix(row, (col+2), w);
    datatype res_4 = i_matrix(row, (col+3), w);
}

The kernel program runs on HD5870, and APP v2.4.

The final results/bandwidth (GB/s) are listed as follows:

#wxh #kernel-1 #kernel-2

512x512    4.130009    7.237200
512x1024    6.225119    11.587529
1024x1024    4.259094    17.757329
1024x2048    9.543336    12.656704
2048x2048    30.121027    77.387819
2048x4096    32.307398    127.237581
4096x4096    48.562823    160.348056
4096x8192    51.715581    178.052931
8192x8192    52.673680    194.422363

The question is why the bandwidth changes so significantly, and even overtake the theoretical bandwidth (reading data from global memory). Can anybody tell me the reasons?

Thanks in advance.

himanshu_gautam · ‎06-28-2011

Please refer to global memory bandwidth benchmark present in SDK samples.

Generally we get more than theoritical bandwidth if we are using cached reads. Are you using restrict pointers? or using the flag -fno-alias.

haibo031031 · ‎06-29-2011

Thanks for your reminding, and I will take a careful look at the SDK code.

Also, what does flag '-fno-alias' mean?

mikewolf_gkd · ‎06-29-2011

1) for linear address, hardware will do coalesce read ,this will add read bandwidth, for example, for 16 continuous threads in a wave, system will use only one baseaddress, and return 16 datas in a batch.

2) as himanshu.gautam said, I think, cache can add read bandwidth, every cu has some L1 cache, about 2k, and every channel has about 64 L2 cache.

haibo031031 · ‎06-29-2011

mikewolf_gfd, I think if we want to achive the theoretical bandwidth of reading data from global memory, we have to do as you said in 1).

himanshu_gautam · ‎06-29-2011

-fno-alias is a compiler flag for Opencl compiler which specifies that there is no aliasing between the the global pointers. Although a preferred approach is to use const restrict with the read only global pointers.

You will be able to find many scenarios in the SDK sample I mentioned. Ofcourse best results are obtained with caching for read only buffers and

with co-alesced reads/writes with non-read-only buffers.

haibo031031 · ‎06-29-2011

Thanks for your reminding, and I will take a careful look at the SDK code.

Also, what does flag '-fno-alias' mean?

Archives Discussions

bandwidth of reading data from device memory