I wrote a simple benchmark to measure the bandwidth between the host and the device. I ran 10 trials for each data size, and the performance in the first trial is consistently poor, i.e. its bandwidth is lower than that of the other 9 trials. For example, here is one of the outputs when the data size is around 100MB:
(1) 2844.446506 (2) 5666.673999 (3) 5675.735765 (4) 5704.610900 (5) 5726.594426
(6) 5726.899885 (7) 5726.594426 (8) 5722.702677 (9) 5727.892852 (10) 5724.304521
The bandwidth in the first trial is only around half that of the other 9 trials. Is there any explanation for this?
--------------------------------------------------------
The code structure is illustrated as follows:
_clInit(); //create OCL context, build program, etc.
_clMalloc(); //malloc memory on the device
loop 10 //copy data from the device to the host 10 times
--begin
_clMemcpyD2H();
--end
_clRelease(); //release resources
---------------------------------------------------------
My testbed is as follows:
host: Intel Core i7 920;
device: AMD Radeon HD 5870;
with AMD APP SDK v2.4.