Your code is correct, but it does not measure GPU copy speed.
In your test the GPU has almost nothing to do; the only time-consuming part is on the CPU side, namely calling kernel_brookcopy(in, out); 500 times.
You can try moving this loop into the kernel and calling the kernel only once.
kernel void loop(float4 input<>, out float4 output<>, float itr) {
    float i;  // loop counter, declared as float to match itr
    for (i = 0.0f; i < itr; i += 1.0f) output = input + i;
}
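
If you want to check the kernel's result on the host, here is a minimal sketch in plain C of what the loop computes for a single float component. The name loop_reference is hypothetical and not part of Brook+; the behaviour for itr <= 0 is my assumption, since the kernel above leaves output unwritten in that case.

```c
/* Hypothetical host-side reference for one component of the loop kernel.
   Every iteration overwrites the previous value, so only the last
   assignment (input + (itr - 1)) is visible in the result. */
float loop_reference(float input, float itr)
{
    float output = input;  /* assumption: result when the loop never runs */
    float i;
    for (i = 0.0f; i < itr; i += 1.0f)
        output = input + i;
    return output;
}
```

For example, loop_reference(1.0f, 500.0f) returns 500.0f. Since only the final iteration matters, an optimizing compiler is free to collapse the loop, so the time you measure may not reflect 500 iterations of real work.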
Remember that this simple test will NOT provide you with much useful information about GPU speed.
Please look at the optimized_matmult sample.