What do you use in CodeXL to measure your execution time?
CodeXL can profile in two different modes:
1) HW counters, which give you the time spent on the GPU and only on the GPU.
2) API trace, which also gives you the time spent in the driver/OS. In an API trace you can see when a specific command was issued on the host, when it was actually flushed to the GPU, and when it finished.
My bet is that you measured only the GPU execution time in CodeXL, while in bash you measure the time on the CPU side, so what you are actually seeing is the overhead of the driver/OS.
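To make the difference between the two measurements concrete, here is a rough sketch of how the lifetime of one command splits into phases. It assumes an OpenCL application, where the four timestamps would come from `clGetEventProfilingInfo` (`CL_PROFILING_COMMAND_QUEUED`, `_SUBMIT`, `_START`, `_END`). The function name `breakdown` and the sample values are made up for illustration; they are not taken from your trace.

```python
def breakdown(queued_ns, submit_ns, start_ns, end_ns):
    """Split one command's lifetime into phases, in milliseconds.

    Inputs are nanosecond timestamps, assumed to be the four OpenCL
    event profiling values (queued, submit, start, end).
    """
    ns_per_ms = 1e6
    return {
        # Time the command sat in the host-side queue before being flushed.
        "host_queue_ms": (submit_ns - queued_ns) / ns_per_ms,
        # Time between the flush and the GPU actually starting the command.
        "driver_wait_ms": (start_ns - submit_ns) / ns_per_ms,
        # Pure GPU execution time: what a counter-based profile reports.
        "gpu_exec_ms": (end_ns - start_ns) / ns_per_ms,
        # Everything from issue to completion: what the host perceives.
        "total_ms": (end_ns - queued_ns) / ns_per_ms,
    }


if __name__ == "__main__":
    # Hypothetical timestamps, chosen only to show the shape of the result.
    phases = breakdown(0, 1_000_000, 3_000_000, 3_097_000)
    for name, value in phases.items():
        print(f"{name}: {value:.3f}")
```

Only `gpu_exec_ms` corresponds to the GPU-only number; the other entries are the driver/OS part that a host-side timer also picks up.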
Thanks for your reply. I use the Application Timeline Trace to measure the program. The total elapsed time is 77 ms, and the kernel's duration is 0.097 ms.
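For scale, a quick back-of-the-envelope calculation with these two numbers (variable names are just for illustration) shows how small the kernel's share of the elapsed time is:

```python
total_ms = 77.0     # total elapsed time from the Application Timeline Trace
kernel_ms = 0.097   # kernel duration from the same trace

# Everything that is not kernel execution: setup, driver, OS, transfers, etc.
outside_kernel_ms = total_ms - kernel_ms
kernel_share = kernel_ms / total_ms

print(f"time outside the kernel: {outside_kernel_ms:.3f} ms")
print(f"kernel share of total:   {kernel_share:.2%}")
```

The trace alone does not say how the remaining time divides between driver, OS, and the rest of the host program; this only shows that almost none of it is the kernel itself.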