I ran into a problem with my final kernel. It takes about 2 ms to execute if I do not write the result back to global memory. But if I do write it (which of course I have to, because I need the calculation result), the kernel time goes up to about 7.5 ms. The result is about 40 bytes per work item, and there are 640 work items in total. My question is: why is it so expensive to write the result back to global memory? Is there a better way to get the result back to the CPU?
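
To put numbers on it, here is a quick back-of-the-envelope sketch in Python. It only does arithmetic on the figures above, and the function names are made up for illustration. It assumes that all of the extra 5.5 ms goes into the write, which may not be what is really happening.

```python
# Figures from the post above; the rest is derived arithmetic.
BYTES_PER_WORK_ITEM = 40       # "about 40 bytes" of result per work item
WORK_ITEMS = 640
TIME_WITHOUT_WRITE_MS = 2.0    # kernel time when the result is not stored
TIME_WITH_WRITE_MS = 7.5       # kernel time when the result is stored


def total_output_bytes(bytes_per_item, items):
    """Total size of the result written to global memory."""
    return bytes_per_item * items


def implied_write_rate_mb_per_s(total_bytes, extra_ms):
    """Write rate in MB/s, ASSUMING the whole extra time is spent writing."""
    return total_bytes / (extra_ms / 1000.0) / 1e6


total = total_output_bytes(BYTES_PER_WORK_ITEM, WORK_ITEMS)
extra_ms = TIME_WITH_WRITE_MS - TIME_WITHOUT_WRITE_MS
rate = implied_write_rate_mb_per_s(total, extra_ms)

print(f"total output: {total} bytes")
print(f"extra time:   {extra_ms} ms")
print(f"implied rate: {rate:.2f} MB/s")
```

So the whole result is only about 25 KB, and the implied rate is just a few MB/s, which looks very small next to the bandwidth I would expect from global memory. That is why the 5.5 ms difference surprises me.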