How do you find the slowest part? What size of memory are you reading? Are you running it in iterations?
The API does not seem to be using any cl_events. (EDITED)
Your code looks reasonable and should give good read performance.
I measured the speed by calling clFinish() before and after the OpenCL code.
The memory size I was testing is 32 MB.
Yes, I am running iterations, 1-2000.
I tried the blocking with events, but it did not help.
For me, clEnqueueMapBuffer seems to block for 5.5 ms even though it should not. Is there any way to keep it from blocking, or from copying? Do you have sample code where it does not block?
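As a rough sanity check on those numbers, the effective transfer rate implied by the map time can be worked out with a small helper. This is only a sketch: the function name effective_gb_per_s is hypothetical, and it assumes "32 MB" means 32 MiB and that the full 5.5 ms is spent moving the whole buffer.

```c
#include <assert.h>

/* Hypothetical helper (not part of OpenCL): effective transfer rate
 * in GB/s (1 GB = 1e9 bytes) for `bytes` moved in `seconds`. */
static double effective_gb_per_s(double bytes, double seconds)
{
    assert(seconds > 0.0); /* a zero or negative duration is a measurement error */
    return bytes / seconds / 1e9;
}
```

With the figures above, effective_gb_per_s(32.0 * 1024 * 1024, 5.5e-3) comes out at about 6.1 GB/s. That is in the range commonly seen for host-device transfers over PCIe, which would be consistent with the map copying the buffer rather than exposing it without a copy. This is an inference from the timing, not a measurement of what the driver does.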
You should check AMD APP SDK Samples for that.
You can also share your code here (attach it as a zip file) so that other developers can point out bugs in it.
Also mention details about your setup: CPU, GPU, Driver, SDK, OS.