Measuring only kernel performance in a multi-core CPU environment

I want to run OpenCL in a multi-core CPU environment.  Because OpenCL is architecture-independent, it should run there, and I did get it running.

However, I ran into trouble when evaluating performance and checking performance metrics.  I used CPU performance counters, but I cannot separate the time consumed by the OpenCL runtime from the time consumed by my own kernel.  I also cannot tell which of the two the cache misses belong to.

Is there any way to get data for the kernel only?