The CodeXL GPU Profiler / APP Profiler doesn't change your program's behavior.
When you use zero copy (the map/unmap commands take zero time), the profiler can't display a data-transfer bar on the timeline. Unlike GPU activities, this kind of memory access can't be bracketed with start and end timestamps. Kernel execution and memory transfer still overlap under the hood.
To visualize the overlap, you can use CLPerfMarker: add a Begin marker and an End marker around the host (CPU) code that performs the memory transfer.
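Below is a minimal sketch of pairing Begin and End around the host code that writes through a zero-copy pointer. It assumes C++ host code. The names ScopedPerfMarker, fill_mapped, log_begin and log_end are hypothetical and exist only for illustration. In a real build you would pass thin adapters that call clBeginPerfMarkerAMD and clEndPerfMarkerAMD. Check their exact signatures in CLPerfMarker.h, along with any initialization the library requires. The mapped pointer would come from clEnqueueMapBuffer. Here, recording stand-ins are used so the sketch runs without the profiler library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Marker callback types. Assumption: begin takes a marker name and a group
// name, end takes nothing; adapt these to the real CLPerfMarker.h signatures.
using BeginMarkerFn = int (*)(const char* markerName, const char* groupName);
using EndMarkerFn = int (*)();

// RAII guard: Begin in the constructor, End in the destructor, so each Begin
// is matched by an End even on early return or exception.
class ScopedPerfMarker {
public:
    ScopedPerfMarker(BeginMarkerFn begin, EndMarkerFn end,
                     const char* name, const char* group)
        : end_(end) {
        begin(name, group);
    }
    ~ScopedPerfMarker() { end_(); }
    ScopedPerfMarker(const ScopedPerfMarker&) = delete;
    ScopedPerfMarker& operator=(const ScopedPerfMarker&) = delete;

private:
    EndMarkerFn end_;
};

// Host-side write through a zero-copy pointer. In a real program `mapped`
// would come from clEnqueueMapBuffer (for example on a buffer created with
// CL_MEM_ALLOC_HOST_PTR); here it is ordinary host memory.
void fill_mapped(float* mapped, const float* src, std::size_t n,
                 BeginMarkerFn begin, EndMarkerFn end) {
    assert(mapped != nullptr && src != nullptr);
    // With real marker calls, the work in this scope would appear as one
    // marker block on the profiler timeline.
    ScopedPerfMarker marker(begin, end, "HostWriteToMappedBuffer", "ZeroCopy");
    std::memcpy(mapped, src, n * sizeof(float));
}

// Hypothetical stand-ins that only record the calls, so the sketch runs
// without the profiler library.
std::vector<std::string> g_marker_log;

int log_begin(const char* name, const char* group) {
    g_marker_log.push_back(std::string("begin ") + group + "/" + name);
    return 0;
}

int log_end() {
    g_marker_log.push_back("end");
    return 0;
}
```

With the stand-ins, calling `fill_mapped(dst, src, n, log_begin, log_end)` logs `begin ZeroCopy/HostWriteToMappedBuffer` followed by `end`. The RAII guard is a design choice: an early return inside the marked region can't leave a Begin without its matching End.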
Another kind of overlap between kernel execution and memory transfer goes through the DMA engine. The OpenCL runtime can't time DMA transfers. As a result, if profiling is enabled on the command queue (CL_QUEUE_PROFILING_ENABLE), the DMA engine is disabled and the overlap won't happen. This is runtime behavior and has nothing to do with the profiler.