When I read the dialogue and code snippet above, I was inclined to ask: in your kernel, do you have write(s) to global memory before the barrier(CLK_GLOBAL_MEM_FENCE) and then possibly attempt to re-use the written values within the same kernel by performing read(s) after the barrier?
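For concreteness, the pattern the question is probing looks something like this (a hypothetical OpenCL C sketch, not code from this thread; it would need to be built and launched by a host program against a real device):

```c
// Hypothetical kernel: a write to global memory before
// barrier(CLK_GLOBAL_MEM_FENCE), followed by a read of a value possibly
// written by another work-item after the barrier. Note that barrier()
// only orders memory operations among work-items of the SAME work-group;
// it does not synchronize across work-groups.
__kernel void scale_then_shift(__global float *buf)
{
    size_t gid = get_global_id(0);
    size_t gsz = get_global_size(0);

    buf[gid] = buf[gid] * 2.0f;        /* write to global memory        */

    barrier(CLK_GLOBAL_MEM_FENCE);     /* order writes within the group */

    /* read-after-write: safe only if the neighbouring work-item
       belongs to the same work-group as this one */
    float neighbour = buf[(gid + 1) % gsz];
    buf[gid] += neighbour * 0.5f;
}
```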
Many thanks for your inquiry and suggestion. Three __kernels are repeatedly invoked as a "trio", and a read-after-write does not occur within any of them. Kernel 1 receives Host-CPU data via clEnqueueWriteBuffer. Kernel 1 delivers its intermediate results to Kernel 2 through a pointer to a distinct Global memory buffer; likewise, Kernel 2 delivers its intermediate results to Kernel 3 through its own distinct Global memory buffer. These two "bounce buffers" are created with CL_MEM_READ_WRITE and CL_MEM_HOST_NO_ACCESS. Each kernel enqueue produces a cl_event object, and that event gates the invocation of the subsequent kernel. Completion of the trio is governed by a clEnqueueBarrierWithWaitList (to force the GPU to complete all queued tasks) followed by a clWaitForEvents (to block the Host-CPU until the trio has completed).
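The orchestration described above might be sketched on the host side roughly as follows (buffer/kernel names, sizes, and setup are my assumptions, not the poster's actual code; error checking is omitted for brevity, and this obviously requires an initialized OpenCL queue and built kernels to run):

```c
/* Hypothetical sketch of the "trio" with event-gated kernel chaining. */
cl_event ev1, ev2, ev3, barrier_ev;

/* Host data into kernel 1's input buffer */
clEnqueueWriteBuffer(queue, in_buf, CL_FALSE, 0, nbytes, host_src,
                     0, NULL, NULL);

/* Kernel 1 writes bounce buffer A */
clEnqueueNDRangeKernel(queue, k1, 1, NULL, &gws, NULL, 0, NULL, &ev1);
/* Kernel 2 is gated on kernel 1's event; reads A, writes bounce buffer B */
clEnqueueNDRangeKernel(queue, k2, 1, NULL, &gws, NULL, 1, &ev1, &ev2);
/* Kernel 3 is gated on kernel 2's event; reads B */
clEnqueueNDRangeKernel(queue, k3, 1, NULL, &gws, NULL, 1, &ev2, &ev3);

/* Fence the queue behind the whole trio, then block the host on it */
clEnqueueBarrierWithWaitList(queue, 3, (cl_event[]){ev1, ev2, ev3},
                             &barrier_ev);
clWaitForEvents(1, &barrier_ev);
```

Chaining each enqueue on the previous kernel's event is what provides the cross-kernel ordering here; within any one kernel no location is read after being written, so no intra-kernel fence is relied upon.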
At this point I'm wondering whether there's an unaccounted-for delay after the completion of the aforementioned process governors while the GPU writes results back to its Global memory (i.e. a delay following the issuance of CL_SUCCESS at the completion of clEnqueueBarrierWithWaitList and clWaitForEvents). In other words, it's assumed that those governors' CL_SUCCESS status (and their corresponding cl_event objects' CL_COMPLETE status) account for the completion of all __kernel memory-writes to Global memory (i.e. that completion is not delayed by a Global cache's eventual write-back, with or without write-combining). Further, it's assumed that any memory-write invalidates the corresponding Global memory location (or cache line) so that a subsequent memory-read of that same location by a subsequent __kernel sees the updated value (a process akin to the MESI [Modified, Exclusive, Shared, Invalid] cache-coherence protocol). As an aside, I remember (from years ago) VxWorks' "virtual = physical" memory mapping, which required a memory-read immediately following any memory-write to force cache-to-memory consistency between the Host processor's write-combining write-back cache and the Host's SDRAM (Global) memory. Maybe something similar is going on here?
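For what it's worth, the assumption about CL_COMPLETE can be checked explicitly. Per the OpenCL specification, once a command's event reaches CL_COMPLETE the command has finished executing, and its memory side effects are visible to subsequently enqueued commands; any device-side cache write-back is the runtime's responsibility, not the application's. A small sketch (variable names assumed, continuing from a hypothetical `barrier_ev`):

```c
/* After clWaitForEvents returns CL_SUCCESS, confirm the barrier
   command's execution status directly via clGetEventInfo. */
cl_int status;
clGetEventInfo(barrier_ev, CL_EVENT_COMMAND_EXECUTION_STATUS,
               sizeof(status), &status, NULL);
if (status == CL_COMPLETE) {
    /* The trio's writes to Global memory are complete and visible to
       subsequently enqueued commands; no extra "flush read" is needed
       at the API level. */
}
```

If a stale-read symptom persists despite CL_COMPLETE, that would point at something other than event semantics, e.g. a driver bug or an out-of-order queue enqueued without the intended wait-list dependencies.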