Well, it is clear that barriers are necessary in almost every kernel for synchronization purposes. But can barriers also have some impact on the wavefront scheduler, for example by making it execute the wavefronts of a work-group closer to each other around a barrier? Just speculating.
Another query is about the mem_fence function. It looks like a synchronization facility, but it is not blocking in nature. Is there any situation where it would be preferred over a barrier?
mem_fence(CLK_LOCAL_MEM_FENCE and/or CLK_GLOBAL_MEM_FENCE):
waits until all reads/writes to local and/or global memory made by the calling work-item prior to mem_fence() are visible to all threads in the work-group.

barrier(CLK_LOCAL_MEM_FENCE and/or CLK_GLOBAL_MEM_FENCE):
waits until all work-items in the work-group have reached this point and calls mem_fence(CLK_LOCAL_MEM_FENCE and/or CLK_GLOBAL_MEM_FENCE).
http://developer.download.nvidia.com/presentations/2009/SIGGRAPH/asia/3_OpenCL_Programming.pdf
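To make the difference concrete, here is a minimal sketch (the kernel name, buffer names, and the assumed work-group size of 64 are illustrative, not from the slides): barrier() blocks every work-item until the whole work-group has arrived, while mem_fence() only orders the calling work-item's own memory operations and makes nobody wait.

__kernel void sync_example(__global const float *in, __global float *out)
{
    __local float tile[64];   /* sketch assumes a work-group size of 64 */

    int lid = get_local_id(0);
    int gid = get_global_id(0);

    tile[lid] = in[gid];

    /* barrier: blocking. No work-item in the work-group proceeds past this
       point until all of them have reached it, and the local-memory writes
       above are then visible to the whole work-group. Needed here because
       each work-item reads a neighbour's element next. */
    barrier(CLK_LOCAL_MEM_FENCE);

    out[gid] = tile[(lid + 1) % 64];

    /* mem_fence: non-blocking. It only guarantees that the calling
       work-item's prior accesses (the global write above) become visible
       before any memory access it issues after the fence; no other
       work-item is stalled by it. */
    mem_fence(CLK_GLOBAL_MEM_FENCE);
}

So mem_fence would be preferred where you only need ordering of one work-item's own reads/writes, not a rendezvous of the whole work-group.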
But can barriers also have some impact on the wavefront scheduler, for example by making it execute the wavefronts of a work-group closer to each other around a barrier? Just speculating.
Yes, barriers can improve performance somewhat. Usually this is because the barrier constrains the timing relationship between the wavefronts of a work-group: they are forced to advance together. If the memory access stride is well designed, keeping the wavefronts together in this way can yield some performance improvement.
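As an illustration of that point (a hypothetical sketch; the kernel and its parameters are made up, and any benefit is hardware dependent), the barrier below carries no data dependency at all. It only keeps all wavefronts of the work-group on the same loop iteration, so their reads stay within the same contiguous slice of the input at roughly the same time.

__kernel void strided_sum(__global const float *in,
                          __global float *out,
                          int slices)
{
    int gid = get_global_id(0);
    int gsz = get_global_size(0);
    float acc = 0.0f;

    for (int s = 0; s < slices; ++s) {
        /* each iteration touches one contiguous slice of 'in' */
        acc += in[s * gsz + gid];

        /* no local data is shared; the barrier is here purely so the
           work-group's wavefronts advance through the slices together,
           which can improve locality of the global reads */
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    out[gid] = acc;
}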