Section 3.4 of the NVIDIA OpenCL Programming Guide v. 2.3 describes warp-level synchronization as a way to avoid local memory barriers:
Because a warp executes one common instruction at a time, threads within a warp
are implicitly synchronized and this can be used to omit calls to the barrier()
function for better performance.
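To make the pattern concrete, here is a minimal host-side sketch in plain C of what I understand the guide to mean, using the tail of a local-memory sum reduction as the example. It is not taken from the guide: the names emulate_warp_reduce and demo_sum are hypothetical, and WARP_SIZE = 32 is an assumption that matches NVIDIA warps (AMD wavefronts are commonly 64 wide). The outer loop stands in for the sequence of common instructions and the inner loop runs every lane through one instruction before any lane moves on, so no barrier is needed between steps.

```c
#include <assert.h>
#include <stddef.h>

#define WARP_SIZE 32 /* assumption: NVIDIA warp width; AMD wavefronts are commonly 64 */

/* Emulate the barrier-free tail of a local-memory sum reduction for one warp.
 * The outer loop is the sequence of common instructions; the inner loop runs
 * every active lane through that instruction before any lane moves on, which
 * is the lockstep behaviour the quoted passage relies on. */
static int emulate_warp_reduce(int scratch[WARP_SIZE])
{
    assert(scratch != NULL);
    for (int stride = WARP_SIZE / 2; stride > 0; stride /= 2) {
        for (int lane = 0; lane < stride; ++lane) {
            scratch[lane] += scratch[lane + stride];
        }
        /* A real kernel would need barrier(CLK_LOCAL_MEM_FENCE) here if the
         * lanes were NOT guaranteed to finish this step together. */
    }
    return scratch[0];
}

/* Hypothetical helper: fill the scratch buffer with 1..WARP_SIZE and reduce it. */
static int demo_sum(void)
{
    int scratch[WARP_SIZE];
    for (int i = 0; i < WARP_SIZE; ++i) {
        scratch[i] = i + 1;
    }
    return emulate_warp_reduce(scratch);
}
```

Calling demo_sum() returns 528, the sum of 1 through 32. On a device without the lockstep guarantee, each step of the outer loop would need an explicit barrier() after it.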
Is there a comparable reference for optimizing OpenCL kernels on AMD GPUs? From a quick search, the wavefront seems to be the AMD equivalent of the warp. Does it also allow implicit synchronization?
Is there an equivalent of the wavefront on the CPU device? I realize it has a substantially different architecture, but an equivalent would be useful for emulating GPU execution when debugging.
This is probably more of a Khronos question, but does anyone know whether this implicit synchronization capability is planned for the OpenCL spec at some point?