So I've read the Khronos docs over and over again and I still can't figure out when it is the correct situation to use a mem_fence. The explanation is somewhat imprecise: I understand that it tells the compiler not to reorder memory loads and stores across the fence, BUT:
1. Which loads and stores should be a concern? For example, it seems reasonable that the following could be problematic (Listing 1 - it's my code). I am loading from a global, modifying, then storing back. But, is it truly expected that the compiler could reorder these loads and stores so they could happen in the opposite order?
2. In the same listing, is it only the original global that is of concern (if at all), or is the second global a problem too. If so, then it seems like all of our code should be littered with mem_fences, since every kernel pretty much boils down to load from global, do stuff, store to global.
3. Aside from that, it is clear to me what mem_fence does. But what about its little cousins read_mem_fence and write_mem_fence? What exactly are they preventing the compiler from doing? Is read_mem_fence making sure that all reads happen in the order declared? What is an example of this being a problem? Ditto for the write_ function.
I'd welcome an explanation from somebody who understands them better than I.
I would love to see example code that has a problem that is solved with any of the mem_fence functions.
Thanks!
Listing 1: float4 body = v_bodies[idx]; body.w = 0.0f; ... v_prev_bodies[idx] = body; // save a copy in the vector of previous positions v_bodies[idx] = body; // zero the .w component for correct distance calculation