lantonov
Journeyman III

Better explanation for mem_fence?

So I've read the Khronos docs over and over again and I still can't figure out when it is appropriate to use mem_fence. The explanation is somewhat imprecise: I understand that it tells the compiler not to reorder memory loads and stores across the fence, BUT:

1. Which loads and stores should be a concern? For example, it seems reasonable that the following could be problematic (Listing 1 - it's my code). I am loading from a global, modifying, then storing back. But, is it truly expected that the compiler could reorder these loads and stores so they could happen in the opposite order?

2. In the same listing, is it only the original global that is of concern (if at all), or is the second global a problem too? If so, then it seems like all of our code should be littered with mem_fences, since pretty much every kernel boils down to load from global, do stuff, store to global.

3. Aside from that, it is clear to me what mem_fence does. But what about its little cousins read_mem_fence and write_mem_fence? What exactly are they preventing the compiler from doing? Is read_mem_fence making sure that all reads happen in the order declared? What is an example of this being a problem? Ditto for the write_ function.

I'd welcome an explanation from somebody who understands them better than I.

I would love to see example code that has a problem that is solved with any of the mem_fence functions.

Thanks!

 

Listing 1:

float4 body = v_bodies[idx];
body.w = 0.0f;
...
v_prev_bodies[idx] = body; // save a copy in the vector of previous positions
v_bodies[idx] = body; // zero the .w component for correct distance calculation

3 Replies

A compiler cannot move stores across a load that it depends upon, but it can move non-dependent loads across stores and vice-versa.

You want to use read_mem_fence and write_mem_fence when you need to guarantee that loads or stores occur in a very specific order. This is usually the case when you are mapping your load/store patterns to a single architecture's memory layout, where any reordering by the compiler can hurt performance by pulling data into caches at the wrong time or by causing bank conflicts.
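As a rough illustration of the read-only and write-only variants, here is a minimal OpenCL C sketch. The kernel name, buffer names, and the arithmetic are hypothetical placeholders, and it assumes both buffers hold at least two elements per work-item. It only shows where the fences go, not that they pay off on any particular device.

```c
// Hypothetical kernel: ordered_pairs, src, and dst are illustrative names.
// Assumes src and dst each hold at least 2 * get_global_size(0) floats.
__kernel void ordered_pairs(__global const float *src, __global float *dst)
{
    const size_t i = get_global_id(0);

    float x = src[2 * i];                   // first load
    read_mem_fence(CLK_GLOBAL_MEM_FENCE);   // orders loads only
    float y = src[2 * i + 1];               // stays after the first load

    dst[2 * i] = x + y;                     // first store
    write_mem_fence(CLK_GLOBAL_MEM_FENCE);  // orders stores only
    dst[2 * i + 1] = x - y;                 // stays after the first store
}
```

Both fences take the same flags as mem_fence (CLK_LOCAL_MEM_FENCE, CLK_GLOBAL_MEM_FENCE, or both), and they order only the memory accesses of the work-item that calls them.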

Thanks, now it makes sense for the read_ and write_ functions.

For the regular mem_fence, it seems that the only way to get into a situation that requires it is to use some kind of casting or pointer magic to access memory. Otherwise the compiler should be able to track the dependencies. Is that correct?

If you had an example of some bad code, it would certainly help illustrate this...

Thanks!


It actually is quite simple. You have two pointers, a and b:

c = load a
d = load b
... do some math on d
store d at b + offset
... do some math on c
store c at a + offset

Without the mem_fence, the compiler is free to move the store to a before the load from b, so that the code looks like this:

c = load a
... do some math on c
store c at a + offset
d = load b
... do some math on d
store d at b + offset


That might not be what you want. The inverse is also true.
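To make that concrete, here is a minimal OpenCL C sketch of the first ordering with a fence added. The kernel name, the parameter names, and the stand-in math are hypothetical, and it assumes offset is at least the global work size, so that no work-item writes an element another one reads.

```c
// Hypothetical kernel: ordered_update and the arithmetic are placeholders.
// Assumes a and b each hold at least get_global_size(0) + offset floats
// and that offset >= get_global_size(0).
__kernel void ordered_update(__global float *a, __global float *b,
                             const uint offset)
{
    const size_t i = get_global_id(0);

    float c = a[i];           // c = load a
    float d = b[i];           // d = load b

    d = d * 2.0f;             // ... do some math on d
    b[i + offset] = d;        // store d at b + offset

    // This work-item's loads and stores above are committed before any
    // of its loads or stores below, so the store to a cannot be moved
    // ahead of the load from b.
    mem_fence(CLK_GLOBAL_MEM_FENCE);

    c = c + 1.0f;             // ... do some math on c
    a[i + offset] = c;        // store c at a + offset
}
```

Note that mem_fence only orders the calling work-item's own accesses. It does not synchronize work-items with each other; that is what barrier() is for.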