alexaverbuch
Journeyman III

barrier vs mem_fence?

Hi,

What does barrier do beyond what mem_fence already does?

From what I understand, mem_fence allows kernel execution to continue past the fence UNTIL it reaches a load/store operation, at which point it blocks until all of the work-group's pre-mem_fence loads/stores have completed.

And barrier is even more "strict", as it blocks ALL execution (including but not limited to loads/stores) until all work-items in the work-group reach the barrier.

Is this correct?

Also, does write_mem_fence:

  1. Wait for all pre-mem_fence stores to complete before allowing future stores? OR
  2. Block on post-mem_fence stores until ALL pre-mem_fence operations have completed? OR
  3. Something else

Sorry for the onslaught of questions, and thanks for all the help so far.

Alex

22 Replies
MicahVillmow
Staff

mem_fence does not cause execution to stop at that point; it only ensures that memory operations are not reordered around the fence instruction. A barrier guarantees that all work-items in the work-group reach that point before any of them moves on to the next instruction.
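
To make the difference concrete, here is a minimal OpenCL C sketch (not from the original posts; the kernel name `rotate_in_group` and its arguments are made up for illustration). Each work-item reads a value that a different work-item in the same work-group wrote to local memory, so it needs barrier(): a mem_fence() alone would order this work-item's own accesses but would not wait for the neighbour to have executed its store.

```opencl
// Hypothetical kernel: every work-item outputs the value loaded by its
// right-hand neighbour within the same work-group.
// Assumes the global size equals the number of elements in `in` and `out`.
__kernel void rotate_in_group(__global const float *in,
                              __global float *out,
                              __local float *tile)
{
    size_t lid = get_local_id(0);    // index inside the work-group
    size_t gid = get_global_id(0);   // index across the whole launch
    size_t n   = get_local_size(0);  // work-items per work-group

    // Each work-item fills its own slot of the shared tile.
    tile[lid] = in[gid];

    // Every work-item in the group must arrive here, and the local stores
    // above must be visible, before anyone reads a neighbour's slot.
    barrier(CLK_LOCAL_MEM_FENCE);

    out[gid] = tile[(lid + 1) % n];
}
```

On the host, the `tile` argument would be sized to one float per work-item of the work-group, for example with clSetKernelArg using a size of local_size * sizeof(float) and a NULL pointer.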
alexaverbuch
Journeyman III

Thanks Micah,

So if I understand this...

All global stores/loads before a mem_fence(global) call are guaranteed to complete before any global stores/loads after the mem_fence(global) call can start.

Is this correct?

If so, it will be a welcome replacement for my barrier.

Alex

MicahVillmow
Staff

Yes, that is correct.
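
A small OpenCL C sketch of that ordering guarantee (illustrative only; the kernel name `publish` and the two buffers are hypothetical). The fence constrains the order of this one work-item's own global accesses and does not make it, or any other work-item, wait at that point:

```opencl
// Hypothetical kernel: one work-item writes a payload and then a flag.
__kernel void publish(__global int *data,
                      __global volatile int *ready)
{
    if (get_global_id(0) == 0) {
        data[0] = 42;

        // Orders this work-item's global accesses: the store to data[0]
        // is committed before the store to ready[0] that follows.
        // Execution does not stop here and no other work-item is involved.
        mem_fence(CLK_GLOBAL_MEM_FENCE);

        ready[0] = 1;
    }
}
```

Note that this does not by itself give another work-item a safe way to wait on `ready`; when work-items in a group need to rendezvous, barrier() is still the tool.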
edward_yang
Journeyman III

Thanks for the discussion. I also have a few more questions.

1. In AMD's "porting from CUDA" page, it is said that barrier() corresponds to CUDA __syncthreads() while mem_fence() corresponds to __threadfence(). Is this a precise equivalence, or just "roughly" comparable?

2. Is it true that calling mem_fence() on global memory (CLK_GLOBAL_MEM_FENCE) will ensure load/store ordering across all work-items in all work-groups? In other words, does a global mem_fence provide a mechanism for communication across work-groups?

3. On p.199 of OpenCL spec 1.0.48, it is said that the "barrier function also queues a memory fence (reads and writes) to ensure correct ordering of memory operations." If the barrier is called with CLK_GLOBAL_MEM_FENCE, does it also synchronize the memory operations across work-groups?

Thanks in advance!

EDIT:

In particular, it is possible that a barrier will synchronize read/write to global memory only within a work-group. From your explanation above, it appears that a mem_fence to global memory will guarantee ordering across work-groups. Is this how they (barrier vs mem_fence) differ with respect to memory operations?

Thanks again!

nou
Exemplar

No, barrier and mem_fence synchronize only within one work-group.

edward_yang
Journeyman III

Originally posted by: nou

No, barrier and mem_fence synchronize only within one work-group.

Thanks for the quick reply.

But according to CUDA 2.2, __threadfence() "waits until all global memory accesses made by the calling thread prior to __threadfence() are visible to all threads in the device for global memory ..."

That's why I asked whether the correspondence of OpenCL mem_fence() to CUDA __threadfence() is a precise one.

nou
Exemplar

OCL spec 3.3.1:

Global memory is consistent across work-items in a single work-group at a work-group barrier, but there are no guarantees of memory consistency between different work-groups executing a kernel.

It is possible that NVIDIA translates barrier() as __syncthreads() and mem_fence() as __threadfence().

edward_yang
Journeyman III

Originally posted by: nou

OCL spec 3.3.1: Global memory is consistent across work-items in a single work-group at a work-group barrier, but there are no guarantees of memory consistency between different work-groups executing a kernel.

It is possible that NVIDIA translates barrier() as __syncthreads() and mem_fence() as __threadfence().

Again, thanks for your reply. I apologize for my insistence, but this fine point is important for the correctness of some kernels when translating from CUDA to OpenCL.

So can we say this:

* In OpenCL 1.0, mem_fence() affects only work-items in the same work-group, even for reads/writes to global memory.

* In CUDA 2.2, __threadfence() affects work-items across the entire device for reads/writes to global memory.

If that's the case, then mem_fence() and __threadfence() are semantically different. Is it then still possible to translate a CUDA program that uses __threadfence() to OpenCL?

nou
Exemplar

No. In OpenCL, global synchronization happens at the kernel-launch level, so you must run multiple kernels. I think it is similar in CUDA, where the compiler automatically splits the kernel at __threadfence(), using it as a boundary.
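
A sketch of that multi-kernel approach in OpenCL C, assuming a sum reduction as the workload (the kernel names `partial_sums` and `final_sum`, their arguments, and the power-of-two work-group size are assumptions for illustration, not from the thread). The first kernel synchronizes only inside each work-group with barrier(); the hand-off between work-groups happens at the kernel boundary:

```opencl
// Kernel 1 (hypothetical): each work-group reduces its slice of `in`
// and writes one partial sum. Assumes the work-group size is a power
// of two and the global size equals the length of `in`.
__kernel void partial_sums(__global const float *in,
                           __global float *partial,
                           __local float *scratch)
{
    size_t lid = get_local_id(0);

    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction in local memory; every work-item executes every
    // barrier, only the active half performs the addition.
    for (size_t stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
        if (lid < stride) {
            scratch[lid] += scratch[lid + stride];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0) {
        partial[get_group_id(0)] = scratch[0];
    }
}

// Kernel 2 (hypothetical): launched after kernel 1 has finished, so all
// partial sums written by every work-group are visible here.
__kernel void final_sum(__global const float *partial,
                        uint num_groups,
                        __global float *result)
{
    if (get_global_id(0) == 0) {
        float total = 0.0f;
        for (uint i = 0; i < num_groups; ++i) {
            total += partial[i];
        }
        *result = total;
    }
}
```

Enqueuing `partial_sums` and then `final_sum` on the same in-order command queue is what provides the device-wide synchronization point: the second kernel starts only after the first has finished, so it sees every work-group's entry in `partial`.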
