
alexaverbuch
Journeyman III

barrier vs mem_fence?

Hi,

What does barrier do beyond what mem_fence already does?

From what I understand, it looks like mem_fence allows the kernel execution to continue beyond the mem_fence UNTIL it reaches a load/store operation... at which point it blocks until all pre-mem_fence work-group loads/stores have completed, before continuing.

And barrier is even more "strict", as it blocks ALL execution (including but not limited to loads/stores) until all work-items in the work-group reach the barrier.

Is this correct?

Also, does write_mem_fence:

  1. Wait for all pre-mem_fence stores to complete before allowing future stores? OR
  2. Block on post-mem_fence stores until ALL pre-mem_fence operations have completed? OR
  3. Something else?

Sorry for the onslaught of questions, and thanks for all the help so far.

Alex


mem_fence does not cause execution to stop at that point; it only guarantees that memory operations will not get reordered around the fence instruction. A barrier guarantees that all work-items reach that point before any work-item moves on to the next instruction.
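
To make that concrete, here is a minimal OpenCL C sketch (the kernel and its arguments are made up for illustration):

    __kernel void illustrate(__global int *data, __local int *scratch)
    {
        int lid = get_local_id(0);

        scratch[lid] = data[get_global_id(0)];

        // mem_fence: this work-item's local-memory operations are not
        // reordered across this point, but nobody waits for anybody here;
        // another work-item may not have executed its store yet.
        mem_fence(CLK_LOCAL_MEM_FENCE);

        // barrier: every work-item in the work-group arrives here before
        // any continues, and it queues the memory fence as well. Only
        // after this is it safe to read a neighbour's slot.
        barrier(CLK_LOCAL_MEM_FENCE);

        data[get_global_id(0)] = scratch[(lid + 1) % get_local_size(0)];
    }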

Thanks Micah,

So if I understand this...

All global stores/loads before a mem_fence(global) call are guaranteed to complete before any global stores/loads after the mem_fence(global) call can start.

Is this correct?

If so, it will be a welcome replacement for my barrier.

Alex


Yes, that is correct.

Thanks for the discussion. I also have a few more questions.

1. In AMD's "porting from CUDA" page, it is said that barrier() corresponds to CUDA __syncthreads() while mem_fence() corresponds to __threadfence(). Is this a precise equivalence, or just "roughly" comparable?

2. Is it true that calling mem_fence() on global memory (CLK_GLOBAL_MEM_FENCE) will ensure load/store ordering across all work-items in all work-groups? In other words, does a global mem_fence provide a mechanism for communication across work-groups?

3. On p.199 of OpenCL spec 1.0.48, it is said that the "barrier function also queues a memory fence (reads and writes) to ensure correct ordering of memory operations." If the barrier is called with CLK_GLOBAL_MEM_FENCE, does it also synchronize the memory operations across work-groups?

Thanks in advance!

EDIT:

In particular, it is possible that a barrier synchronizes reads/writes to global memory only within a work-group. From your explanation above, it appears that a mem_fence on global memory will guarantee ordering across work-groups. Is this how they (barrier vs mem_fence) differ with respect to memory operations?

Thanks again!


No, barrier and mem_fence synchronize only within one work-group.


Originally posted by: nou No, barrier and mem_fence synchronize only within one work-group.

Thanks for the quick reply.

But according to CUDA 2.2, __threadfence() "waits until all global memory accesses made by the calling thread prior to __threadfence() are visible to all threads in the device for global memory ..."

That's why I asked whether the correspondence of OpenCL mem_fence() to CUDA __threadfence() is a precise one.


OCL spec 3.3.1

Global memory is consistent across work-items in a single work-group at a work-group barrier, but there are no guarantees of memory consistency between different work-groups executing a kernel.

It is possible that NVIDIA translates barrier() as __syncthreads() and mem_fence() as __threadfence().


Originally posted by: nou OCL spec 3.3.1

Global memory is consistent across work-items in a single work-group at a work-group barrier, but there are no guarantees of memory consistency between different work-groups executing a kernel.

It is possible that NVIDIA translates barrier() as __syncthreads() and mem_fence() as __threadfence().

Again, thanks for your reply. I apologize for my insistence, but this fine point is important for the correctness of some kernels when translating from CUDA to OpenCL.

So can we say this:

* In OpenCL 1.0, mem_fence() affects only work-items in the same work-group, even when reading/writing global memory.

* In CUDA 2.2, __threadfence() affects threads across the entire device when reading/writing global memory.

If that's the case, then mem_fence() and __threadfence() are semantically different; is it still possible to translate a CUDA program with __threadfence() to OpenCL?


No. In OpenCL, global synchronization happens at the kernel-launch level, so you must run multiple kernels. I think it is similar in CUDA, where the compiler automatically breaks the kernel at __threadfence() as a boundary.
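
In host-code terms that looks something like the sketch below (queue, kernels, and sizes are assumed to already be set up; the kernel names pass1/pass2 are made up):

    #include <CL/cl.h>

    /* Cross-work-group ordering in OpenCL 1.0: split the work into two
     * kernel launches. pass1 and pass2 are hypothetical kernels that
     * communicate through the same global buffer. */
    void run_two_passes(cl_command_queue queue, cl_kernel pass1, cl_kernel pass2,
                        size_t gsize, size_t lsize)
    {
        /* On an in-order queue, pass2 does not start until pass1 has
         * completed, so pass1's global-memory writes are visible to
         * every work-group of pass2. */
        clEnqueueNDRangeKernel(queue, pass1, 1, NULL, &gsize, &lsize, 0, NULL, NULL);
        clEnqueueNDRangeKernel(queue, pass2, 1, NULL, &gsize, &lsize, 0, NULL, NULL);
        clFinish(queue);
    }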


Originally posted by: nou No. In OpenCL, global synchronization happens at the kernel-launch level, so you must run multiple kernels. I think it is similar in CUDA, where the compiler automatically breaks the kernel at __threadfence() as a boundary.


Are you agreeing with what I said, that mem_fence() in OpenCL and __threadfence() in CUDA are semantically different? Or are they semantically the same?

Please note that the programming language semantics should be independent of how the source code is compiled into binaries supported by the hardware.

Thanks again for your reply. I really appreciate it.


What matters is what the specification says. I do not know CUDA, so I can't say for sure, but it seems that __threadfence() is similar to mem_fence(), although __threadfence() may be stronger than mem_fence() in terms of scope.

Synchronization with barrier() and mem_fence() happens only between work-items in a work-group; global synchronization happens at the kernel-execution level. That is all I can say.


Originally posted by: edward_yang
Originally posted by: nou No. In OpenCL, global synchronization happens at the kernel-launch level, so you must run multiple kernels. I think it is similar in CUDA, where the compiler automatically breaks the kernel at __threadfence() as a boundary.

Are you agreeing with what I said, that mem_fence() in OpenCL and __threadfence() in CUDA are semantically different? Or are they semantically the same?

Please note that the programming language semantics should be independent of how the source code is compiled into binaries supported by the hardware.

Thanks again for your reply. I really appreciate it.

edward_yang,

You are right that mem_fence() and __threadfence() are semantically different. mem_fence(GLOBAL | LOCAL) and __threadfence_block() are semantically the same.
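
To make the claimed mapping concrete, my rough reading of the correspondence (mine, not an official table):

    // CUDA                      OpenCL 1.0
    // __syncthreads()       ~   barrier(...)
    // __threadfence_block() ~   mem_fence(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE)
    // __threadfence()       ~   no direct equivalent; device-wide visibility
    //                           requires splitting the kernel at that point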


Thank you.

Maybe AMD should change the description on the CUDA porting page? It is a bit misleading as it is now. Thanks.


Originally posted by: edward_yang Thank you.

Maybe AMD should change the description on the CUDA porting page? It is a bit misleading as it is now. Thanks.

Thanks for finding this. We have reported it to the doc writer.


Sorry, but I do not think so.

IMHO, the OpenCL spec does not indicate what scope mem_fence impacts; i.e., mem_fence just guarantees the memory-operation order of the calling thread, it does not block/sync other threads. So its visibility scope should be the same as the visibility of the memory space it reads/writes.

So can we say:

mem_fence(LOCAL) == __threadfence_block()

mem_fence(GLOBAL) == __threadfence() ?


That would be my understanding, yes. Fences provide no synchronisation. When it says "visible to all threads in the device" I read that the same way I read mem_fence(..GLOBAL..), which is:

Once the fence operation completes, any writes to global memory made prior to the fence by this thread are guaranteed to have committed to memory. Therefore any reads of that address initiated from this point on by any thread will read the new value.

What you have no control over is when other threads will read the value, which is where a fence differs from a barrier. As far as I understand, mem_fence(..LOCAL..) is actually a no-op when compiled for the AMD GPUs because LDS reads and writes commit instantly.
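
For what it's worth, the ordering half of that reading looks like this in kernel code (a hypothetical sketch; as discussed above, OpenCL 1.0 makes no inter-work-group consistency promise, so whether another work-group ever observes the flag is not guaranteed):

    /* Publish-then-flag: the fence orders this work-item's two global
     * stores, so any thread that does observe flag == 1 will also read
     * the new value of *data. */
    __kernel void publish(__global int *data, __global volatile int *flag)
    {
        if (get_global_id(0) == 0) {
            *data = 42;                             /* payload first   */
            write_mem_fence(CLK_GLOBAL_MEM_FENCE);  /* commit payload  */
            *flag = 1;                              /* then the flag   */
        }
    }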


What about mem_fence(GLOBAL)? Does it have cache coherence problems here?


Originally posted by: LeeHowes As far as I understand, mem_fence(..LOCAL..) is actually a no-op when compiled for the AMD GPUs because LDS reads and writes commit instantly.


If you use a workgroup size larger than 64 on Cypress, say, then you will see a GROUP_BARRIER in the ISA. Since this larger workgroup size spans hardware threads, the execution order of LDS reads and writes is no longer guaranteed amongst these hardware threads, so the barrier is required.


Are you saying the fence generates a barrier? If that's true then we're back to the assumption for consistency that a global fence should generate a global barrier. I don't think either is necessary in the definition of a memory fence.

I haven't checked what our implementation does, so I shall leave this to Micah to clarify as he's a compiler person.


Jawed,
That is a bug in SDK 2.1 and is fixed in the upcoming release; fence operations will no longer trigger barrier instructions.

As for a fence operation:
A fence operation instructs the compiler not to reorder any memory instructions around the fence instruction. There is no synchronization done, so at a mem_fence instruction there is no guarantee that any load/store from another work-item to either local or global memory is visible to the current work-item. The only guarantee of mem_fence is that loads/stores before the fence will be executed before loads/stores after the fence. Memory consistency in OpenCL is only guaranteed within a work-item, and only if that work-item is unique in its access of memory throughout the NDRange. The only exception is synchronization on the local address space by work-items in a work-group via the barrier instruction.
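
That local-plus-barrier exception is the one pattern you can rely on. For instance, a standard work-group reduction (a sketch, hypothetical kernel):

    __kernel void reduce(__global const float *in, __global float *out,
                         __local float *scratch)
    {
        int lid = get_local_id(0);
        scratch[lid] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);            /* everyone's load is visible */

        for (int s = get_local_size(0) / 2; s > 0; s >>= 1) {
            if (lid < s)
                scratch[lid] += scratch[lid + s];
            barrier(CLK_LOCAL_MEM_FENCE);        /* each round's stores visible */
        }

        if (lid == 0)
            out[get_group_id(0)] = scratch[0];
    }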

Originally posted by: MicahVillmow The only exception is synchronization on the local address space by work-items in a work-group via the barrier instruction.


Yes, that's all I was referring to, that local memory writes must invoke a barrier in ISA when the workgroup size is larger than the hardware thread size. This barrier isn't required when the workgroup size matches the hardware thread size, which is cool because in ISA it means a write to local in one instruction can be immediately followed by a read in the next instruction... So, that's effectively synch-less local memory.

I wasn't making a comment on fence in general, merely that local doesn't always "instantly commit" as Lee was suggesting.

I wasn't aware of the errant behaviour in SDK 2.1! I suppose that behaviour is required for Direct Compute, but I'm not sure.


According to the OpenCL spec, mem_fence should work with image load operations. In ATI OpenCL it doesn't; the CAL compiler can move those around ...
