Hi,
What does barrier do beyond what mem_fence already does?
From what I understand, mem_fence allows kernel execution to continue beyond the mem_fence UNTIL it reaches a load/store operation... at which point it blocks until all pre-mem_fence work-group loads/stores have completed before continuing.
And barrier is even more "strict", as it blocks ALL execution (including but not limited to loads/stores) until all work-items in the work-group reach the barrier.
Is this correct?
Also, does write_mem_fence:
Sorry for the onslaught of questions, and thanks for all the help so far.
Alex
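The distinction being asked about can be sketched in a small OpenCL C kernel (illustrative only; the kernel and names are hypothetical, not from this thread):

```c
// Hypothetical OpenCL C sketch of barrier vs mem_fence.
__kernel void barrier_vs_fence(__global float *out, __local float *tmp)
{
    size_t lid = get_local_id(0);
    tmp[lid] = (float)lid;

    // barrier(): an execution barrier plus a memory fence. Every
    // work-item in the work-group must reach this point before any
    // may continue, so reading another work-item's store is safe.
    barrier(CLK_LOCAL_MEM_FENCE);
    float neighbour = tmp[lid ^ 1];   // assumes an even local size

    // mem_fence(): only orders THIS work-item's memory operations.
    // Other work-items are not held up, so a fence alone would NOT
    // have made the cross-work-item read above safe.
    out[get_global_id(0)] = neighbour;
    mem_fence(CLK_GLOBAL_MEM_FENCE);
    out[get_global_id(0)] += 1.0f;    // ordered after the store above
}
```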
Thanks Micah,
So if I understand this...
All global stores/loads before a mem_fence(global) call are guaranteed to complete before any global stores/loads after the mem_fence(global) call can start
Is this correct?
If so, it will be a welcome replacement for my barrier
Alex
Thanks for the discussion. I also have a few more questions.
1. On AMD's "porting from CUDA" page, it is said that barrier() corresponds to CUDA __syncthreads() while mem_fence() corresponds to __threadfence(). Is this a precise equivalence, or just "roughly" comparable?
2. Is it true that calling mem_fence() on global memory (CLK_GLOBAL_MEM_FENCE) will ensure load/store ordering across all work-items in all work-groups? In other words, does a global mem_fence provide a mechanism for communication across work-groups?
3. On p.199 of OpenCL spec 1.0.48, it is said that the "barrier function also queues a memory fence (reads and writes) to ensure correct ordering of memory operations." If the barrier is called with CLK_GLOBAL_MEM_FENCE, does it also synchronize the memory operations across work-groups?
Thanks in advance!
EDIT:
In particular, it is possible that a barrier will synchronize read/write to global memory only within a work-group. From your explanation above, it appears that a mem_fence to global memory will guarantee ordering across work-groups. Is this how they (barrier vs mem_fence) differ with respect to memory operations?
Thanks again!
No, barrier and mem_fence synchronize only within one work-group.
Originally posted by: nou No, barrier and mem_fence synchronize only within one work-group.
Thanks for the quick reply.
But according to CUDA 2.2, __threadfence() "waits until all global memory accesses made by the calling thread prior to __threadfence() are visible to all threads in the device for global memory ..."
That's why I asked whether the correspondence of OpenCL mem_fence() to CUDA __threadfence() is a precise one?
OCL spec 3.3.1:
Global memory is consistent across work-items in a single work-group at a work-group barrier, but there are no guarantees of memory consistency between different work-groups executing a kernel.
It is possible that NVIDIA translates barrier() as __syncthreads() and mem_fence() as __threadfence().
Originally posted by: nou OCL spec 3.3.1:
Global memory is consistent across work-items in a single work-group at a work-group barrier, but there are no guarantees of memory consistency between different work-groups executing a kernel.
It is possible that NVIDIA translates barrier() as __syncthreads() and mem_fence() as __threadfence().
Again, thanks for your reply. I apologize for my insistence, but this fine point is important for the correctness of some kernels when translating from CUDA to OpenCL.
So can we say this:
* In OpenCL 1.0, mem_fence() affects only work-items in the same work-group, even for reads/writes to global memory.
* In CUDA 2.2, __threadfence() affects threads across the entire device for reads/writes to global memory.
If that's the case, then mem_fence() and __threadfence() are semantically different; is it still possible to translate a CUDA program that uses __threadfence() to OpenCL?
No. In OpenCL, global synchronization happens at the kernel-execution level, so you must run multiple kernels. I think it is similar in CUDA, if the compiler automatically splits the kernel at __threadfence() as a boundary.
Originally posted by: nou No. In OpenCL, global synchronization happens at the kernel-execution level, so you must run multiple kernels. I think it is similar in CUDA, if the compiler automatically splits the kernel at __threadfence() as a boundary.
Are you agreeing with what I said, that the mem_fence() in OpenCL and __threadfence() in CUDA are semantically different? Or are they semantically the same?
Please note that the programming language semantics should be independent of how the source code is compiled into binaries supported by the hardware.
Thanks again for your reply. I really appreciate it.
What matters is what the specification says. I do not know CUDA, so I can't say for certain, but it seems that __threadfence() is similar to mem_fence(), though __threadfence() can be stronger than mem_fence() in terms of scope.
Synchronization with barrier() and mem_fence() is only between work-items in a work-group; global synchronization is at the kernel-execution level. That is all I can say.
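The point about global synchronization happening between kernel launches can be sketched on the host side (a minimal sketch; `queue`, `producer`, `consumer`, `gsize`, and `lsize` are hypothetical objects assumed to be already created):

```c
/* Host-side sketch: in OpenCL 1.0, cross-work-group synchronization
 * is achieved by splitting the work into two kernel launches. All
 * work-items of `producer` finish, and their global-memory writes
 * become visible, before any work-item of `consumer` starts. */
cl_event producer_done;
clEnqueueNDRangeKernel(queue, producer, 1, NULL, &gsize, &lsize,
                       0, NULL, &producer_done);
clEnqueueNDRangeKernel(queue, consumer, 1, NULL, &gsize, &lsize,
                       1, &producer_done, NULL);
clFinish(queue);
```

On an in-order queue the event is not strictly required, but making the dependency explicit keeps the sketch correct for out-of-order queues as well.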
Originally posted by: edward_yang Originally posted by: nou No. In OpenCL, global synchronization happens at the kernel-execution level, so you must run multiple kernels. I think it is similar in CUDA, if the compiler automatically splits the kernel at __threadfence() as a boundary.
Are you agreeing with what I said, that the mem_fence() in OpenCL and __threadfence() in CUDA are semantically different? Or are they semantically the same?
Please note that the programming language semantics should be independent of how the source code is compiled into binaries supported by the hardware.
Thanks again for your reply. I really appreciate it.
Edward_yang,
You are right that mem_fence() and __threadfence() are semantically different. mem_fence(GLOBAL | LOCAL) and __threadfence_block() are semantically the same.
Thank you.
Maybe AMD should change the description on the CUDA porting page? It is a bit misleading as it is now. Thanks.
Originally posted by: edward_yang Thank you.
Maybe AMD should change the description on the CUDA porting page? It is a bit misleading as it is now. Thanks.
Thanks for finding this. We have reported it to the doc writer.
Sorry, but I do not think so.
IMHO, the OpenCL spec does not say that mem_fence indicates the scope it will impact; i.e., mem_fence just guarantees the memory-operation order of the calling thread, and it does not block/sync other threads. So its visibility scope should be the same as the visibility of the memory space it reads/writes.
so can we say
mem_fence(LOCAL) == __threadfence_block()
mem_fence(GLOBAL) == __threadfence() ?
That would be my understanding, yes. Fences provide no synchronisation. When it says "visible to all threads in the device" I read that the same way I read mem_fence(..GLOBAL..) which is:
Once the fence operation completes any writes to global memory made prior to the fence by this thread are guaranteed to have committed to memory. Therefore any reads of that address initiated from this point on by any thread will read the new value.
What you have no control over is when other threads will read the value, which is where a fence differs from a barrier. As far as I understand, mem_fence(..LOCAL..) is actually a no-op when compiled for the AMD GPUs because LDS reads and writes instantly commit.
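The ordering guarantee described here (writes committed before the fence completes, but no control over when other threads read) can be sketched as a publish pattern; the kernel and names are hypothetical:

```c
// Hypothetical OpenCL C sketch of what a global fence does and does
// not promise.
__kernel void publish(__global int *data, __global volatile int *flag)
{
    if (get_global_id(0) == 0) {
        data[0] = 42;                     // write the payload first
        mem_fence(CLK_GLOBAL_MEM_FENCE);  // payload committed before flag
        flag[0] = 1;                      // a thread that observes
                                          // flag==1 also sees data[0]==42
    }
    // NOTE: the fence does not make other work-items wait. Whether, or
    // when, any other work-item polls flag is outside its control; that
    // would need a barrier (same work-group) or a separate launch.
}
```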
What about mem_fence(GLOBAL)? Does it have cache-coherence problems here?
Originally posted by: LeeHowes As far as I understand, mem_fence(..LOCAL..) is actually a noop when compiled for the AMD GPUs because LDS reads and writes instantly commit.
If you use a larger than 64 workgroup size on Cypress, say, then you will see a GROUP_BARRIER in the ISA. Since this larger workgroup size spans hardware threads, the execution order of LDS reads and writes is no longer guaranteed amongst these hardware threads, so the barrier is required.
Are you saying the fence generates a barrier? If that's true then we're back to the assumption for consistency that a global fence should generate a global barrier. I don't think either is necessary in the definition of a memory fence.
I haven't checked what our implementation does, so I shall leave this to Micah to clarify as he's a compiler person.
Originally posted by: MicahVillmow The only exception is synchronization on the local address space by work-items in a work-group via the barrier instruction.
Yes, that's all I was referring to, that local memory writes must invoke a barrier in ISA when the workgroup size is larger than the hardware thread size. This barrier isn't required when the workgroup size matches the hardware thread size, which is cool because in ISA it means a write to local in one instruction can be immediately followed by a read in the next instruction... So, that's effectively synch-less local memory.
I wasn't making a comment on fence in general, merely that local doesn't always "instantly commit" as Lee was suggesting.
I wasn't aware of the errant behaviour in SDK 2.1! I suppose that behaviour is required for Direct Compute, but I'm not sure.
According to the OpenCL spec, mem_fence should work with image load operations. In ATI's OpenCL it doesn't; the CAL compiler can move those around ... .