22 Replies Latest reply on Jul 5, 2010 6:41 AM by hazeman

    barrier vs mem_fence?

    alexaverbuch

      Hi,

      What does barrier do beyond what mem_fence already does?

      From what I understand, it looks like mem_fence allows the kernel execution to continue beyond the mem_fence UNTIL it reaches a load/store operation... at which point it blocks until all pre-mem_fence work-group loads/stores have completed before continuing.

      And barrier is even more "strict", as it blocks ALL execution (including but not limited to loads/stores) until all work-items in the work-group reach the barrier.

      Is this correct?

      Also, does write_mem_fence:

      1. Wait for all pre-mem_fence stores to complete before allowing future stores? OR
      2. Block on post-mem_fence stores until ALL pre-mem_fence operations have completed? OR
      3. Something else

      Sorry for the onslaught of questions, and thanks for all the help so far.

      Alex

        • barrier vs mem_fence?
          MicahVillmow
          mem_fence does not cause execution to stop at that point; it only ensures that memory operations will not get reordered around the fence instruction. A barrier guarantees that all work-items reach that point before any work-item moves to the next instruction.
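          To make the distinction concrete, here is a minimal hypothetical OpenCL C sketch (not from this thread); the kernel name and buffers are invented for illustration:

          ```c
          // Hypothetical kernel illustrating the difference. mem_fence() only
          // constrains the order of THIS work-item's memory operations; barrier()
          // is a true rendezvous point for the whole work-group.
          __kernel void fence_vs_barrier(__global int *buf, __local int *scratch)
          {
              int lid = get_local_id(0);
              int gid = get_global_id(0);

              scratch[lid] = buf[gid];

              // Ordering only: the store above will not be reordered past this
              // fence relative to this work-item's later loads/stores. Other
              // work-items may not have executed their stores yet.
              mem_fence(CLK_LOCAL_MEM_FENCE);

              // Synchronization: every work-item in the group reaches this point
              // before any continues, so reading a neighbour's slot is now safe.
              barrier(CLK_LOCAL_MEM_FENCE);

              buf[gid] = scratch[(lid + 1) % get_local_size(0)];
          }
          ```

          Note that the fence alone would not make the neighbour read safe; it is the barrier that provides the synchronization.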
            • barrier vs mem_fence?
              alexaverbuch

              Thanks Micah,

              So if I understand this...

              All global stores/loads before a mem_fence(global) call are guaranteed to complete before any global stores/loads after the mem_fence(global) call can start

              Is this correct?

              If so it will be a welcome replacement to my barrier

              Alex

            • barrier vs mem_fence?
              MicahVillmow
              Yes, that is correct.
                • barrier vs mem_fence?
                  edward_yang

                  Thanks for the discussion. I also have a few more questions.

                  1. In AMD's "porting from CUDA" page, it is said that barrier() corresponds to CUDA __syncthreads() while mem_fence() corresponds to __threadfence(). Is this a precise equivalence, or just "roughly" comparable?

                  2. Is it true that calling mem_fence() on global memory (CLK_GLOBAL_MEM_FENCE) will ensure load/store ordering across all work-items in all work-groups? In other words, does global mem_fence provide a mechanism for communication across work-groups?

                  3. On p.199 of OpenCL spec 1.0.48, it is said that the "barrier function also queues a memory fence (reads and writes) to ensure correct ordering of memory operations." If the barrier is called with CLK_GLOBAL_MEM_FENCE, does it also synchronize the memory operations across work-groups?

                  Thanks in advance!

                  EDIT:

                  In particular, it is possible that a barrier will synchronize read/write to global memory only within a work-group. From your explanation above, it appears that a mem_fence to global memory will guarantee ordering across work-groups. Is this how they (barrier vs mem_fence) differ with respect to memory operations?

                  Thanks again!

                    • barrier vs mem_fence?
                      nou

                      No, barrier and mem_fence synchronize only within one work-group.

                        • barrier vs mem_fence?
                          edward_yang


                          Originally posted by: nou no barrier and mem_fence synchronize only across one work-group.


                          Thanks for the quick reply.

                          But according to CUDA 2.2, __threadfence() "waits until all global memory accesses made by the calling thread prior to __threadfence() are visible to all threads in the device for global memory ..."

                          That's why I asked whether the correspondence of OpenCL mem_fence() to CUDA __threadfence() is a precise one?

                            • barrier vs mem_fence?
                              nou

                              OCL spec 3.3.1

                              Global memory is consistent across work-items in a single work-group at a work-group barrier, but there are no guarantees of memory consistency between different work-groups executing a kernel.

                              It is possible that NVIDIA translates barrier() as __syncthreads() and mem_fence() as __threadfence().

                                • barrier vs mem_fence?
                                  edward_yang


                                  Originally posted by: nou OCL spec 3.3.1


                                  Global memory is consistent across work-items in a single work-group at a work-group barrier, but there are no guarantees of memory consistency between different work-groups executing a kernel.


                                  it is posible that nVidia translate barrier() as __syncthread() and mem_fence() as __threadfence().


                                  Again, thanks for your reply. I apologize for my insistence, but this fine point is important for the correctness of some kernels when translating from CUDA to OpenCL.

                                  So can we say this:

                                  * In OpenCL 1.0, mem_fence() affects only work-items in the same work-group, even when reading/writing global memory.

                                  * In CUDA 2.2, __threadfence() affects threads across the entire device when reading/writing global memory.

                                  If that's the case, then mem_fence() and __threadfence() are semantically different; is it then still possible to translate a CUDA program with __threadfence() to OpenCL?

                                    • barrier vs mem_fence?
                                      nou

                                      No. In OpenCL, global synchronization happens at the kernel-launch level, so you must run multiple kernels. I think it is similar in CUDA, with the compiler automatically breaking the kernel at __threadfence() as a boundary.
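                                      A hypothetical host-side fragment showing what "run multiple kernels" looks like in practice; `queue`, `pass1`, `pass2`, and `gsize` are assumed to already exist:

                                      ```c
                                      /* Splitting the work into two kernel launches is the OpenCL way to get a
                                       * device-wide synchronization point between all work-groups. */
                                      clEnqueueNDRangeKernel(queue, pass1, 1, NULL, &gsize, NULL, 0, NULL, NULL);
                                      /* With an in-order queue, pass1 completes -- and its global-memory writes
                                       * become visible -- before pass2 begins. */
                                      clEnqueueNDRangeKernel(queue, pass2, 1, NULL, &gsize, NULL, 0, NULL, NULL);
                                      clFinish(queue); /* block the host until both passes have finished */
                                      ```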

                                        • barrier vs mem_fence?
                                          edward_yang


                                          Originally posted by: nou no. in opencl global synchronization are on kernel run level. so you must run multiple kernel. i think that similiar it is similiar in CUDA when compiler broke automaticly kernel  on __thread_fence() as border.


                                          Are you agreeing with what I said, that the mem_fence() in OpenCL and __threadfence() in CUDA are semantically different? Or are they semantically the same?

                                          Please note that the programming language semantics should be independent of how the source code is compiled into binaries supported by the hardware.

                                          Thanks again for your reply. I really appreciate it.

                                            • barrier vs mem_fence?
                                              nou

                                              What matters is what the specification says. I do not know CUDA, so I can't say for certain, but it seems that __threadfence() is similar to mem_fence(), though __threadfence() can be stronger than mem_fence() in terms of scope.

                                              Synchronization with barrier() and mem_fence() is only between work-items in a work-group; global synchronization is at the kernel-execution level. That is all I can say.

                                              • barrier vs mem_fence?
                                                genaganna


                                                Originally posted by: edward_yang
                                                Originally posted by: nou no. in opencl global synchronization are on kernel run level. so you must run multiple kernel. i think that similiar it is similiar in CUDA when compiler broke automaticly kernel  on __thread_fence() as border.



                                                Are you agreeing with what I said, that the mem_fence() in OpenCL and __threadfence() in CUDA are semantically different? Or are they semantically the same?


                                                Please note that the programming language semantics should be independent of how the source code is compiled into binaries supported by the hardware.


                                                Thanks again for your reply. I really appreciate it.


                                                Edward_yang,

                                                You are right that mem_fence() and __threadfence() are semantically different. mem_fence(GLOBAL | LOCAL) and __threadfence_block() are semantically the same.

                                                  • barrier vs mem_fence?
                                                    edward_yang

                                                    Thank you.

                                                    Maybe AMD should change the description on the CUDA porting page? It is a bit misleading as it is now. Thanks.

                                                      • barrier vs mem_fence?
                                                        genaganna


                                                        Originally posted by: edward_yang Thank you.


                                                        Maybe AMD should change the description on the CUDA porting page? It is a bit misleading as it is now. Thanks.


                                                        Thanks for finding this. We have reported it to the doc writer.

                                                        • barrier vs mem_fence?
                                                          probing

                                                          Sorry, but I do not think so.

                                                          IMHO, the OpenCL spec does not indicate what scope mem_fence will impact; i.e., mem_fence just guarantees the memory-operation order of the calling thread, it does not block/sync other threads. So its visibility scope should be the same as the visibility of the memory space it reads/writes.

                                                          so can we say

                                                          mem_fence(LOCAL) == __threadfence_block()

                                                          mem_fence(GLOBAL) == __threadfence() ?


                                                            • barrier vs mem_fence?
                                                              LeeHowes

                                                              That would be my understanding, yes. Fences provide no synchronisation. When it says "visible to all threads in the device", I read that the same way I read mem_fence(..GLOBAL..), which is:


                                                              Once the fence operation completes any writes to global memory made prior to the fence by this thread are guaranteed to have committed to memory. Therefore any reads of that address initiated from this point on by any thread will read the new value.



                                                              What you have no control over is when other threads will read the value, which is where a fence differs from a barrier. As far as I understand, mem_fence(..LOCAL..) is actually a no-op when compiled for the AMD GPUs, because LDS reads and writes commit instantly.
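                                                              A hypothetical sketch of the commit-then-read guarantee described above (kernel and buffer names invented for illustration):

                                                              ```c
                                                              // One work-item publishes a payload and then raises a flag. The write
                                                              // fence guarantees the payload store commits before the flag store, but
                                                              // nothing controls WHEN another work-item will observe the flag.
                                                              __kernel void publish(__global int *data, __global volatile int *flag)
                                                              {
                                                                  if (get_global_id(0) == 0) {
                                                                      data[0] = 42;                           // payload first
                                                                      write_mem_fence(CLK_GLOBAL_MEM_FENCE);  // payload commits before flag
                                                                      *flag = 1;                              // then the flag
                                                                  }
                                                              }
                                                              ```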

                                                                • barrier vs mem_fence?
                                                                  probing

                                                                  What about mem_fence(GLOBAL)? Does it have cache-coherence problems here?

                                                                  • barrier vs mem_fence?
                                                                    Jawed


                                                                    Originally posted by: LeeHowes As far as I understand, mem_fence(..LOCAL..) is actually a noop when compiled for the AMD GPUs because LDS reads and writes instantly commit.


                                                                    If you use a workgroup size larger than 64 on Cypress, say, then you will see a GROUP_BARRIER in the ISA. Since this larger workgroup size spans hardware threads, the execution order of LDS reads and writes is no longer guaranteed amongst these hardware threads, so the barrier is required.

                                                                      • barrier vs mem_fence?
                                                                        LeeHowes

                                                                        Are you saying the fence generates a barrier? If that's true then we're back to the assumption for consistency that a global fence should generate a global barrier. I don't think either is necessary in the definition of a memory fence.

                                                                        I haven't checked what our implementation does, so I shall leave this to Micah to clarify as he's a compiler person.

                                              • barrier vs mem_fence?
                                                MicahVillmow
                                                Jawed,
                                                That is a bug in SDK 2.1 and is fixed in the upcoming release; fence operations will no longer trigger barrier instructions.

                                                As for a fence operation:
                                                A fence operation instructs the compiler not to reorder any memory instructions around the fence instruction. There is no synchronization done, so on a mem_fence instruction there is no guarantee that any load/store from another work-item to either local or global memory is visible to the current work-item. The only guarantee of mem_fence is that loads/stores before the fence will be executed before loads/stores after the fence. Memory consistency in OpenCL is only guaranteed within a work-item, and that work-item is unique in its access of memory throughout the NDRange. The only exception is synchronization on the local address space by work-items in a work-group via the barrier instruction.
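                                                That one guaranteed case (local memory plus barrier) is what makes the classic work-group reduction correct; a hypothetical sketch, assuming a power-of-two work-group size:

                                                ```c
                                                __kernel void reduce(__global const int *in, __global int *out,
                                                                     __local int *tmp)
                                                {
                                                    int lid = get_local_id(0);
                                                    tmp[lid] = in[get_global_id(0)];
                                                    for (int stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
                                                        // A mem_fence here would NOT be enough: we need every work-item's
                                                        // store from the previous step to be visible before reading it.
                                                        barrier(CLK_LOCAL_MEM_FENCE);
                                                        if (lid < stride)
                                                            tmp[lid] += tmp[lid + stride];
                                                    }
                                                    if (lid == 0)
                                                        out[get_group_id(0)] = tmp[0];
                                                }
                                                ```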
                                                  • barrier vs mem_fence?
                                                    Jawed


                                                    Originally posted by: MicahVillmowThe only exception is synchronization on the local address space by work-items in a work group via the barrier instruction.


                                                    Yes, that's all I was referring to: local memory writes must invoke a barrier in ISA when the workgroup size is larger than the hardware thread size. This barrier isn't required when the workgroup size matches the hardware thread size, which is cool because in ISA a write to local memory in one instruction can be immediately followed by a read in the next instruction... so that's effectively synch-less local memory.

                                                    I wasn't making a comment on fence in general, merely that local doesn't always "instantly commit" as Lee was suggesting.

                                                    I wasn't aware of the errant behaviour in SDK 2.1! I suppose that behaviour is required for Direct Compute, but I'm not sure.