
The definition of CLK_GLOBAL_MEM_FENCE and mem_fence and their effect on performance

Question asked by cocular on Jun 16, 2013
Latest reply on Jul 1, 2013 by cocular


  I recently wanted to implement a priority queue in OpenCL and have some doubts about barrier and mem_fence.  Here is my understanding.

  1. barrier(CLK_GLOBAL_MEM_FENCE):
    1. It makes sure that all the work-items in the same work-group reach this barrier before any of them continues.
    2. It makes sure that all writes to global memory made by the current work-item before the barrier can be read correctly by the other work-items in the same work-group after the barrier.
  2. mem_fence:
    1. It makes sure that all writes made by the current work-item before the fence can be read correctly by this same work-item after the fence.
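
To make the question concrete, here is a minimal sketch of the two cases as I understand them. The kernel names, buffer names and the neighbor-read pattern are made up for illustration, and it assumes a 1-D NDRange with no global offset. It is not tested on any particular device.

```c
// Hypothetical kernels for illustration only; all names are assumptions.

// Case 1: barrier(CLK_GLOBAL_MEM_FENCE).
// Each work-item writes one global slot, then reads the slot written by
// another work-item of the SAME work-group.
__kernel void rotate_in_group(__global int *data, __global int *out)
{
    size_t gid  = get_global_id(0);
    size_t lid  = get_local_id(0);
    size_t lsz  = get_local_size(0);
    size_t base = get_group_id(0) * lsz;  // first index of this work-group
                                          // (assumes zero global offset)

    data[gid] = (int)gid * 2;             // write my own slot

    // Every work-item of the work-group must arrive here before any
    // continues; the global writes above should then be visible to the
    // other work-items of this work-group.
    barrier(CLK_GLOBAL_MEM_FENCE);

    size_t neighbor = base + (lid + 1) % lsz;  // stays inside the work-group
    out[gid] = data[neighbor];
}

// Case 2: mem_fence(CLK_GLOBAL_MEM_FENCE).
// This only orders the loads and stores of the work-item that executes it;
// it does not make any work-item wait for another.
__kernel void fence_only(__global int *data, __global int *flag)
{
    size_t gid = get_global_id(0);

    data[gid] = 1;
    mem_fence(CLK_GLOBAL_MEM_FENCE);  // store to data is ordered before the store to flag
    flag[gid] = 1;
}
```

In the first kernel the read of data[neighbor] is only meant to be safe because the neighbor is in the same work-group; I am not assuming anything about visibility across work-groups.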

Am I missing something?  Is this right?


Now for the performance issue.  I read the AMD Accelerated Parallel Processing OpenCL Programming Guide.  On page 136 there is a figure of the cache hierarchy, and I see that there is one L1 cache per compute unit.  Since a work-group can only run on a single compute unit, does that mean that at a global barrier the GPU does not need to wait 400+ cycles for the write to complete in global memory, and only needs to wait until the write to L1 completes?


If that is the case, what is the latency on an L1 hit?  And on an L2 hit?