I am trying to implement a priority queue in OpenCL and have some doubts about barrier and mem_fence. Here is my understanding.
barrier (with the global memory fence flag):
- It makes sure that all the work-items in the same work-group reach the barrier before any of them continues past it.
- It makes sure that all the writes to global memory made by the current work-item before the barrier can be read correctly by the other work-items in the same work-group after the barrier.
mem_fence:
- It makes sure that all the writes made by the current work-item before the fence can be read correctly by this same work-item after the fence.
Am I missing something? Is this right?
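To make the question concrete, here is a minimal sketch of the two cases as I understand them. The kernel name and the buffers (data, out, check) are made up for illustration, it assumes a 1D launch with no global offset, and the comments state my assumptions rather than confirmed behaviour.

```c
// Hypothetical kernel, for illustration only.
__kernel void barrier_vs_fence(__global int *data,
                               __global int *out,
                               __global int *check)
{
    size_t lid  = get_local_id(0);
    size_t lsz  = get_local_size(0);
    size_t base = get_group_id(0) * lsz;   // first slot of this work-group
    size_t gid  = base + lid;              // this work-item's own slot

    // Each work-item writes its own slot in global memory.
    data[gid] = (int)lid;

    // Assumption: every work-item in the group waits here, and the
    // global writes above are visible to the whole group afterwards.
    barrier(CLK_GLOBAL_MEM_FENCE);

    // Reads a slot written by ANOTHER work-item in the same group.
    int neighbour = data[base + (lid + 1) % lsz];

    // Assumption: mem_fence only orders this work-item's own loads and
    // stores; it does not make other work-items wait.
    out[gid] = neighbour;
    mem_fence(CLK_GLOBAL_MEM_FENCE);

    // Reads back a slot written by THIS work-item only.
    check[gid] = out[gid];
}
```

If my understanding is right, the read of the neighbour's slot is only safe because of the barrier, while the read-back of out[gid] involves a single work-item and needs no barrier.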
Now for the performance question. I read the AMD Accelerated Parallel Processing OpenCL Programming Guide. On page 136 there is a figure of the cache hierarchy, and it shows one L1 cache per compute unit. Since a work-group runs entirely on one compute unit, does a global barrier only need to wait until the writes complete in L1, rather than waiting 400+ cycles for them to reach global memory?
If that is the case, what is the latency on an L1 hit? And on an L2 hit?