I am not sure if I understand the question correctly. But from what I understand, you seem to be interested in how global barriers work on AMD GPUs and how they can synchronize properly.
The OpenCL specification does not specify any global synchronization; barrier() is only applicable for work-group-level synchronization. Since a work-group is either completely in a "wait" state or completely in an "active" state, the thread-starvation scenario you explained above does not happen.
Also, there is a limit on the work-group size for every device. You can check it with clinfo or query it using the clGetDeviceInfo() API.
The COMPLETE work-group is dispatched at once to a compute unit; some wavefronts may be idle while others are computing. The complete work-group must be dispatched together for work-group-level synchronization like barrier() to work. It never happens that only part of a work-group is dispatched. This is one of the reasons for the limit on the work-group size.
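As a sketch, the per-device limit mentioned above can be queried like this on the host (assuming a valid `device` handle; error checking omitted):

```c
/* Host-side sketch: query the device's work-group size limit.
 * 'device' is assumed to be a valid cl_device_id obtained earlier. */
size_t max_wg_size = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof(max_wg_size), &max_wg_size, NULL);
printf("max work-group size: %zu\n", max_wg_size);
```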
Regardless of whether CLK_LOCAL_MEM_FENCE or CLK_GLOBAL_MEM_FENCE is used, barrier() only requires the work-items inside a work-group to synchronize. There is no global execution barrier. The LOCAL and GLOBAL flags apply only to the memory-fence operation introduced by the barrier.
So barrier() does two things (please read the OpenCL spec as well):
1. It serves as an execution barrier for all work-items inside a work-group.
2. It introduces a memory-fence operation, either for local memory or for global memory, as requested by the programmer.
The memory fence makes sure that all pending writes to that memory are completed before any subsequent reads or writes are allowed.
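To illustrate the intra-work-group scope described above, here is a hypothetical reduction kernel (not from this thread) where barrier() is both the execution barrier and the local-memory fence for one work-group:

```c
/* OpenCL C kernel sketch (hypothetical example): a per-work-group
 * partial sum that needs only intra-work-group synchronization.
 * Assumes get_local_size(0) is a power of two. */
__kernel void partial_sum(__global const float *in,
                          __global float *out,
                          __local float *scratch)
{
    size_t lid = get_local_id(0);

    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);   /* all work-items of THIS group wait here */

    for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);  /* uniform: every work-item hits it */
    }
    if (lid == 0)
        out[get_group_id(0)] = scratch[0];  /* one partial result per group */
}
```

Note that no barrier here, with either flag, orders anything between two different work-groups; combining the per-group partial results requires a second kernel launch or host-side code.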
OpenCL Spec says this about CLK_GLOBAL_MEM_FENCE:
"This can be useful when work-items, for example, write to buffer or image objects and then want to read the updated data."
Hope this helps
Thank you all for correcting my understanding of barrier(CLK_GLOBAL_MEM_FENCE); I now see why I misunderstood it:
"This can be useful when work-items, for example, write to buffer or image objects and then want to read the updated data"
Consider the case where all work-groups work on the same buffer object and write to locations other than the ones they read afterwards. If that is possible, then the statement above is not right, is it?
I am considering the following situation:
I have a global memory buffer that should be initialized by the threads in parallel, even across work-groups, for good memory performance. But then I have to ensure that all threads taking part in the initialization have finished before I continue working on this buffer.
Now I see that the barrier only synchronizes inside a work-group, so a simple barrier(CLK_GLOBAL_MEM_FENCE) does not solve this problem, right? Do you think it is possible to use many work-groups here without risking a deadlock?
One solution I see would be to let the first work-group do the initialization alone and write a finish flag into a global buffer that the other work-groups wait on.
Am I overcomplicating this, or is this problem really not trivial?
This would work (although it is highly not recommended) only if you assume the first work-group is scheduled to launch at the beginning.
You never know: different runtimes might schedule the work-groups differently, and you might end up in a deadlock.
Also, polling global memory in a while loop would be a serious performance issue.
Another way to achieve a global work-group sync is to use multiple kernel launches. Kernel launches serve as sync points.
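A minimal host-side sketch of the multiple-launch approach, assuming an in-order command queue and two hypothetical kernels named `init_buffer` and `process_buffer` (error checking omitted):

```c
/* Host-side sketch (hypothetical kernel names, error checking omitted).
 * On an in-order command queue, the second kernel cannot start until
 * the first has fully finished, so the launch boundary acts as the
 * global barrier that barrier() cannot provide. */
cl_kernel init_k = clCreateKernel(program, "init_buffer", &err);
cl_kernel work_k = clCreateKernel(program, "process_buffer", &err);

clSetKernelArg(init_k, 0, sizeof(cl_mem), &buf);
clSetKernelArg(work_k, 0, sizeof(cl_mem), &buf);

size_t global = 1024 * 256;   /* many work-groups initialize in parallel */
clEnqueueNDRangeKernel(queue, init_k, 1, NULL, &global, NULL, 0, NULL, NULL);
/* implicit sync point: all work-groups of init_buffer complete here */
clEnqueueNDRangeKernel(queue, work_k, 1, NULL, &global, NULL, 0, NULL, NULL);
clFinish(queue);
```

For the initialize-then-use problem described above, this is the standard, portable solution: the initialization kernel runs with as many work-groups as you like, and the second launch sees all of its writes.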
Your understanding and rationale are correct. But you have to note that barrier() itself is defined only for work-items belonging to one work-group. No inter-work-group relations are considered.
As you rightly put it, one really cannot use this global-fence barrier for global synchronization, i.e. one cannot assume that "all" work-items in the kernel have written what they wanted to write.
That is simply because "all" of them do not execute simultaneously (work-groups are typically scheduled in batches, one batch after another).
Also, not "all" work-groups of the kernel execute that barrier at the same time. So it should not be used for any inter-work-group communication.
The only place where this kind of inter-work-group communication might work is probably while "searching".
Say all work-groups are searching for some solution, and the one work-group that finds the solution has to signal a FLAG to stop the others from searching; this global fence would be handy there. The fence must be used by all work-groups both when reading and when writing the FLAG.
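The early-exit flag pattern described above might look like the following hypothetical kernel. This is best-effort signaling only, not a synchronization guarantee: work-groups that were never scheduled concurrently simply never see the flag early.

```c
/* OpenCL C sketch (hypothetical): each work-item scans a chunk of 'data'
 * and polls a global flag so it can stop early once any work-group has
 * found the target. Assumes n is divisible by the global size. */
__kernel void search(__global const int *data,
                     int n, int target,
                     __global volatile int *found_flag,
                     __global int *result)
{
    int chunk = n / (int)get_global_size(0);
    int start = (int)get_global_id(0) * chunk;

    for (int i = 0; i < chunk; ++i) {
        /* uniform barrier: every work-item of the group executes every
         * iteration, and the global fence re-reads the flag from memory */
        barrier(CLK_GLOBAL_MEM_FENCE);
        if (*found_flag)
            return;                    /* some work-group already found it */
        if (data[start + i] == target) {
            *result = start + i;       /* best-effort: last writer wins */
            *found_flag = 1;           /* signal the other work-groups */
        }
    }
}
```

Note the barrier is placed where all work-items of a group reach it together; calling barrier() inside divergent control flow is undefined behavior.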
The example described in the spec (reading back what was written earlier) probably holds good only intra-work-group. That is what one can assume.
Thank you all for your further advice.
As I was replying to another thread, I thought it was applicable to your thread as well.
Please check the "Memory Consistency" section in the OpenCL spec.
This is how it begins:
3.3.1 Memory Consistency
"OpenCL uses a relaxed consistency memory model; i.e. the state of memory visible to a work-item is not guaranteed to be consistent across the collection of work-items at all times. Within a work-item memory has load / store consistency. Local memory is consistent across work-items in a single work-group at a work-group barrier. Global memory is consistent across work-items in a single work-group at a work-group barrier, but there are no guarantees of memory consistency between different work-groups executing a kernel."