Archives Discussions

cguenther · ‎01-06-2013

Hi guys,

I hope you can help me with the following hardware-dependent scheduling problem. The number of active wavefronts of a kernel is limited by various hardware resources, such as the number of scalar and vector registers or the amount of local memory.

In addition, a set of threads that are currently idle also resides on the GPU and reserves hardware units. (Which ones?) This is necessary so that they can be scheduled in quickly to hide memory latencies. Threads that do not fit onto the GPU, even in idle mode, get dispatched when others have finished their work.

So here is what I don't understand: how do the threads get synchronized at a global barrier when the threads on the GPU can't continue, while the threads that are not yet dispatched to the GPU can't start? Does this mean that the global barrier is only consistent for the threads resident on the GPU, or does the driver do some tricks to schedule everything around? I think the second option would be very slow, if it is really done.

Please correct my understanding of OpenCL work scheduling on AMD GPUs. I hope I described the problem clearly.

With best regards,

Christian Günther

heman · ‎01-06-2013

Hi,

I am not sure if I understand the question correctly.

But from what I understand, you seem to be interested in how global barriers work on AMD GPUs and how they can sync properly.

The OpenCL specification does not define any global synchronization. A barrier only applies to synchronization at work-group level. As a workgroup is either completely in the "wait" state or in the "active" state, the thread starvation scenario you described above does not happen.

regards

workitem7

krrishnarraj · ‎01-06-2013

Also, there is a limit on workgroup size for every device. You can check it in clinfo or query it with the clGetDeviceInfo() API (CL_DEVICE_MAX_WORK_GROUP_SIZE).

A COMPLETE workgroup is dispatched at once to a Compute Unit. Some wavefronts may be idle while others are computing. The complete workgroup has to be dispatched for workgroup-level synchronization like barrier() to work; a workgroup is never dispatched in parts. This is one of the reasons for the limit on workgroup size.

developer · ‎01-07-2013

Regardless of whether CLK_LOCAL_MEM_FENCE or CLK_GLOBAL_MEM_FENCE is used, a barrier only requires the work-items inside a workgroup to synchronize. There is no global execution barrier. The LOCAL and GLOBAL flags apply only to the "memory fence" operation introduced by the barrier.

So barrier() does two things (please read the OpenCL spec as well):

1. Serve as an execution barrier for all work-items inside a workgroup

2. Introduce a memory fence operation, either for local memory or for global memory, as requested by the programmer

The memory fence makes sure that all pending writes to that memory are completed before any further reads or writes are allowed.
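
The two roles of the barrier can be illustrated with a small host-side simulation. This is only a sketch in Python, not OpenCL code: each thread plays one work-item of a single workgroup, `threading.Barrier` stands in for `barrier(...)`, and the names `run_work_group`, `shared` and `result` are made up for the illustration.

```python
import threading

def run_work_group(group_size):
    """Simulate one workgroup; each thread plays the role of one work-item."""
    shared = [None] * group_size              # stands in for local or global memory
    result = [None] * group_size
    barrier = threading.Barrier(group_size)   # stands in for barrier(...) in a kernel

    def work_item(local_id):
        shared[local_id] = local_id * 10      # phase 1: write own slot
        barrier.wait()                        # every work-item of THIS group arrives here
        neighbour = (local_id + 1) % group_size
        result[local_id] = shared[neighbour]  # phase 2: read a neighbour's slot

    threads = [threading.Thread(target=work_item, args=(i,)) for i in range(group_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

print(run_work_group(4))  # prints [10, 20, 30, 0]
```

Without the `barrier.wait()` call a work-item could read its neighbour's slot before it was written. Note that the barrier object is sized for one workgroup only, which mirrors the point above: nothing here synchronizes different workgroups.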

OpenCL Spec says this about CLK_GLOBAL_MEM_FENCE:

"This can be useful when work-items, for example, write to buffer or image objects and then want to read the updated data"

Hope this helps

cguenther · ‎01-07-2013

Thank you all for correcting my understanding of barrier(CLK_GLOBAL_MEM_FENCE). I now see why I misunderstood it:

"This can be useful when work-items, for example, write to buffer or image objects and then want to read the updated data"

Suppose all workgroups work on the same buffer object and each of them writes to locations other than the ones it reads afterwards. In that case the statement above does not hold, does it?

Consider the following situation:

I have a global memory buffer that should be initialized by the threads in parallel, even across parallel workgroups, to get good memory performance. But then I have to ensure that all threads taking part in the initialization have finished before I continue working on this buffer.

But now I see that the barrier only syncs inside the workgroup, so a simple barrier(CLK_GLOBAL_MEM_FENCE) does not solve this problem, right? Do you think it is possible to use many workgroups for this without risking a deadlock?

One solution I see would be to let the first workgroup do the initialization alone and write a finish flag into a global buffer, which the others wait for.

Am I thinking too complicated, or is this problem really not trivial?

krrishnarraj · ‎01-07-2013

This would work (although it is strongly discouraged) if you assume that the first workgroup is scheduled to launch at the beginning.

You never know: different runtimes might schedule the workgroups differently, and you might end up in a deadlock.

Polling global memory in a while loop would also be a serious performance issue.

Another way to sync all workgroups globally is to use multiple kernel launches. On an in-order command queue, kernel launches serve as sync points: the second kernel starts only after the first one has finished.
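
The two-launch idea can be sketched as a host-side simulation. This is plain Python, not the OpenCL API: `launch` is a hypothetical stand-in for enqueueing a kernel on an in-order queue, and `init_kernel` and `use_kernel` are made-up kernels. Workgroups run in a shuffled order to mimic a schedule the programmer cannot rely on.

```python
import random

def launch(kernel, num_groups, group_size, *buffers):
    """Hypothetical stand-in for one kernel launch on an in-order queue.

    Workgroups run in a random order. Returning from this function plays
    the role of the kernel boundary, i.e. the global sync point.
    """
    order = list(range(num_groups))
    random.shuffle(order)
    for group_id in order:
        for local_id in range(group_size):
            kernel(group_id * group_size + local_id, *buffers)

def init_kernel(gid, data):
    data[gid] = gid + 1                      # each work-item initializes one element

def use_kernel(gid, data, out):
    n = len(data)
    # Reads an element half a buffer away; with the default sizes below,
    # that element was written by a different workgroup.
    out[gid] = data[gid] + data[(gid + n // 2) % n]

def run(num_groups=8, group_size=4):
    n = num_groups * group_size
    data, out = [None] * n, [None] * n
    launch(init_kernel, num_groups, group_size, data)      # launch 1: initialize
    launch(use_kernel, num_groups, group_size, data, out)  # launch 2: consume
    return out
```

Because the second launch begins only after the first has returned, `use_kernel` never sees an uninitialized element, whatever order the workgroups ran in. Merging both phases into one kernel would fail as soon as a workgroup ran its read phase before another group had written.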

developer · ‎01-07-2013

Hi,

Your understanding and rationale are correct. But note that the barrier itself is defined only for work-items belonging to one workgroup. Inter-workgroup relations are not considered.

As you rightly put it, one really cannot use this global-fence barrier for global synchronization, i.e. one cannot assume that "all" work-items in the kernel have written what they wanted to write.

That is simply because "all" of them don't execute simultaneously (workgroups are typically scheduled in batches, one batch after another).

Also, not "all" workgroups of the kernel would execute that barrier at the same time. So it should not be used for any inter-workgroup communication.
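
Why a kernel-wide barrier could not work under batch scheduling can be shown with a toy model. The sketch below is an assumption-laden simplification, not a description of any real driver: at most `resident_slots` workgroups are on the device at once, and a workgroup frees its slot only when it finishes the kernel. The function name and parameters are invented for the illustration.

```python
def global_barrier_completes(num_groups, resident_slots):
    """Return True if a hypothetical kernel-wide barrier could ever be released.

    Assumed model: a workgroup occupies a slot until it finishes the kernel,
    and it cannot finish while it is blocked at the barrier.
    """
    waiting_at_barrier = 0
    not_yet_dispatched = num_groups
    free_slots = resident_slots
    # Dispatch workgroups while slots are free; each runs up to the barrier and blocks.
    while free_slots > 0 and not_yet_dispatched > 0:
        free_slots -= 1
        not_yet_dispatched -= 1
        waiting_at_barrier += 1
    # The barrier opens only once every workgroup of the kernel has arrived.
    # Blocked groups never free a slot, so nothing more can be dispatched.
    return waiting_at_barrier == num_groups

print(global_barrier_completes(8, 8))    # prints True
print(global_barrier_completes(16, 8))   # prints False
```

As soon as the kernel has more workgroups than fit on the device at once, the resident groups wait for groups that can never start, which is the deadlock described in the original question.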

The only place where inter-workgroup communication would work is probably while "searching".

Say all workgroups are searching for some solution, and the one workgroup that finds it has to set a "FLAG" to stop the others from searching. Here the global fence would be handy. The fence must be used by all workgroups, both when reading and when writing the FLAG.
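
The search pattern can be sketched as a host-side simulation. Again this is plain Python under stated assumptions, not OpenCL: workgroups run in batches of `resident_slots`, the dictionary `flag` stands in for a flag in global memory, and each group checks the flag only once, before it starts. All names are hypothetical.

```python
def search_with_flag(data, target, num_groups, resident_slots):
    """Workgroups scan chunks of `data` batch by batch; a shared flag stops the rest."""
    flag = {"found_at": None}                  # stands in for a flag in global memory
    chunk = (len(data) + num_groups - 1) // num_groups
    groups_that_searched = 0
    for batch_start in range(0, num_groups, resident_slots):
        if flag["found_at"] is not None:
            break                              # later groups see the flag and do no work
        batch_end = min(batch_start + resident_slots, num_groups)
        for group_id in range(batch_start, batch_end):
            groups_that_searched += 1
            lo = group_id * chunk
            hi = min(lo + chunk, len(data))
            for i in range(lo, hi):
                if data[i] == target:
                    flag["found_at"] = i       # the finder writes the flag
    return flag["found_at"], groups_that_searched

print(search_with_flag(list(range(100)), 10, 10, 2))  # prints (10, 2)
```

The important property is that no workgroup ever waits for another one. The flag is only an early exit, so the kernel finishes correctly even if a group never sees the flag set, which is why this pattern does not run into the deadlock of a global barrier.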

The example described in the spec (reading back what was written earlier) probably holds only within a workgroup. That's all one can assume.

Best Regards,

Workitem 6

cguenther · ‎01-07-2013

Thank you all for your further advice.

developer · ‎01-09-2013

While replying to another thread, I thought that this applies to your thread as well.

Please check "Memory consistency" section in OpenCL spec.

This is how it begins....

<<<<<<<<

3.3.1 Memory Consistency

OpenCL uses a relaxed consistency memory model; i.e. the state of memory visible to a work-item is not guaranteed to be consistent across the collection of work-items at all times.

Within a work-item memory has load / store consistency. Local memory is consistent across work-items in a single work-group at a work-group barrier. Global memory is consistent across work-items in a single work-group at a work-group barrier, but there are no guarantees of memory consistency between different work-groups executing a kernel.

>>>>>>>>>>>

Are global barriers safe, even with a huge global work size?