
ivalylo
Journeyman III

Global Memory read/write synchronization

Is there any way to write to a global memory buffer in one work-group and read the change in another? I tried with atomic operations, but that does not work if I have more than one work-item per work-group. Can someone explain in simple terms why global synchronization is such a pain? It looks impossible to me to solve a global problem on the GPU. When the problem can't be separated into small independent groups, what is the way to go?

14 Replies
dravisher
Journeyman III

OpenCL does not support global synchronization; the only way to get it (when there is more than one work-group) is to end the current kernel and launch a new one. So the general solution to a global problem is to break it into many small kernels. If most of your data resides in global memory anyway, I think there's a decent chance this will work out well.
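For example, an iterative algorithm can be driven from the host, with each kernel launch acting as the global barrier. A minimal host-side sketch, assuming the queue, kernel, and buffer arguments have already been set up (names are illustrative):

/* Host-side "global synchronization": once one launch completes, the
   next launch sees all of its writes to global memory. */
#include <CL/cl.h>

void run_iterations(cl_command_queue queue, cl_kernel kernel,
                    size_t global_size, size_t local_size, int iterations)
{
    for (int i = 0; i < iterations; ++i) {
        /* Each enqueue is one "globally synchronized" step. */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global_size, &local_size, 0, NULL, NULL);
    }
    /* An in-order queue serializes the launches, so a single clFinish
       at the end is enough. */
    clFinish(queue);
}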

AMD's GPUs do have a global data share, but I haven't read about any plans to add an extension for it.

tanq
Journeyman III

Looks like you need one of the mem_fence functions; look into the OpenCL spec.
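Note, though, that mem_fence only orders the memory operations of the calling work-item; it does not make one work-group wait for another. A kernel-side sketch of what it does give you (names are illustrative):

// mem_fence guarantees this work-item's write to data[] is committed
// before its write to *flag, but provides no cross-group barrier.
__kernel void producer(__global int *data, volatile __global int *flag)
{
    data[get_global_id(0)] = 42;        // write the payload first
    mem_fence(CLK_GLOBAL_MEM_FENCE);    // order: payload before flag
    if (get_global_id(0) == 0)
        *flag = 1;                      // then publish the flag
    // Readers in other work-groups still need atomics/polling; OpenCL
    // 1.x has no barrier that spans work-groups.
}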


MicahVillmow

This is not possible without a global synchronization primitive, which OpenCL does not expose. This can be done with atomics on Evergreen or later chips, but limits the global range to #SIMD work groups.
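A sketch of that technique (illustrative, not production code): the first work-item of every group bumps a global counter and spins until all groups have arrived. This deadlocks unless every participating work-group is resident on the device at the same time, which is where the #SIMD limit comes from.

// Single-use global barrier via atomics (Evergreen or later). *count
// must be zero-initialized by the host before the launch.
void global_barrier(volatile __global int *count, int num_groups)
{
    barrier(CLK_GLOBAL_MEM_FENCE);            // drain this group's writes
    if (get_local_id(0) == 0) {
        atomic_inc(count);                    // signal arrival
        while (atomic_add(count, 0) < num_groups)
            ;                                 // spin until all groups arrive
    }
    barrier(CLK_GLOBAL_MEM_FENCE);            // release the whole group
}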

Originally posted by: MicahVillmow This is not possible without a global synchronization primitive, which OpenCL does not expose. This can be done with atomics on Evergreen or later chips, but limits the global range to #SIMD work groups.


Could you provide some example of this technique? And describe the limits more specifically?



memory pooling



Sorry, something went wrong with the browser.


Originally posted by: MicahVillmow This is not possible without a global synchronization primitive, which OpenCL does not expose. This can be done with atomics on Evergreen or later chips, but limits the global range to #SIMD work groups.


Thanks, it's good to know that it's doable, although not very practical. If we take the 5870 for example, it has 20 SIMD engines, each with 16 thread processors that are 5 floats wide. So I can safely have a global range of 20 work-groups, right?


Well, you can safely have 20 work-groups, so with a work-group size of 64 you get a global size of 20 × 64 = 1280.
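If you don't want to hard-code the 20, the compute-unit count can be queried at runtime; a host-side sketch (the 64 is the wavefront size assumed above, and `device` is an already-obtained cl_device_id):

cl_uint compute_units;
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(compute_units), &compute_units, NULL);
size_t local_size  = 64;                         /* one wavefront */
size_t global_size = compute_units * local_size; /* 20 * 64 = 1280 on a 5870 */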


In fact, what you can do is have as many work-groups as you know can run concurrently (not necessarily in parallel, i.e. you can have more than 20 waves) on the hardware. Unfortunately OpenCL has no way to simply issue exactly that much work (it is a feature I would like to have). You can work it out, though, based on how much LDS you are using per group and how many threads you have per wave. You may find you can have 2 or 4 waves per core, and hence 40 or 80 waves in flight on the device; see the sketch below.
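As a back-of-the-envelope example (the 32 KB of LDS per SIMD is an assumption for this class of hardware, not a queried value):

/* Illustrative occupancy estimate, not a general formula: other limits
   (registers, hardware wave caps) can reduce the real number. */
unsigned lds_per_simd    = 32 * 1024;  /* bytes of LDS per SIMD (assumed) */
unsigned lds_per_group   = 8 * 1024;   /* what the kernel declares */
unsigned simds           = 20;         /* e.g. a 5870 */
unsigned groups_per_simd = lds_per_simd / lds_per_group;  /* 4 */
unsigned resident_groups = simds * groups_per_simd;       /* 80 */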

The important thing is that you never have more work-groups trying to synchronise than can execute concurrently: the first waves have to finish before any new ones can be issued, so if there is a dependence across that boundary your application will deadlock.

OpenCL hasn't really been designed with this kind of model in mind, although from the point of view of the hardware there's nothing obviously wrong with it, barring a lack of coherent caches to reduce communication latency. You have to be very careful with your use of fences and atomics to ensure that any communication you do perform is correct. Think carefully about whether you really need to do this, or whether you're better off letting the driver handle synchronisation.


@lee... Yup, that's the way to go about it.

Since you require global synchronization, you are probably not going to be satisfied with the hardware scheduler, as it does not expose any such feature. This is not a limitation of the hardware, however: you can launch just enough threads to fill the cores and implement custom queuing using atomic ops.

You can try implementing different queuing schemes. Just make sure that the atomic ops you use for synching are per wavefront and not per thread; synching between threads within a wavefront should be avoided for obvious reasons. A sketch follows below.
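A sketch of one such scheme (the names and the 64-wide wavefront are assumptions): the first lane of each wavefront grabs a chunk of 64 tasks with a single atomic and broadcasts the base index through LDS, so there is one atomic per wavefront rather than one per thread.

// Persistent-threads work queue with per-wavefront atomics. Assumes a
// work-group size of 64, i.e. one wavefront per group. *head must be
// zero-initialized by the host.
__kernel void consume(__global const int *tasks, int num_tasks,
                      volatile __global int *head, __global int *out)
{
    __local int base;                    // broadcast slot in LDS
    for (;;) {
        if (get_local_id(0) == 0)
            base = atomic_add(head, 64); // lane 0 grabs 64 tasks
        barrier(CLK_LOCAL_MEM_FENCE);    // broadcast base to the wave
        if (base >= num_tasks)
            return;                      // queue drained, uniform exit
        int idx = base + (int)get_local_id(0);
        if (idx < num_tasks)
            out[idx] = tasks[idx] * 2;   // illustrative work
        barrier(CLK_LOCAL_MEM_FENCE);    // done with base before reuse
    }
}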

 

Debdatta Basu


A good thought; the atomics are less likely to collide that way. This is partly why my programming advice now always treats a wavefront as a thread, and suggests designing algorithms from that level up, rather than pretending the machine is running 64 threads at once and trying to group them together to avoid divergence between pretend threads.
