
ivalylo
Journeyman III

Global Memory read/write synchronization

Is there any way to write to a global memory buffer in one work-group and read the change in another? I tried with atomic operations, but that does not work if I have more than one work-item per work-group. Can someone explain in simple terms why global synchronization is such a pain? It looks impossible to me to solve a global problem on the GPU. When the problem can't be separated into small independent groups, what is the way to go?

14 Replies
dravisher
Journeyman III

OpenCL does not support global synchronization; the only way to get it (when there is more than one work-group) is to end the current kernel and launch a new one. So the general solution to a global problem is to break it into many small kernels. If most of your data resides in global memory anyway, I think there's a decent chance this will work out well.
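For example, an iterative algorithm can be driven from the host, with each kernel launch acting as the global barrier. A minimal host-side sketch, assuming the queue, kernel, and buffer arguments have already been set up (names are illustrative):

/* Host-side "global synchronization": once one launch completes, the
   next launch sees all of its writes to global memory. */
#include <CL/cl.h>

void run_iterations(cl_command_queue queue, cl_kernel kernel,
                    size_t global_size, size_t local_size, int iterations)
{
    for (int i = 0; i < iterations; ++i) {
        /* Each enqueue is one "globally synchronized" step. */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global_size, &local_size, 0, NULL, NULL);
    }
    /* An in-order queue serializes the launches, so a single clFinish
       at the end is enough. */
    clFinish(queue);
}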

AMD's GPUs do have a global data share, but I haven't read about any plans to add an extension for it.

tanq
Journeyman III

Looks like you need one of the mem_fence functions; look into the OpenCL spec.
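Note, though, that mem_fence only orders the memory operations of the calling work-item; it does not make one work-group wait for another. A kernel-side sketch of what it does give you (names are illustrative):

// mem_fence guarantees this work-item's write to data[] is committed
// before its write to *flag, but provides no cross-group barrier.
__kernel void producer(__global int *data, volatile __global int *flag)
{
    data[get_global_id(0)] = 42;        // write the payload first
    mem_fence(CLK_GLOBAL_MEM_FENCE);    // order: payload before flag
    if (get_global_id(0) == 0)
        *flag = 1;                      // then publish the flag
    // Readers in other work-groups still need atomics/polling; OpenCL
    // 1.x has no barrier that spans work-groups.
}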


MicahVillmow

This is not possible without a global synchronization primitive, which OpenCL does not expose. This can be done with atomics on Evergreen or later chips, but limits the global range to #SIMD work groups.
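A sketch of that technique (illustrative, not production code): the first work-item of every group bumps a global counter and spins until all groups have arrived. This deadlocks unless every participating work-group is resident on the device at the same time, which is where the #SIMD limit comes from.

// Single-use global barrier via atomics (Evergreen or later). *count
// must be zero-initialized by the host before the launch.
void global_barrier(volatile __global int *count, int num_groups)
{
    barrier(CLK_GLOBAL_MEM_FENCE);            // drain this group's writes
    if (get_local_id(0) == 0) {
        atomic_inc(count);                    // signal arrival
        while (atomic_add(count, 0) < num_groups)
            ;                                 // spin until all groups arrive
    }
    barrier(CLK_GLOBAL_MEM_FENCE);            // release the whole group
}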

Originally posted by: MicahVillmow This is not possible without a global synchronization primitive, which OpenCL does not expose. This can be done with atomics on Evergreen or later chips, but limits the global range to #SIMD work groups.


Could you provide some example of this technique? And describe the limits more specifically?



memory pooling



Sorry, something went wrong with the browser.


Originally posted by: MicahVillmow This is not possible without a global synchronization primitive, which OpenCL does not expose. This can be done with atomics on Evergreen or later chips, but limits the global range to #SIMD work groups.


Thanks, it's good to know that it's doable, although not very practical. If we take the 5870 for example, it has 20 SIMD engines, each with 16 thread processors that are 5 floats wide. So I can safely have a global range of 20 work-groups, right?


Well, you can safely have 20 work-groups, so with a work-group size of 64 you get a global size of 20 × 64 = 1280.
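If you don't want to hard-code the 20, the compute-unit count can be queried at runtime; a host-side sketch (the 64 is the wavefront size assumed above, and `device` is an already-obtained cl_device_id):

cl_uint compute_units;
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(compute_units), &compute_units, NULL);
size_t local_size  = 64;                         /* one wavefront */
size_t global_size = compute_units * local_size; /* 20 * 64 = 1280 on a 5870 */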


In fact, what you can do is have as many work-groups as you know can run concurrently (not necessarily in parallel, i.e. you can have more than 20 waves) on the hardware. Unfortunately OpenCL has no way to simply issue exactly that much work (it is a feature I would like to have). You can work it out, though, based on how much LDS you are using per group and how many threads you have per wave. You may find you can have 2 or 4 waves per core, and hence 40 or 80 waves in flight on the device; see the sketch below.
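As a back-of-the-envelope example (the 32 KB of LDS per SIMD is an assumption for this class of hardware, not a queried value):

/* Illustrative occupancy estimate, not a general formula: other limits
   (registers, hardware wave caps) can reduce the real number. */
unsigned lds_per_simd    = 32 * 1024;  /* bytes of LDS per SIMD (assumed) */
unsigned lds_per_group   = 8 * 1024;   /* what the kernel declares */
unsigned simds           = 20;         /* e.g. a 5870 */
unsigned groups_per_simd = lds_per_simd / lds_per_group;  /* 4 */
unsigned resident_groups = simds * groups_per_simd;       /* 80 */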

The important thing is that you never have more work-groups trying to synchronise than can execute concurrently: the first waves have to finish before any new ones can be issued, so if there is a dependence across that boundary your application will deadlock.

OpenCL hasn't really been designed with this kind of model in mind, although from the point of view of the hardware there's nothing obviously wrong with it, barring a lack of coherent caches to reduce communication latency. You have to be very careful with your use of fences and atomics to ensure that any communication you do perform is correct. Think carefully about whether you really need to do this, or whether you're better off letting the driver handle synchronisation.


@lee... Yup, that's the way to go about it.

Since you require global synchronization, you are probably not going to be satisfied with the hardware scheduler, as it does not expose any such feature. This is not a limitation of the hardware, however: you can launch just enough threads to fill the cores and implement custom queuing using atomic ops.

You can try implementing different queuing schemes. Just make sure that the atomic ops you use for synching are per wavefront and not per thread; synching between threads within a wavefront should be avoided for obvious reasons. A sketch follows below.
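A sketch of one such scheme (the names and the 64-wide wavefront are assumptions): the first lane of each wavefront grabs a chunk of 64 tasks with a single atomic and broadcasts the base index through LDS, so there is one atomic per wavefront rather than one per thread.

// Persistent-threads work queue with per-wavefront atomics. Assumes a
// work-group size of 64, i.e. one wavefront per group. *head must be
// zero-initialized by the host.
__kernel void consume(__global const int *tasks, int num_tasks,
                      volatile __global int *head, __global int *out)
{
    __local int base;                    // broadcast slot in LDS
    for (;;) {
        if (get_local_id(0) == 0)
            base = atomic_add(head, 64); // lane 0 grabs 64 tasks
        barrier(CLK_LOCAL_MEM_FENCE);    // broadcast base to the wave
        if (base >= num_tasks)
            return;                      // queue drained, uniform exit
        int idx = base + (int)get_local_id(0);
        if (idx < num_tasks)
            out[idx] = tasks[idx] * 2;   // illustrative work
        barrier(CLK_LOCAL_MEM_FENCE);    // done with base before reuse
    }
}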

 

Debdatta Basu


A good thought; the atomics are less likely to collide that way. This is partly why my programming advice now always treats a wavefront as a thread, and suggests designing algorithms from that level up, rather than pretending the machine is running 64 threads at once and trying to group them together to avoid divergence between pretend threads.
