
genaganna
Journeyman III

How to implement global synchronization?

If you use a single work group, you are leaving 90% of the GPU's resources unused.

What are your matrix sizes?

Fuxianjun
Journeyman III

Originally posted by: genaganna If you use single work group, you are not using 90% GPU resources.

What are your matrix sizes?

Any size, but not too big. I measured it, and two kernels are indeed much slower than one. I have to implement the whole algorithm in one kernel, even though that wastes a lot of resources, because transferring data from the CPU to the GPU takes too long.

notyou
Adept III

Originally posted by: genaganna If you use single work group, you are not using 90% GPU resources.


I ran into the same problem as the OP. I can either create multiple work groups and synchronize globally using something like clFinish(), or I can use a single work group and have each thread perform multiple iterations (which, for large sizes, I don't think will give as good ALU utilization). So, are there any plans to support global synchronization across work groups in the future (even if it's AMD specific)?

AM_902
Journeyman III

Hi,

I am trying to implement a 1D FFT algorithm (out-of-place, radix-2 decimation-in-time) using a single kernel, but I am running into the global synchronization problem between workgroups. It appears once the input size becomes larger than what can be handled by the maximum number of workgroups that can execute concurrently on the underlying hardware, given the resource usage of our particular kernel (i.e. 16 registers and a workgroup size of 256).

Is there a solution for global synchronization available in OpenCL 1.1?

zeland
Journeyman III

As far as I know, the only way is through clFinish().

LeeHowes
Staff

It is possible to do global synchronization if you're very careful with the use of atomics, make the global pointers volatile and use fences in the appropriate places. However, because there is no clean way to simply fill the machine with wavefronts in the current version of OpenCL you have to be very careful doing it.

If you can compute the total number of wavefronts that will fit on the device you could do it. If you cannot do this you will need to create global counters to see how many waves are concurrent on the device. Remember that you do not under any circumstances want to try to synchronize the entire dispatch because much of the dispatch will not be executing until the early waves have completed.
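A rough way to see why the dispatch size matters is a toy, single-threaded model in plain C (not OpenCL code, and not a measurement of any real device). Here `resident_slots` is a hypothetical stand-in for the number of work groups the device can hold at once, and `counter` stands in for the global atomic counter that each group would increment on arrival.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_GROUPS 64

/* Toy model of a counter-based global barrier.
 * Returns true if every group gets past the barrier, false on deadlock.
 * resident_slots is an assumed figure, not something OpenCL reports directly. */
bool simulate_global_barrier(int num_groups, int resident_slots)
{
    enum { NOT_STARTED, WAITING, DONE } state[MAX_GROUPS];
    int counter = 0;  /* models the global atomic arrival counter */
    int running = 0;  /* groups currently occupying a hardware slot */
    int done = 0;

    if (num_groups <= 0 || num_groups > MAX_GROUPS || resident_slots <= 0)
        return false;
    for (int g = 0; g < num_groups; g++)
        state[g] = NOT_STARTED;

    for (;;) {
        bool progress = false;
        for (int g = 0; g < num_groups; g++) {
            if (state[g] == NOT_STARTED && running < resident_slots) {
                /* Group is scheduled, does its first phase, arrives at the barrier. */
                state[g] = WAITING;
                running++;
                counter++;
                progress = true;
            } else if (state[g] == WAITING && counter == num_groups) {
                /* Barrier released: group finishes and frees its slot. */
                state[g] = DONE;
                running--;
                done++;
                progress = true;
            }
        }
        if (done == num_groups)
            return true;
        if (!progress)
            return false;  /* resident groups spin forever; the rest never start */
    }
}
```

In this model, `simulate_global_barrier(8, 8)` completes, while `simulate_global_barrier(16, 8)` deadlocks: the eight resident groups wait on a count of 16 that the eight unscheduled groups can never contribute to.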

For most algorithms, such as a matrix multiply or FFT, the overhead of a second dispatch is minimal because the work groups will read data and then write data to global memory anyway. The loop you would put inside the kernel to do global synchronization would be as much overhead as letting the hardware perform that loop for you. The dispatches themselves will amortize within the total execution time and not add much. The only time when there is an obvious advantage to doing global synchronization instead of a second dispatch is if it gives you a benefit in maintaining internal state such that you do not have to write intermediate data out, saving you global memory traffic overall. Think very carefully about whether it makes sense to use this approach over using large dispatches.
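As an illustration of the multi-dispatch structure, here is a minimal host-side sketch in plain C, assuming a Stockham-style radix-2 FFT (a swapped-in variant chosen because it is naturally out-of-place, not necessarily the decimation-in-time formulation discussed above). Each call to `fft_stage` stands in for one kernel dispatch, each loop iteration inside it for one work-item, and returning from the call plays the role of the global synchronization. The names `cpx`, `fft_stage`, `fft_multi_dispatch`, and the caller-supplied twiddle table `tw` are all hypothetical.

```c
#include <assert.h>

typedef struct { double re, im; } cpx;

static cpx cadd(cpx a, cpx b) { cpx r = { a.re + b.re, a.im + b.im }; return r; }
static cpx csub(cpx a, cpx b) { cpx r = { a.re - b.re, a.im - b.im }; return r; }
static cpx cmul(cpx a, cpx b)
{
    cpx r = { a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
    return r;
}

/* One "dispatch": N/2 independent butterflies, reading `in` and writing `out`.
 * tw[k] must hold exp(-2*pi*i*k/N) for k = 0 .. N/2 - 1.
 * n is the current sub-transform length, s the stride; n * s == N. */
static void fft_stage(const cpx *in, cpx *out, const cpx *tw, int N, int n, int s)
{
    int m = n / 2;
    for (int i = 0; i < N / 2; i++) {  /* i plays the role of get_global_id(0) */
        int p = i / s, q = i % s;
        cpx a = in[q + s * p];
        cpx b = in[q + s * (p + m)];
        cpx w = tw[p * s];             /* exp(-2*pi*i*p/n), since N/n == s */
        out[q + s * (2 * p)]     = cadd(a, b);
        out[q + s * (2 * p + 1)] = cmul(csub(a, b), w);
    }
}

/* Runs log2(N) stages, ping-ponging between two buffers.
 * N must be a power of two. Returns the buffer that holds the result. */
cpx *fft_multi_dispatch(cpx *buf_a, cpx *buf_b, const cpx *tw, int N)
{
    cpx *src = buf_a, *dst = buf_b;
    for (int n = N, s = 1; n > 1; n /= 2, s *= 2) {
        fft_stage(src, dst, tw, N, n, s);  /* stage boundary = global sync */
        cpx *t = src; src = dst; dst = t;
    }
    return src;
}
```

Every stage reads its whole input from one buffer and writes its whole output to the other, which is the global memory traffic described above. An in-kernel barrier would only pay off if it let the kernel skip some of those intermediate writes.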

If you are seeing time savings from a single kernel over two kernels, is this because each kernel is only a single wavefront? (Given that you haven't implemented global synchronization, I can only guess that this is what you're testing.) In that case the overhead of dispatches would be very high for such a small amount of work.

AM_902
Journeyman III

Originally posted by: LeeHowes

It is possible to do global synchronization if you're very careful with the use of atomics, make the global pointers volatile and use fences in the appropriate places.



Please point me to where I can get more information about "volatile global pointers". I just want to investigate the single-kernel approach for the sake of knowledge.

Originally posted by: LeeHowes If you are seeing time savings from a single kernel over two kernels, is this because each kernel is only a single wavefront (given that you haven't implemented global synchronization I can only guess that this is what you're testing) in which case the overhead of dispatches would be very high for a small amount of work.

Yes, this is the case. For small sizes that fit into a single workgroup, the single kernel performs better than launching two kernels.

LeeHowes
Staff

I just mean that you need to make sure that pointers are declared volatile (__global volatile int *blah) so that the compiler expects that other work items may change the data behind that pointer, rather than assuming it can keep the data in registers.
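A small plain-C analog of that declaration, as a sketch only: in OpenCL C the parameter would also carry the `__global` qualifier, and the helper name `spin_until_equals` and its `max_spins` bound are made up for illustration. The point is that `volatile` forces a fresh load on every pass of the loop. It does not by itself provide atomicity or ordering, which is why atomics and fences are still needed for an actual barrier.

```c
#include <assert.h>
#include <stdbool.h>

/* Polls *flag until it equals `expected` or max_spins reads have been made.
 * Without volatile, the compiler could legally load *flag once, keep it in a
 * register, and never observe a value written by another work item. */
bool spin_until_equals(const volatile int *flag, int expected, int max_spins)
{
    for (int i = 0; i < max_spins; i++) {
        if (*flag == expected)  /* re-read from memory on every iteration */
            return true;
    }
    return false;  /* gave up: value was never observed */
}
```

Bounding the spin count is a design choice for the sketch, so that a missing writer shows up as a `false` return instead of a hang.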

For small sizes that fit in a single workgroup, most of your execution time is going to be in the runtime anyway, so it's bound to go faster using one kernel than two. Don't assume that this has a meaningful relationship to running a much larger dataset with hundreds of workgroups. It would depend on how long your kernel runs and what the overhead of a dispatch is once a kernel is already in the queue.
