For small patches of cloth animation, 2D fluid computation, or even some reduction techniques, will there be an option to make __local variables persist until the next kernel execution, or at least across repeated runs of the same kernel, without touching global memory?
Basically, I want this as a workaround for the ping-pong technique, so that everything is done purely on the GPU.
I'm an OpenCL beginner. I have already written some n-body and 2D fluid programs in Java that use the GPU via JOCL, and I'm looking for new optimizations.
For example, how would the performance of a 2D fluid computation change if half of the local memory were dedicated to this kind of communication between kernel executions? Would it drop because of poor utilization/occupancy, or would it increase because global memory fetches are reduced by a good margin? My card is an HD 7870 @ (1100/1200).
It may sound weird, but if I have 1280 cores and n <= 1280 planets (64 are computed per compute unit and updated by broadcasting), can I do n-body calculations with this optimization using the full potential of 2.5 TFLOPS? Otherwise it doesn't put enough load on my GPU.
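To put rough numbers on the load problem, here is a back-of-the-envelope sketch in Java. It assumes 2 FLOPs per core per clock (one fused multiply-add), a 1000 MHz core clock to match the ~2.5 TFLOPS figure, and about 20 FLOPs per pairwise interaction. All three values, and the class and method names, are my assumptions for illustration, not measurements.

```java
public class NBodyLoadEstimate {
    // Theoretical peak, assuming each core retires one fused multiply-add
    // (counted as 2 FLOPs) per clock cycle.
    static double peakFlops(int cores, double clockHz) {
        return cores * 2.0 * clockHz;
    }

    // Brute-force n-body: every body interacts with every body once per step.
    static long interactionsPerStep(int n) {
        return (long) n * n;
    }

    // Ideal time for one step if the GPU ran at its theoretical peak.
    static double idealStepSeconds(int n, int flopsPerInteraction, double peak) {
        return interactionsPerStep(n) * (double) flopsPerInteraction / peak;
    }

    public static void main(String[] args) {
        int cores = 1280;               // stream processors, from the post
        double clockHz = 1.0e9;         // assumed core clock
        int flopsPerInteraction = 20;   // assumed cost of one pairwise force
        int n = 1280;                   // one planet per core

        double peak = peakFlops(cores, clockHz);
        double seconds = idealStepSeconds(n, flopsPerInteraction, peak);

        System.out.printf("peak: %.2f TFLOPS%n", peak / 1e12);
        System.out.printf("interactions per step: %d%n", interactionsPerStep(n));
        System.out.printf("ideal step time: %.1f microseconds%n", seconds * 1e6);
    }
}
```

Under these assumptions one step is only on the order of ten microseconds of arithmetic, so I suspect the cost of launching a kernel for every step dominates. That is why I would like to keep the data in __local memory and run many steps per launch.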
Maybe multiplication of pre-cached matrices can be another example?
Thanks for your time.