Can several levels of this reduction be produced in one kernel call? Any suggestions?

I have one big 2D array.

The next 2D array is formed by adding adjacent elements along one of its axes (say, the Y axis). The array just produced is then used as the source for the next iteration.

That is, M*N -> M*(N/2) -> M*(N/4), and so on.

Currently I call the kernel once for each level of this addition.
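To make the current scheme concrete, here is a minimal CPU-side sketch in Python. It is not the actual kernel: the names `halve_along_y` and `build_levels` are made up for illustration, and I assume the array is stored as M rows of N values with Y running along each row.

```python
def halve_along_y(a):
    """Stand-in for one kernel call: add adjacent pairs along Y (M x N -> M x N/2)."""
    return [[row[2 * j] + row[2 * j + 1] for j in range(len(row) // 2)]
            for row in a]


def build_levels(a, count):
    """One pass per level; every pass re-reads the output of the previous one."""
    levels = []
    for _ in range(count):
        a = halve_along_y(a)
        levels.append(a)
    return levels


source = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
          [8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]]
levels = build_levels(source, 2)
```

Each call to `halve_along_y` corresponds to one kernel launch, so two levels cost two launches and two full reads of the data.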

That means the data has to be loaded into registers again at the start of every kernel call.

Is there some way to save on reloading the data (and on the extra kernel calls) by producing several 2D arrays from the initial one in a single kernel call?

The problem is that these output arrays would have different sizes along the Y dimension.
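A rough Python sketch of what such a fused pass would compute, again on the CPU and only as an illustration. The function name `fused_two_levels` is hypothetical, and I assume N is divisible by 4 and that Y runs along each row.

```python
def fused_two_levels(a):
    """Produce level 1 (M x N/2) and level 2 (M x N/4) from a single read of the input."""
    level1, level2 = [], []
    for row in a:
        out1, out2 = [], []
        for j in range(len(row) // 4):
            # One load of four adjacent inputs feeds both outputs.
            x0, x1, x2, x3 = row[4 * j: 4 * j + 4]
            s0 = x0 + x1
            s1 = x2 + x3
            out1.extend([s0, s1])   # two values go to the first output
            out2.append(s0 + s1)    # one value goes to the second output
        level1.append(out1)
        level2.append(out2)
    return level1, level2


source = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
          [8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]]
level1, level2 = fused_two_levels(source)
```

The inner loop body plays the role of one work item: it writes two values to the first output but only one to the second, which is exactly the size mismatch described above.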

If I use a scatter stream, performance will degrade considerably (by more than I would gain from producing 2 output arrays in one kernel call).

It could be possible if the original array were of type float8 (for example), with the first output array also float8 and the second output array float4.

That way both output arrays would have the same dimensions, Mx(N/2) elements.

But of course such type changes would complicate the algorithm considerably.
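Here is how I picture the packed variant, as a CPU-side Python sketch with tuples standing in for float8 and float4. The names `add_pairs` and `fused_packed` are hypothetical, and I assume each row holds an even number N of 8-wide elements.

```python
def add_pairs(values):
    """Add adjacent scalars: a tuple of 2k values becomes a tuple of k values."""
    return tuple(values[2 * k] + values[2 * k + 1] for k in range(len(values) // 2))


def fused_packed(a8):
    """a8: M rows of N 8-tuples. Returns an 8-wide and a 4-wide output, both M x (N/2)."""
    out8, out4 = [], []
    for row in a8:
        r8, r4 = [], []
        for j in range(len(row) // 2):
            wide = row[2 * j] + row[2 * j + 1]  # two 8-wide inputs, 16 scalars
            first = add_pairs(wide)              # level 1: 8 scalars (float8)
            second = add_pairs(first)            # level 2: 4 scalars (float4)
            r8.append(first)
            r4.append(second)
        out8.append(r8)
        out4.append(r4)
    return out8, out4


# One row of four 8-wide elements holding the scalars 0.0 .. 31.0.
row = [tuple(float(8 * i + k) for k in range(8)) for i in range(4)]
out8, out4 = fused_packed([row])
```

Both outputs end up with the same number of elements per row, so each work item writes exactly one element to each output and no scatter is needed. The price is that every further level changes the element width again.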

Are there other ways to perform the same task?
