Could you clarify what you mean by copying the data from 1024 work-items into a private float array? In OpenCL, private memory is by definition visible only to the work-item that owns it, so no other work-item can read it. You can use local memory to share results between work-items, but only among work-items in the same work-group. Perhaps you could set the work-group size to 1024 work-items and use local memory, if your GPU supports a work-group that large?
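To illustrate the local-memory route, here is a minimal sketch of a kernel in which every work-item publishes its private value to a local array and the group then combines them. The kernel name `group_sum` and its arguments are made up for the example, and it assumes a one-dimensional launch whose local size is a power of two (for example 1024) and whose global size is a multiple of the local size.

```c
// Hypothetical kernel: each work-group sums its own slice of `in`.
// Assumptions: 1D launch, local size is a power of two,
// global size is a multiple of the local size.
__kernel void group_sum(__global const float *in,
                        __global float *group_totals,
                        __local float *scratch)
{
    const size_t lid = get_local_id(0);
    const size_t gid = get_global_id(0);

    // Copy this work-item's value into local memory, where the
    // other work-items of the same group can read it.
    scratch[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction: halve the number of active work-items each step.
    for (size_t stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // One work-item per group writes that group's total.
    if (lid == 0)
        group_totals[get_group_id(0)] = scratch[0];
}
```

On the host, the `__local` argument would be sized with something like `clSetKernelArg(kernel, 2, local_size * sizeof(float), NULL)`. Whether 1024 is allowed can be checked by querying `CL_DEVICE_MAX_WORK_GROUP_SIZE` with `clGetDeviceInfo` and `CL_KERNEL_WORK_GROUP_SIZE` with `clGetKernelWorkGroupInfo`. The totals from different groups still have to be combined in a second pass or on the host.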
You can't create a 'group of groups'. However, you can launch your kernel with a two- or three-dimensional range, which gives you X * Y or X * Y * Z work-groups respectively.
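As a rough sketch of what a two-dimensional launch looks like from inside the kernel, the example below reads the work-group's coordinates and flattens them into a single index. The kernel name `group_coords` and the output buffer are hypothetical, and the buffer is assumed to hold at least `get_num_groups(0) * get_num_groups(1)` entries.

```c
// Hypothetical kernel for a 2D launch: records a flat index per work-group.
// Assumption: `flat_group_index` has room for X * Y entries, where
// X = get_num_groups(0) and Y = get_num_groups(1).
__kernel void group_coords(__global uint *flat_group_index)
{
    const size_t gx = get_group_id(0);        // group column, 0 .. X-1
    const size_t gy = get_group_id(1);        // group row,    0 .. Y-1
    const size_t groups_x = get_num_groups(0);

    // Row-major flattening of the 2D group coordinates.
    const size_t flat = gy * groups_x + gx;

    // Only the first work-item of each group writes the result.
    if (get_local_id(0) == 0 && get_local_id(1) == 0)
        flat_group_index[flat] = (uint)flat;
}
```

The number of groups in each dimension is the global size divided by the local size passed to `clEnqueueNDRangeKernel` with `work_dim` set to 2. The extra dimensions only change how groups are indexed: work-items in different groups still cannot share local memory.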