In my experience, the most effective layout (though usually not the most convenient) is to pack the same array index for all work items into consecutive memory addresses, and to stride successive indices by the global work size.
So, if you have 4096 work items, each copying out 1024 bytes, you get 4 KiB of value[0], then 4 KiB of value[1], and so on.
This way, neighbouring work items in a wavefront write to neighbouring addresses, so each wavefront generates extremely efficient packed writes.
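To make the addressing concrete, here is a minimal sketch of the index arithmetic, done on the host side in Python rather than in a kernel. The function names `packed_index` and `naive_index` are made up for illustration, and I'm assuming one-byte elements so the numbers match the example above. In a kernel, `work_item` would be what `get_global_id(0)` returns and `global_size` what `get_global_size(0)` returns.

```python
GLOBAL_SIZE = 4096        # number of work items (from the example above)
ELEMENTS_PER_ITEM = 1024  # bytes each work item copies out


def packed_index(work_item: int, element: int, global_size: int) -> int:
    """Interleaved layout: element k of every work item is stored together."""
    return element * global_size + work_item


def naive_index(work_item: int, element: int, elements_per_item: int) -> int:
    """Straightforward layout: each work item owns one contiguous chunk."""
    return work_item * elements_per_item + element


# Offsets written by the first four work items when they all store element 0.
packed = [packed_index(w, 0, GLOBAL_SIZE) for w in range(4)]
naive = [naive_index(w, 0, ELEMENTS_PER_ITEM) for w in range(4)]

print(packed)  # [0, 1, 2, 3]           adjacent addresses
print(naive)   # [0, 1024, 2048, 3072]  1 KiB apart

# Element 1 of work item 0 starts right after the 4 KiB block of element 0.
print(packed_index(0, 1, GLOBAL_SIZE))  # 4096
```

With the packed layout the work items that execute together touch one contiguous range per store, whereas the naive layout scatters the same store across addresses 1 KiB apart. The price is that each work item's own data is no longer contiguous, which is why this is usually the less convenient option.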