I've tried writing some test code that uses the LDS, since I think it is an interesting feature. However, it looks like using scatter for data output is usually slower than rewriting the algorithm to use multiple streaming kernels.
When the GroupSize attribute is used, the Brook+ compiler doesn't allow stream outputs, only scatter ones. Is this a hardware or a software limitation? Will it change in the future?
I think that even if the code scatters to sequential array positions, the write is unbuffered, so it is really slow, and if the computation is small this soon becomes a bottleneck. Also, unlike the other samples, the LDS tutorial in the Brook+ directory doesn't have a benchmark option to compare against the CPU.
This is a hardware limitation. The LDS can be used only in compute shader mode, which only allows scatter streams.
When you use a 2D scatter stream, the Brook+ implementation of scatter copies the data from linear memory to tiled memory. You can avoid that copy by using 1D scatter streams.
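If your data is logically 2D, one way to keep a 1D scatter stream is to flatten the coordinates yourself. The sketch below is plain host-side C, not Brook+ code, and it does not model tiling or performance. It only shows the row-major index arithmetic you would use to address a 1D output. The names flat_index and scatter_1d are hypothetical and do not come from the Brook+ SDK.

```c
#include <assert.h>
#include <stddef.h>

/* Map a 2D coordinate (row, col) in a row-major grid that is `width`
   columns wide to a position in a 1D buffer. Hypothetical helper. */
static size_t flat_index(size_t row, size_t col, size_t width)
{
    assert(col < width); /* the column must lie inside the row */
    return row * width + col;
}

/* Host-side stand-in for a kernel that writes its results into a 1D
   output buffer. Each element is doubled just to have some work to do. */
static void scatter_1d(float *out, const float *in,
                       size_t height, size_t width)
{
    for (size_t row = 0; row < height; ++row) {
        for (size_t col = 0; col < width; ++col) {
            size_t i = flat_index(row, col, width);
            out[i] = in[i] * 2.0f;
        }
    }
}
```

In a real kernel the same arithmetic would be applied to the thread's position to compute the 1D output address, assuming the output is declared as a 1D scatter stream.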