I've tried to write some test code using the LDS as I think is an interesting feature, however looks like using scatter for data output usually is slower than rewriting the algorithm to use multiple streaming kernels.
While using GroupSize attribute Brook+ compiler doesn't allow stream outputs, only scatter ones. Is this a hardware or a software limitation? Will it change in the future?
I think that even if the code scatters to sequential array positions the write is unbuffered, so it is really slow and if the computation is small this becomes soon a bottleneck. Also, contrary to other samples, LDS tutorial in Brook+ directory doesn't have a benchmark option to compare with CPU.