Since Brook+ does not support local arrays, I had to try a number of things to get my code to work, all of which have been rather frustrating. It seems that my problem could be solved easily if Brook+ had either of the following features:
1. allow the creation of local arrays, or
2. allow the expansion of the dimensionality of a stream (i.e. the reverse of reduce)
To try to get around this problem, I tried two suggestions offered in the forum:
1. Unrolled the loops explicitly: This worked for very small problems, but certainly isn't very satisfactory. Because my problem also creates additional data per stream, I had to define a struct to move all the data out. This also will not work for larger problems.
2. Micah Villmow suggested using global memory for local buffers and said that this should not impact performance. I packed everything into one long scatter stream (since Brook+ only permits one), but found that the size limit for a 1d stream was 8192 (anything beyond this returns garbage, which took me a day to figure out!). Then I converted the buffer to a 2d scatter stream. That worked, but the performance seems to take a hit compared to method #1 (presumably because of memory congestion when there are lots of reads/writes).
3. Lastly, one very peculiar thing. Since I store everything in global memory, I figured I only needed an input stream, which defines the size, and one scatter stream. I was surprised to find that the code ran much slower than method #1. But by accident, I discovered that if I add an output stream (with the same size as the input stream) to the kernel call, the code speeds up tremendously. The output stream is never written to. Why does this cause any speed difference? Would anyone be able to shed some light on this?
4. Finally, when using a 2d scatter stream as a local buffer, the ordering of the indices makes a performance difference (though this is not unexpected).
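For anyone hitting the same 8192 limit, here is a rough sketch in plain C (not Brook+ kernel code) of the index arithmetic involved in folding a long 1d buffer into a 2d one, plus the two obvious ways of ordering a per-thread scratch buffer. The names MAX_STREAM_WIDTH, Index2D, linear_to_2d, rows_needed, slot_thread_major and slot_slot_major are made up for illustration; only the 8192 figure comes from the observation above. Which ordering is faster on a given card is not something this sketch can tell you, so it still has to be measured.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed per-dimension limit, taken from the 8192 figure observed above. */
#define MAX_STREAM_WIDTH 8192

/* Hypothetical helper type: x is the column, y is the row. */
typedef struct {
    size_t x;
    size_t y;
} Index2D;

/* Fold a linear offset into a 2d buffer that is `width` elements wide. */
Index2D linear_to_2d(size_t linear, size_t width)
{
    Index2D idx;
    assert(width > 0 && width <= MAX_STREAM_WIDTH);
    idx.x = linear % width;  /* position within the row */
    idx.y = linear / width;  /* which row */
    return idx;
}

/* Number of rows needed to hold `total` elements at the given width. */
size_t rows_needed(size_t total, size_t width)
{
    assert(width > 0 && width <= MAX_STREAM_WIDTH);
    return (total + width - 1) / width;  /* round up */
}

/* Ordering A: all scratch slots of one thread sit next to each other. */
size_t slot_thread_major(size_t thread, size_t slot, size_t slots_per_thread)
{
    assert(slot < slots_per_thread);
    return thread * slots_per_thread + slot;
}

/* Ordering B: slot k of every thread sits next to slot k of its neighbours. */
size_t slot_slot_major(size_t thread, size_t slot, size_t num_threads)
{
    assert(thread < num_threads);
    return slot * num_threads + thread;
}
```

A linear offset from either ordering can then be passed through linear_to_2d to get the 2d position to scatter to, e.g. linear_to_2d(slot_slot_major(t, k, n), 8192).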
Conclusion: A lot of the performance aspects of Brook+ are non-intuitive and hard to understand. Tuning by trial-and-error is frustrating, not to mention the time I had to spend trying to figure out the undocumented limitations and/or bugs of Brook+.