There were a couple of things we did in Brook+ that greatly improved performance, and I would like some feedback on the OpenCL analogue (if any). Please note that everything below is based on experience with double-precision (DP) "double" data:
1- We used to have the output in the stream form "double out<>" rather than the scatter form "double out[idx]". Performance did not degrade if some of the RHS inputs were gather streams. The closest thing I found in OpenCL was image objects, but these only offer four 32-bit .xyzw components in a 128-bit word. In Brook, this worked great in double precision.
The Brook scatter output "out[idx] = ..." was horribly slow, which leads me to think this is the current mode of OpenCL output.
I did try a simple "out[global_id] = in[global_id]" kernel in both Brook and OpenCL, and as far as I can tell neither showed any significant improvement compared to a Brook "out<> = in<>" kernel.
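For reference, the copy test above can be sketched roughly as follows. This is only my reconstruction, not the exact code I ran: the kernel name copy_dp and the helper copy_dp_reference are made up for illustration, and the kernel assumes the device exposes the cl_khr_fp64 extension. The C function is a CPU stand-in in which each loop iteration plays the role of one work-item.

```c
#include <stddef.h>

/* OpenCL C source for the 1D copy kernel (name "copy_dp" is hypothetical).
   Assumes the device supports the cl_khr_fp64 extension for double. */
const char *const copy_dp_src =
    "#pragma OPENCL EXTENSION cl_khr_fp64 : enable\n"
    "__kernel void copy_dp(__global const double *in,\n"
    "                      __global double *out)\n"
    "{\n"
    "    size_t gid = get_global_id(0);\n"
    "    out[gid] = in[gid];\n"
    "}\n";

/* CPU reference: iteration gid does what work-item gid would do. */
void copy_dp_reference(const double *in, double *out, size_t n)
{
    for (size_t gid = 0; gid < n; ++gid)
        out[gid] = in[gid];
}
```

Note that every work-item computes its own output address here, which is what makes it look like the Brook scatter form even though the access pattern is perfectly linear.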
2- We used to have 2D streams, again in double precision. This gave up to a 5x improvement in some cases. It was not clear at the time whether this was because the data was laid out in memory in 2D form, or because the threads were arranged in 2D form. In this mode of Brook streams, there was a one-to-one analogy between thread indices and data indices. Again, this was the mode that did not need thread IDs and did not have scatter outputs: "out<dim1, dim2>". I tried 2D threads in OpenCL, and it did not seem to improve anything.
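To make the 2D attempt concrete, here is a minimal sketch of what "2D threads" over an OpenCL buffer looks like. The names copy2d_dp, linear_index and copy2d_dp_reference are hypothetical, the row-major layout is my assumption, and the kernel again assumes cl_khr_fp64. The point it illustrates is that with a plain buffer only the work-item arrangement becomes 2D; the data is still one linear array addressed by y * width + x.

```c
#include <stddef.h>

/* OpenCL C source for a 2D-range copy (name "copy2d_dp" is hypothetical).
   Assumes cl_khr_fp64 and a row-major buffer of width * height doubles. */
const char *const copy2d_dp_src =
    "#pragma OPENCL EXTENSION cl_khr_fp64 : enable\n"
    "__kernel void copy2d_dp(__global const double *in,\n"
    "                        __global double *out,\n"
    "                        const uint width)\n"
    "{\n"
    "    size_t x = get_global_id(0);\n"
    "    size_t y = get_global_id(1);\n"
    "    size_t idx = y * width + x;\n"
    "    out[idx] = in[idx];\n"
    "}\n";

/* Row-major mapping from a 2D work-item index to a linear buffer index. */
size_t linear_index(size_t x, size_t y, size_t width)
{
    return y * width + x;
}

/* CPU reference: the two loops stand in for the 2D global range. */
void copy2d_dp_reference(const double *in, double *out,
                         size_t width, size_t height)
{
    for (size_t y = 0; y < height; ++y) {
        for (size_t x = 0; x < width; ++x) {
            size_t idx = linear_index(x, y, width);
            out[idx] = in[idx];
        }
    }
}
```

If the Brook 2D speedup came from the memory layout rather than the thread layout, a kernel like this would not be expected to reproduce it, which would be consistent with what I saw.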
The CAL and OpenCL documentation, when it talks about memory access patterns such as tiled patterns, does not state the size of the data elements (32-bit or 64-bit). My guess is that it means 32-bit, which would not explain any speedup for DP.
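The size arithmetic behind this question can be spelled out in a few lines of plain C. This is only an illustration under my own assumptions: the type texel128 and the pack/unpack helpers are hypothetical names, and the sketch just shows that a 128-bit word of four 32-bit components has exactly the room for two 64-bit doubles. In OpenCL C the reinterpretation built-ins (as_uint4, as_double2) would play the role of memcpy; whether packing doubles this way recovers anything like the Brook behaviour is something I have not tested.

```c
#include <stdint.h>
#include <string.h>

/* One 128-bit word viewed as four 32-bit components (.x .y .z .w).
   The type name is hypothetical, used here only for illustration. */
typedef struct {
    uint32_t c[4];
} texel128;

/* Store two doubles bit-for-bit in the four 32-bit components. */
texel128 pack_double2(double a, double b)
{
    double pair[2] = {a, b};
    texel128 t;
    memcpy(t.c, pair, sizeof pair);
    return t;
}

/* Recover the two doubles from the four 32-bit components. */
void unpack_double2(texel128 t, double *a, double *b)
{
    double pair[2];
    memcpy(pair, t.c, sizeof pair);
    *a = pair[0];
    *b = pair[1];
}
```

So an access pattern documented in units of 32-bit elements would cover only half a double per element, which is why the element size matters for DP.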
I hope the documentation will give detailed explanations of DP data arrangements. I think DP is what all numerical packages use.