Ahh the bad old days...
OpenCL has no equivalent of "out <>." Every kernel is a scatter kernel, even when you write to images. As long as you're coalescing global writes, this shouldn't impact performance. By the way, you can store doubles in images: put the top and bottom 32 bits of the double in two different channels of the image, then reassemble them in the kernel with as_double().
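To make the bit-splitting concrete, here is a minimal host-side sketch in plain C. The names texel2_t, pack_double and unpack_double are made up for illustration, and which channel holds which half is just a convention I picked. In a kernel you would presumably read the two channels with something like read_imageui and pass the resulting uint2 to as_double(); the memcpy below plays the role of as_double() on the host.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a texel with two 32-bit unsigned channels. */
typedef struct {
    uint32_t lo;  /* bottom 32 bits of the double */
    uint32_t hi;  /* top 32 bits of the double */
} texel2_t;

/* Split the bit pattern of a double across the two channels. */
static texel2_t pack_double(double value)
{
    uint64_t bits;
    texel2_t t;

    /* The trick only works if double is 64 bits wide. */
    assert(sizeof(double) == sizeof(uint64_t));

    memcpy(&bits, &value, sizeof bits);  /* reinterpret, do not convert */
    t.lo = (uint32_t)(bits & 0xFFFFFFFFu);
    t.hi = (uint32_t)(bits >> 32);
    return t;
}

/* Reassemble the double from the two channels (host analogue of as_double). */
static double unpack_double(texel2_t t)
{
    uint64_t bits = ((uint64_t)t.hi << 32) | (uint64_t)t.lo;
    double value;

    memcpy(&value, &bits, sizeof value);
    return value;
}
```

Packing and unpacking is a pure bit copy, so unpack_double(pack_double(x)) gives back exactly the same bits, including for negative zero and NaN payloads.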
I suspect the reason 2D streams were faster in Brook+ was that 1D streams were emulated using 2D streams. This required some modulo and division arithmetic (or, at best, ANDs and bit shifts) to get the 2D index from a 1D index. The dimensionality of threads in OpenCL shouldn't really affect performance so long as you have 64 threads per block. So 8*8 should perform as well as 64. Threads in OpenCL are really just a lightweight index, and you just need to launch enough of them to fill a wavefront.
Tiling memory accesses applies to both single and double precision; you want neighboring threads reading neighboring addresses. It doesn't matter whether they're float, double, or int4. However, to truly maximize bandwidth, you need to use 128-bit data types.
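For reference, the index arithmetic in question looks roughly like this. It is a plain C sketch with made-up names (index2d_t, unflatten_divmod, unflatten_pow2), not what Brook+ actually generated. The first version works for any width; the second assumes the width is a power of two, so the division and modulo collapse into a shift and a mask.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 2D coordinate pair. */
typedef struct {
    uint32_t x;  /* column */
    uint32_t y;  /* row */
} index2d_t;

/* General case: one modulo and one division per lookup. */
static index2d_t unflatten_divmod(uint32_t i, uint32_t width)
{
    index2d_t p;

    assert(width > 0);
    p.x = i % width;
    p.y = i / width;
    return p;
}

/* Power-of-two width (width == 1 << log2_width): mask and shift only. */
static index2d_t unflatten_pow2(uint32_t i, uint32_t log2_width)
{
    index2d_t p;

    assert(log2_width < 32);
    p.x = i & ((1u << log2_width) - 1u);
    p.y = i >> log2_width;
    return p;
}
```

With a width of 64 (log2_width of 6), both versions map the 1D index 130 to x = 2, y = 2.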
Thanks, Rick, for pointing out as_double(). Sounds interesting, and I should give it a try.
Somehow I tend to disagree about the scatter outputs. CAL also has this kind of non-scatter stream output, and if CAL is still the backbone for OpenCL (correct?), then there must be a way to direct the compiler to use it?
As for 2D indices, I believe shader memory (or rather global memory) is 1D, so 2D should be slower (and 2D actually is slower in OpenCL).
Can someone from AMD confirm?
I don't think you can get OpenCL to use the "o" register in CAL, since all textures and memory objects are inherently "scatter" in OpenCL.
Since global memory is inherently 1D, you are correct. Any higher dimensional indexing must be emulated using 1D indexing. In fact, C's support for multi-dimensional arrays is pretty abysmal so you'll usually write the indexing yourself.
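Writing the indexing yourself usually comes down to a row-major offset calculation. Here is a minimal sketch in plain C; flat_index and fill_matrix are made-up names, and the same expression would work inside a kernel on a global pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major offset of element (row, col) in a matrix with `cols` columns. */
static size_t flat_index(size_t row, size_t col, size_t cols)
{
    assert(col < cols);
    return row * cols + col;
}

/* Fill a rows x cols matrix, stored in a flat buffer, so that
   element (r, c) holds the value 10 * r + c. */
static void fill_matrix(float *data, size_t rows, size_t cols)
{
    size_t r, c;

    for (r = 0; r < rows; ++r)
        for (c = 0; c < cols; ++c)
            data[flat_index(r, c, cols)] = (float)(10 * r + c);
}
```

Neighboring values of col map to neighboring addresses, so letting the fastest-varying thread index drive col keeps the accesses contiguous.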
OpenCL is not a streaming language, so there is no way to target from OpenCL the streaming outputs that existed in Brook+. It is possible to hit close to peak performance using the scatter outputs, as shown in the performance doc.