Right now, we have highly optimized processors capable of single-precision calculations in massively parallel scenarios.
When will we see the same become possible for double precision: transcendental operations, 4-dimensional double short vectors and all?
Currently, in my project LibNoiseNG, I've noticed that as the modules become more complicated, to a level that might fairly be called well rounded and usable in today's gaming industry, the speedup is becoming less and less drastic. A lot of optimization will likely have to happen before the nuances and complexities that keep the Stream-accelerated version from performing much better than the CPU can be worked out.
Seeing the times get so close together in the more complicated renders just proves the point that those before me have made: a big sacrifice, double precision, is being made for a matter of a few seconds of speedup.
So it seems the glory days for these processors depend heavily on their ability to perform double-precision calculations with short4 types.
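To make the precision sacrifice concrete, here is a minimal sketch of what single precision does to a small step added to a large coordinate. The values (`coordinate`, `step`) and the helper `to_float32` are made up for illustration and are not taken from LibNoiseNG; single precision is emulated on the CPU by rounding through a 32-bit float.

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Hypothetical values: a large noise-space coordinate and a small sampling step.
coordinate = 100000.0
step = 0.001

# Double precision keeps the step.
double_result = coordinate + step

# Single precision: near 100000 the spacing between adjacent float32 values
# is 2**-7 = 0.0078125, so a step of 0.001 is less than half that spacing
# and the sum rounds back to the original coordinate.
single_result = to_float32(to_float32(coordinate) + to_float32(step))

print("double:", repr(double_result))
print("single:", repr(single_result))
print("step lost in single precision:", single_result == coordinate)
```

Whether this matters in a given render depends on the coordinate ranges and step sizes actually used, so treat it as an illustration of the trade-off rather than a measurement.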
That is not to deny the extreme performance gains that can be had, however. In my renders that involve processing only one module, with no compositing, the speedup is usually 6x or greater across a set of about 2 million float3 inputs. This would obviously add up quickly if my task were to do this a thousand times per run; the CPU would quickly become impractical compared to the GPU.
So once again, when do we take this next step?