This post is an evaluation of what I have achieved and experienced using ATI's Stream Computing, and maybe I get some feedback or even some hints to where I went wrong (if I did) and how to do better.
Task: Use Stream Computing to accelerate transformation of geographical coordinates
Test machine: Intel C2D E8500, Radeon HD 4870 1GB, 4 GB DDR2 RAM 1066 MHz, WinXP SP 3
Flaws and bugs in Brook+ pose a few problems particularly to the newbie to ATI's Stream Computing SDK and tools, but thankfully these were overcome with the help of a few knowledgable forum members.
When evaluating the results of an initial coordinate transformation test implementation using single precision floating point numbers (float) it turned out that compute errors summed up so badly that the results were unuseable.
Transition to double had the big problem that the built-in math functions all work with float, causing the precision problem to arise again. It turned out that while I could provide double versions of most of these functions, not all could be replaced, so precision problems remained. Due to the nature of the subject (and me not being an expert in numerical mathematics and its application on computer systems) I have not been able to solve these.
Transforming a coordinate in my test case involves quite a few steps. I have made sure that data remained on the Stream Computing hardware (gfx card). Still, with all optimizations I could apply, I only achieved a threefold speedup compared to the CPU code path (the original coordinate transformation code), and even that only for large numbers of coordinates (the break even point was at around 1 million double precision coordinates, i.e. 3 million double values (x,y plus zone specifier)).
I have to say that I don't know whether I could improve on this, because I haven't really understood code optimization like explained in the optimized matrix multiplication example, but I doubt that speed could be more than doubled that way.
So as a result, unless I have massively goofed up somewhere (which I wouldn't be able to detect), Stream Computing has proven to be useless for my application: It lacks the speed and most of all the numerical precision required.
That is pretty disappointing I have to say. Maybe this will change should there ever be Stream Computing hardware with full native double precision support.
Thanks for all the help I have received here.