As far as I can tell you can't use globals in Brook+ kernels.
The strong type checking stops all those nice shortcuts you get in OpenGL/D3D :(
You won't get increased ALU throughput from double2, because double-precision multiply and multiply-add each use 4 of the 5 lanes. Double-precision add could double in throughput, as it only uses 2 lanes. Unfortunately the compiler doesn't identify this in your code, so it only issues one DADD at a time.
Double-precision divide is slow.
By using double2 instead of double for streams and gather/scatter operations, you will improve memory performance, because fewer reads and writes are needed for a given amount of data.
I put this:
kernel double2 SinD2 (double2 phi)
{
    double2 _f1 = double2(1.0, 1.0);
    double2 _f6 = double2(6.0, 6.0);
    double2 _f20 = double2(20.0, 20.0);
    double2 _f42 = double2(42.0, 42.0);
    double2 _f72 = double2(72.0, 72.0);
    double2 _f110 = double2(110.0, 110.0);
    double2 _f156 = double2(156.0, 156.0);
    double2 _f210 = double2(210.0, 210.0);
    double2 _f272 = double2(272.0, 272.0);
    double2 phi2 = phi * phi;
    return phi * (_f1 - phi2 / _f6 * (_f1 - phi2 / _f20 * (_f1 - phi2 / _f42 * (_f1 - phi2 / _f72 * (_f1 - phi2 / _f110 * (_f1 - phi2 / _f156 * (_f1 - phi2 / _f210 * (_f1 - phi2 / _f272))))))));
}
kernel void test(double2 A<>, out double2 B<>)
{
    B = SinD2(A);
}
into Stream Kernel Analyzer where the instructions that run on the chip can be seen.
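For reference, the nested expression in SinD2 appears to be the Taylor series of the sine about zero, truncated after the ninth term (the one in phi^17) and written in nested (Horner) form. The constants 6, 20, 42, ..., 272 are the products 2·3, 4·5, 6·7, ..., 16·17. Written out, assuming that reading is right:

```latex
\sin\varphi \;\approx\; \sum_{k=0}^{8} \frac{(-1)^k\,\varphi^{2k+1}}{(2k+1)!}
\;=\; \varphi\left(1 - \frac{\varphi^2}{2\cdot 3}\left(1 - \frac{\varphi^2}{4\cdot 5}\left(1 - \cdots\left(1 - \frac{\varphi^2}{16\cdot 17}\right)\right)\right)\right)
```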
You can't use global variables in kernels. If you use them in only one kernel, you can declare them inside it; if you need them in several places and want to avoid duplicating code, you can use the preprocessor's "#define". Brook 1.4 supports some preprocessor directives: just add the "-pp" flag when you call BRCC to compile the source code.
The expression you mention doesn't compile because, in the current version, both operands must be of the same type: "v" is a double2 value, but 0.5 is a double. You should write:
double2 v = double2(1.0, 1.0);
v = v * double2(0.5, 0.5);
Another solution would be to disable strong type checking with the "-a" flag; your code will then compile and multiply each component by 0.5, as you describe.
I think using double2 instead of double is better for fetching several values together from memory. Inside kernels, arithmetic expressions are optimized by the compiler automatically, which reorders instructions to use as many processing units as possible (you can check this using KernelAnalyzer).
By the way, I think you can avoid some of those divisions by storing the reciprocals in the constants and multiplying instead. According to KernelAnalyzer the compiler is smart enough to make this optimization itself; however, it could miss some optimizations in complex expressions or when kernel parameters are involved.
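To make the reciprocal idea concrete, here is a rough host-side sketch in plain C (not Brook+, and on scalar double rather than double2), so it can be checked against the libm sin() without a GPU. The function names sin_div and sin_rcp are made up for this illustration; the first mirrors the division form of the kernel, the second multiplies by precomputed reciprocals.

```c
#include <math.h>

/* Division form: mirrors the nested expression from the SinD2 kernel,
 * written for a scalar double instead of double2. */
static double sin_div(double phi)
{
    double p2 = phi * phi;
    return phi * (1.0 - p2 / 6.0 * (1.0 - p2 / 20.0 * (1.0 - p2 / 42.0 *
                 (1.0 - p2 / 72.0 * (1.0 - p2 / 110.0 * (1.0 - p2 / 156.0 *
                 (1.0 - p2 / 210.0 * (1.0 - p2 / 272.0))))))));
}

/* Reciprocal form (hypothetical rewrite): the same series, but every
 * division by a constant becomes a multiplication by its reciprocal.
 * The reciprocals are constant expressions, so they are evaluated once
 * at compile time rather than on every call. */
static double sin_rcp(double phi)
{
    static const double r[8] = {
        1.0 / 6.0,   1.0 / 20.0,  1.0 / 42.0,  1.0 / 72.0,
        1.0 / 110.0, 1.0 / 156.0, 1.0 / 210.0, 1.0 / 272.0
    };
    double p2 = phi * phi;
    double acc = 1.0;
    int i;

    /* Evaluate from the innermost bracket outwards. */
    for (i = 7; i >= 0; --i)
        acc = 1.0 - p2 * r[i] * acc;

    return phi * acc;
}
```

For small arguments the two forms should agree with each other and with sin() to within rounding, although the reciprocal form can differ in the last bit or so because 1.0 / 6.0 and friends are not exactly representable. Whether it is actually faster on the GPU is something KernelAnalyzer would have to confirm.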
Ummm ... yeah, thanks, but as I wrote this was solved for me already.
Thanks for the performance hints though.