I'm surprised your solution worked. For it to work, it should look something like this:
// Element-wise product of the two input streams.
kernel void product(float a<>, float b<>, out float c<>)
{
    c = a * b;
}
// Sum all elements of stream a into b.
reduce void reduce_sum(float a<>, reduce float b<>)
{
    b += a;
}
As for the performance issue, reduction kernels won't help. A dot product is likely to be limited by memory bandwidth, so the best you can do is reduce the number of trips to memory: launch fewer kernels and do your own reduction. The LDS (local data share) may help.
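To make the "fewer trips to memory" point concrete, here is a rough sketch in plain C on the CPU, not Brook+ and not tuned for any GPU. It fuses the multiply and the sum into a single pass, so the intermediate product stream c is never written out and read back. The names dot_fused and BLOCK are made up for illustration, and the per-block partial variable only stands in for what a work-group might accumulate in the LDS.

```c
#include <stddef.h>

#define BLOCK 256  /* hypothetical work-group size, chosen for illustration */

/* Fused dot product: multiply and accumulate in one pass, so each input
   element is read once and no intermediate product array is stored. */
static float dot_fused(const float *a, const float *b, size_t n)
{
    float total = 0.0f;
    for (size_t start = 0; start < n; start += BLOCK) {
        size_t end = (start + BLOCK < n) ? start + BLOCK : n;
        float partial = 0.0f;  /* stands in for a per-group accumulator in LDS */
        for (size_t i = start; i < end; ++i)
            partial += a[i] * b[i];
        total += partial;      /* final reduction over the per-block partials */
    }
    return total;
}
```

With the two-kernel version, every element of c gets written by product and read again by reduce_sum. Here the only memory traffic is the one read of a and b, which is about the best a bandwidth-bound dot product can do.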