Reduction kernel problem

Discussion created by berathebrain on May 17, 2009
OK, so I have the following problem: I don't know how to parallelize this function:

// multiply vector by vector (each vector should have one dimension equal to 1)
float matrix_DotProduct(const int n, const float* const a, const float* const b){
  float val = 0.0f;
  for(int j = 0; j < n; j++)
    val += a[j] * b[j];
  return val;
}

I couldn't use a reduction kernel for this, because a reduction kernel supports only one input stream and one output stream. So my question is: how do I write a kernel, or a set of kernels, that does exactly what this function does?
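
To make the constraint concrete, here is a minimal plain-C sketch of one possible decomposition. It assumes the work can be split into two passes: an ordinary element-wise pass that takes two inputs and writes one output, followed by a sum that takes one input and produces one output, which is the shape a reduction kernel accepts. This is not stream kernel code, and the names multiply_elements, sum_elements, dot_product_two_pass and scratch are hypothetical, chosen only for illustration.

```c
#include <stddef.h>

/* Pass 1 ("map"): two inputs, one output.
   Stands in for an ordinary element-wise kernel. */
static void multiply_elements(size_t n, const float *a, const float *b,
                              float *prod)
{
    for (size_t j = 0; j < n; j++)
        prod[j] = a[j] * b[j];
}

/* Pass 2 ("reduce"): one input, one output.
   Stands in for a sum reduction over the intermediate products. */
static float sum_elements(size_t n, const float *prod)
{
    float val = 0.0f;
    for (size_t j = 0; j < n; j++)
        val += prod[j];
    return val;
}

/* Hypothetical wrapper: scratch must have room for n floats and
   plays the role of the intermediate stream between the two passes. */
float dot_product_two_pass(size_t n, const float *a, const float *b,
                           float *scratch)
{
    multiply_elements(n, a, b, scratch);
    return sum_elements(n, scratch);
}
```

The cost of this split is one temporary buffer of n floats between the passes. Whether a second pass is acceptable on the actual hardware is something I have not measured.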

Thank you for your answers.