OK, so I have the following problem: I don't know how to parallelize this function:

// multiply vector by vector (each vector should have one dimension equal to 1)
float matrix_DotProduct(const int n, const float* const a, const float* const b){
    float val = 0;
    for(int j=0;j<n;j++)
        val += a[j] * b[j];
    return val;
}

I tried to use a reduction kernel but couldn't, because a reduction kernel supports only one input stream and one output stream. So my question is: how do I write a kernel (or several kernels) that does exactly what that function does?

Thank you for your answers.

I may have found a solution, but it is extremely slow.

The host code:

float product;
::brook::Stream<float> streamB(1,&n);
::brook::Stream<float2> streamB2(1,&n);
streamB.read(b);
matrix_combine_gpu_ati(streamB,streamB,streamB2);
matrix_DotProduct_gpu_ati(streamB2,product);

Kernel code:

// pack two float streams into one float2 stream
kernel void matrix_combine_gpu_ati(float in1<>,float in2<>,out float2 out1<>)
{
    out1.x=in1;
    out1.y=in2;
}

// multiply vector by vector (each vector should have one dimension equal to 1)
reduce void matrix_DotProduct_gpu_ati(float2 a<>,reduce float c<>){
    c += a.x * a.y;
}
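
To make clear what I expect the two kernels above to compute together, here is a minimal plain C sketch of the same two passes on the CPU. It is only a sequential model under my own assumptions, not Brook+ code: the pair2 struct is a hypothetical stand-in for float2, and the names combine, reduce_dot and dot_two_pass are made up for illustration. The order in which a GPU reduction adds the terms may differ from this loop.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the float2 stream element (not the Brook+ type). */
typedef struct { float x, y; } pair2;

/* Pass 1: models matrix_combine_gpu_ati, packing in1[i] and in2[i] together. */
static void combine(int n, const float *in1, const float *in2, pair2 *out)
{
    for (int i = 0; i < n; i++) {
        out[i].x = in1[i];
        out[i].y = in2[i];
    }
}

/* Pass 2: models matrix_DotProduct_gpu_ati, summing x*y over all elements. */
static float reduce_dot(int n, const pair2 *s)
{
    float c = 0.0f;
    for (int i = 0; i < n; i++)
        c += s[i].x * s[i].y;
    return c;
}

/* Both passes chained; should match matrix_DotProduct(n, a, b). */
float dot_two_pass(int n, const float *a, const float *b)
{
    pair2 *packed = malloc((size_t)n * sizeof *packed);
    if (packed == NULL)
        return 0.0f; /* allocation failed (or n == 0): nothing to sum */
    combine(n, a, b, packed);
    float result = reduce_dot(n, packed);
    free(packed);
    return result;
}
```

Calling dot_two_pass(n, a, b) with both input vectors should give the same value as matrix_DotProduct(n, a, b). The extra packed buffer corresponds to the intermediate float2 stream, which is one more full pass over the data and may be part of why my version is slow.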