
berathebrain
Journeyman III

Reduction kernel problem

OK, so I have the following problem: I don't know how to parallelize this function:

// multiply vector by vector (each vector should have one dimension equal to 1)
float matrix_DotProduct(const int n, const float* const a, const float* const b){
  float val = 0;
  for(int j=0;j<n;j++)
    val += a[j] * b[j];  // accumulate the element-wise products
  return val;
}

When I tried to use a reduction kernel, I couldn't, because a reduction kernel supports only one input stream and one output stream. So my question is: how do I write a kernel (or several kernels) that does exactly what that function does?

Thank you for your answers.

2 Replies
berathebrain
Journeyman III

I may have found a solution, but it is extremely slow.

The host code:

float product;

::brook::Stream<float> streamB(1,&n);
::brook::Stream<float2> streamB2(1,&n);
streamB.read(b);
matrix_combine_gpu_ati(streamB,streamB,streamB2);
matrix_DotProduct_gpu_ati(streamB2,product);

Kernel code:

kernel void matrix_combine_gpu_ati(float in1<>, float in2<>, out float2 out1<>)
{
    out1.x=in1;
    out1.y=in2;
}

// multiply vector by vector (each vector should have one dimension equal to 1)
reduce void matrix_DotProduct_gpu_ati(float2 a<>, reduce float c<>) {
   c += a.x * a.y;
}

0 Likes

I'm surprised your solution worked. I would expect it to need something like this:

kernel void product(float a<>, float b<>, out float c<>)
{
    c = a * b;
}

reduce void reduce_sum(float a<>, reduce float b<>)
{
    b += a;
}
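
For checking GPU results on the host, the two kernels above can be mirrored in plain C. This is only a sketch of what the map pass and the reduce pass compute, not Brook+ code; the names product_pass, reduce_sum_pass and dot_two_pass, and the caller-supplied scratch buffer, are made up for illustration.

```c
#include <stddef.h>

/* Pass 1 (map): c[i] = a[i] * b[i], the same work the `product` kernel does
 * for each stream element. */
void product_pass(size_t n, const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}

/* Pass 2 (reduce): add up every element of c, the same work `reduce_sum`
 * does, here done sequentially. */
float reduce_sum_pass(size_t n, const float *c)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += c[i];
    return sum;
}

/* Hypothetical helper: dot product as map followed by reduce.
 * `scratch` must hold at least n floats; it plays the role of the
 * intermediate stream between the two kernels. */
float dot_two_pass(size_t n, const float *a, const float *b, float *scratch)
{
    product_pass(n, a, b, scratch);
    return reduce_sum_pass(n, scratch);
}
```

Note that the intermediate buffer is written once and read once, which is the extra memory traffic a two-kernel version pays for. A GPU reduction may also add the terms in a different order than this loop, so small floating-point differences against the host result are to be expected.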

 

As for the performance issue, reduction kernels won't help. In fact, a dot product is likely to be limited by memory bandwidth. The best you can do is cut the number of trips to memory by launching fewer kernels and doing your own reduction; the LDS (local data share) may help.
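
To make the "do your own reduction" idea concrete, here is a rough CPU-side sketch in plain C of a fused version: the multiply happens while each block is loaded, and each block is then tree-reduced in a small local array that stands in for on-chip local memory. The block size of 64, the name dot_fused, and the sequential outer loop are assumptions for illustration only; on a GPU each block would be handled by one work-group in parallel.

```c
#include <stddef.h>

#define BLOCK 64  /* assumed work-group size; must be a power of two */

/* Hypothetical fused dot product: no intermediate product array is
 * written to or read back from main memory. */
float dot_fused(size_t n, const float *a, const float *b)
{
    float total = 0.0f;

    for (size_t base = 0; base < n; base += BLOCK) {
        float local[BLOCK];  /* stands in for on-chip local memory */

        /* Multiply while loading; pad the tail of the last block with zeros. */
        for (size_t i = 0; i < BLOCK; i++) {
            size_t j = base + i;
            local[i] = (j < n) ? a[j] * b[j] : 0.0f;
        }

        /* Tree reduction: halve the number of active elements each step. */
        for (size_t stride = BLOCK / 2; stride > 0; stride /= 2)
            for (size_t i = 0; i < stride; i++)
                local[i] += local[i + stride];

        /* One partial sum per block is all that leaves the block. */
        total += local[0];
    }
    return total;
}
```

Each input element is read from main memory exactly once here, and only one value per block is carried forward, which is the traffic saving the advice above is aiming for.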

 
