2 Replies Latest reply on May 18, 2009 1:15 AM by eduardoschardong

    Reduction kernel problem

    berathebrain

      OK, so I have the following problem: I don't know how to parallelize this function:

      // multiply vector by vector (each vector should have one dimension equal to 1)
      float matrix_DotProduct(const int n, const float* const a, const float* const b){
        float val = 0;
        for(int j=0;j<n;j++)
          val += a[j] * b[j];
        return val;
      }

      When I tried to use a reduction kernel, I couldn't, because a reduction kernel supports only one input stream and one output stream. So my question is: how do I write a kernel (or kernels) that does exactly what that function does?

      Thank you for your answers.

        • Reduction kernel problem
          berathebrain

          I may have found a solution, but it is extremely slow.

          The host code:

          float product;

          ::brook::Stream<float> streamB(1,&n);
          ::brook::Stream<float2> streamB2(1,&n);
          streamB.read(b);
          matrix_combine_gpu_ati(streamB,streamB,streamB2);
          matrix_DotProduct_gpu_ati(streamB2,product);

          Kernel code:

          kernel void matrix_combine_gpu_ati(float in1<>, float in2<>, out float2 out1<>)
          {
              out1.x=in1;
              out1.y=in2;
          }

          // multiply vector by vector (each vector should have one dimension equal to 1)
          reduce void matrix_DotProduct_gpu_ati(float2 a<>, reduce float c<>){
             c += a.x * a.y;
          }

            • Reduction kernel problem
              eduardoschardong

              I'm surprised your solution worked. To work, it should be something like:

              // element-wise multiply of the two input streams
              kernel void product(float a<>, float b<>, out float c<>)
              {
                  c = a * b;
              }

              // sum the products into a single value
              reduce void reduce_sum(float a<>, reduce float b<>)
              {
                  b += a;
              }
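
To make the data flow of that two-kernel split concrete, here is a minimal CPU-side sketch in plain C. It is not Brook+ and not GPU code: it only mimics a multiply pass that writes an intermediate stream, followed by a sum pass that reads it back. The names `dot_two_pass` and `tmp` are made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* CPU stand-in for the element-wise product kernel: c[i] = a[i] * b[i]. */
static void product(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}

/* CPU stand-in for the sum reduction: folds a stream into one value. */
static float reduce_sum(const float *a, int n) {
    float b = 0.0f;
    for (int i = 0; i < n; i++)
        b += a[i];
    return b;
}

/* Hypothetical helper: dot product as "multiply pass" then "reduce pass". */
float dot_two_pass(const float *a, const float *b, int n) {
    assert(n >= 0);
    if (n == 0)
        return 0.0f;

    /* Intermediate stream: written by the first pass, read by the second. */
    float *tmp = malloc((size_t)n * sizeof *tmp);
    if (tmp == NULL)
        abort();

    product(a, b, tmp, n);
    float result = reduce_sum(tmp, n);

    free(tmp);
    return result;
}
```

Calling `dot_two_pass(a, b, n)` returns the same value as the original loop, but every product is written to `tmp` and read back once, which is the extra memory traffic a two-kernel version pays for.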

               

              As for the performance issue, reduction kernels won't help. In fact, a dot product is likely to be limited by memory bandwidth. The best you can do is reduce the number of trips to memory by launching fewer kernels and doing your own reduction; LDS (local data share) may help.
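
The "do your own reduction" idea can be sketched on the CPU too. The following is a rough illustration in plain C, not actual GPU or Brook+ code, and it assumes a particular shape for the reduction: each block of elements is multiplied and tree-reduced in a small scratch array (standing in for on-chip local memory such as LDS), so no full-length intermediate stream is ever written. `BLOCK`, `block_dot` and `dot_fused` are hypothetical names, and the block size of 64 is an arbitrary choice.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 64  /* hypothetical work-group size; must be a power of two */

/* One emulated work-group: multiply up to BLOCK pairs, then tree-reduce. */
static float block_dot(const float *a, const float *b, size_t count) {
    float scratch[BLOCK];  /* stands in for on-chip local memory */
    assert(count <= BLOCK);

    /* Multiply on load; pad the tail of a short block with zeros. */
    for (size_t i = 0; i < BLOCK; i++)
        scratch[i] = (i < count) ? a[i] * b[i] : 0.0f;

    /* Tree reduction: halve the number of active slots each step. */
    for (size_t stride = BLOCK / 2; stride > 0; stride /= 2)
        for (size_t i = 0; i < stride; i++)
            scratch[i] += scratch[i + stride];

    return scratch[0];
}

/* Fused dot product: one pass over a and b, one partial sum per block. */
float dot_fused(const float *a, const float *b, size_t n) {
    float total = 0.0f;
    for (size_t start = 0; start < n; start += BLOCK) {
        size_t count = (n - start < BLOCK) ? n - start : BLOCK;
        total += block_dot(a + start, b + start, count);
    }
    return total;
}
```

Each input element is read exactly once and only one partial sum per block leaves the scratch array. Because the additions happen in a different order than in the sequential loop, the floating-point result can differ slightly from it for general inputs.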