1 Reply Latest reply on Jan 11, 2016 8:05 AM by Meteorhead

    dot product reduction

    lightjohn

      Hello, newbie here.

       

      I want to compute a dot product like in this tutorial: http://developer.amd.com/resources/documentation-articles/articles-whitepapers/opencl-optimization-case-study-simple-reductions/


       

      My operation looks more like this:

       

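      /* Serial reference: accumulate in double so the products are not rounded to float. */
      double reduce_sum(const double* inputA, const double* inputB, int length) {
        double accumulator = 0.0;
        for (int i = 0; i < length; i++) {
          accumulator += inputA[i] * inputB[i];
        }
        return accumulator;
      }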

       

      But the tutorial is far too complex for me for now, so here is my current kernel:

       

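      #pragma OPENCL EXTENSION cl_khr_fp64 : enable
      #pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable
      
      // Emulates an atomic add on a double with a 64-bit compare-and-swap loop.
      void MyAtomicAdd(__global double *val, double delta) {
          union {double f; ulong i;} old;
          union {double f; ulong i;} next;
          do {
            old.f = *val;
            next.f = old.f + delta;
            // Retry if another work-item changed *val in the meantime.
          } while (atom_cmpxchg((volatile __global ulong *)val, old.i, next.i) != old.i);
      }
      
      // Every work-item adds its product to the same address, so the updates serialize.
      __kernel void inner(__global double *out, __global double *x, __global double *p)
      {
          int gid = get_global_id(0);
          MyAtomicAdd(out, x[gid]*p[gid]);
      }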
      
      
      

       

      It works, but it is very slow (2 seconds). So I thought I could use:

       

      atom_add((volatile __global ulong *)out, (ulong) (x[gid]*p[gid])) instead of MyAtomicAdd, but the output is wrong, so I think that ulong and double having the same size is not enough. Also, if someone could tell me how to do an atomic add with double, I would be very grateful, because Google is not much help in this case.

       

      Can you help me?

      Thank you for any advice.

        • Re: dot product reduction
          Meteorhead

          The case study is so complicated because OpenCL has no built-in atomic add for floating-point values. CUDA and NVIDIA hardware have such a feature, but even NVIDIA does not expose it as an OpenCL extension. That is why you have to use a two-step reduction, as explained in the case study: each work-group first reduces its own elements to one partial sum, and the partial sums are then combined in a second pass. OpenCL 2.0 provides work-group level reduction functions (such as work_group_reduce_add), so reducing over built-in operators is extremely simple.
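
          To make the two-step structure concrete, here is a minimal sketch in plain C that models it on the CPU. It is not the case study's kernel and not OpenCL code: GROUP_SIZE, group_partial_dot and two_step_dot are hypothetical names chosen for illustration, the scratch array stands in for __local memory, and the inner loops stand in for the work-items of one work-group.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical work-group size for this sketch; must be a power of two. */
#define GROUP_SIZE 64

/* Step 1: models one work-group. Each "work-item" stores one product in
 * scratch (a stand-in for __local memory). A tree reduction then halves
 * the number of active items each round until scratch[0] holds the
 * partial sum of the whole group. */
static double group_partial_dot(const double *x, const double *p,
                                size_t start, size_t length)
{
    double scratch[GROUP_SIZE];

    for (size_t lid = 0; lid < GROUP_SIZE; lid++) {
        size_t gid = start + lid;
        /* Items past the end of the input contribute zero. */
        scratch[lid] = (gid < length) ? x[gid] * p[gid] : 0.0;
    }

    for (size_t stride = GROUP_SIZE / 2; stride > 0; stride /= 2) {
        /* On a GPU, a barrier(CLK_LOCAL_MEM_FENCE) would separate rounds. */
        for (size_t lid = 0; lid < stride; lid++) {
            scratch[lid] += scratch[lid + stride];
        }
    }
    return scratch[0];
}

/* Step 2: combine the per-group partial sums. On a GPU this is a second,
 * much smaller pass (another kernel launch or a short host-side loop). */
double two_step_dot(const double *x, const double *p, size_t length)
{
    assert((GROUP_SIZE & (GROUP_SIZE - 1)) == 0);

    double result = 0.0;
    for (size_t start = 0; start < length; start += GROUP_SIZE) {
        result += group_partial_dot(x, p, start, length);
    }
    return result;
}
```

          Calling two_step_dot(x, p, n) gives the same value as the serial loop up to floating-point rounding, since the additions happen in a different order. The point of the structure is that no two work-items ever update the same address at the same time, so no atomic operation is needed.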