Floating-point operation differences between CPU and GPU

Question asked by kahlan on Nov 15, 2012
I have an OpenCL kernel that implements a dot product between two float arrays. The first is an array of size*n elements and the second is an array of n elements.

Here is the sample code:



    __kernel void evaluate_product(__global const float *pFirstArray,
                                   const int n,
                                   __global const float *pSecondArray,
                                   __global float *pOutput)
    {
        int gid = get_global_id(0);
        int size = get_global_size(0);

        if (gid >= 0 && gid < size)
        {
            // Each work-item accumulates one dot product in single precision.
            float output = 0;
            for (int k = 0; k < n; k++)
            {
                output += pFirstArray[gid + k*size] * pSecondArray[k];
            }
            pOutput[gid] = output;
        }
    }






If I run the same operations on the CPU, I get different results, mostly from the 6th or 7th decimal digit onward. Why does this happen? The Khronos OpenCL specification (v1.2) says that x+y and x*y are correctly rounded and IEEE 754 compliant.

Any ideas?
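
For anyone who wants to reproduce the comparison, here is a minimal sketch of a CPU reference. It is not the original host code, which the post does not show. The function names `evaluate_product_cpu`, `evaluate_product_cpu_double` and `max_rel_diff` are hypothetical, and the indexing is assumed to match the kernel (`gid + k*size`). The float version mirrors the kernel's accumulation order, the double version gives a higher-precision baseline, and the helper reports the largest relative gap between the two.

```c
#include <assert.h>

/* Hypothetical CPU counterpart of the kernel: same indexing and the same
   left-to-right accumulation in single precision. */
void evaluate_product_cpu(const float *first, int n, const float *second,
                          float *out, int size)
{
    assert(first && second && out && n >= 0 && size >= 0);
    for (int gid = 0; gid < size; gid++) {
        float acc = 0.0f;
        for (int k = 0; k < n; k++)
            acc += first[gid + k * size] * second[k];
        out[gid] = acc;
    }
}

/* Same sum, but products and accumulation are carried out in double,
   to serve as a higher-precision baseline. */
void evaluate_product_cpu_double(const float *first, int n, const float *second,
                                 double *out, int size)
{
    assert(first && second && out && n >= 0 && size >= 0);
    for (int gid = 0; gid < size; gid++) {
        double acc = 0.0;
        for (int k = 0; k < n; k++)
            acc += (double)first[gid + k * size] * (double)second[k];
        out[gid] = acc;
    }
}

/* Largest relative difference between a float result and the double
   baseline; entries whose baseline is exactly zero are compared absolutely. */
double max_rel_diff(const float *a, const double *b, int size)
{
    double worst = 0.0;
    for (int i = 0; i < size; i++) {
        double diff = (double)a[i] - b[i];
        if (diff < 0.0) diff = -diff;
        double mag = b[i] < 0.0 ? -b[i] : b[i];
        double rel = mag > 0.0 ? diff / mag : diff;
        if (rel > worst) worst = rel;
    }
    return worst;
}
```

The same `max_rel_diff` helper can be applied to the buffer read back from the GPU. A float carries roughly 7 significant decimal digits (`FLT_EPSILON` is about 1.19e-7), so a relative gap of that order would be consistent with ordinary single-precision rounding. That reading is an interpretation, not something established in the post.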