Hello,

I have an OpenCL kernel that implements a dot product between two float arrays. The first is an array of size*n elements and the second is an array of n elements.

Here is a sample of the code:

__kernel
void evaluate_product(__global const float *pFirstArray,
                      const int n,
                      __global const float *pSecondArray,
                      __global float *pOutput)
{
    int gid = get_global_id(0);
    int size = get_global_size(0);
    if (gid >= 0 && gid < size)
    {
        float output = 0;
        for (int k = 0; k < n; k++)
        {
            output += pFirstArray[gid + k*size] * pSecondArray[k];
        }
        pOutput[gid] = output;
    }
}

If I execute the same operations on the CPU, I get different results, mostly from the 6th or 7th decimal digit onward. Why this strange behaviour? The Khronos OpenCL specification (v1.2) says that x+y and x*y are correctly rounded and IEEE 754 compliant.

Any ideas?

Lately this is such a common "problem" that I will just repost my answer from another thread.

First of all, if we are talking about the basic operations ( +, -, /, * ), AMD GPUs give exactly the same results as CPUs (with the exception of native double division). For fused mad the accuracy is even higher than what CPUs can do.

Most people simply forget that the CPU's x87 FPU uses 80-bit precision for its internal registers and all operations. Only when you store float/double values to memory are they truncated to the proper size/representation.

The difference is not caused by GPU inaccuracy; it arises because you are comparing results of 80-bit math with results of 32- or 64-bit math.

There are 2 options to get the same results on the CPU. You can force every basic operation to store its result to memory before it is reused (e.g. by overloading operators in C++). Or you can switch to SSE, because it doesn't use this archaic 80-bit FPU mode (with gcc you can pass the flag -mfpmath=sse to use SSE instead of the x87 FPU).