If your code and algorithm are correct, it should not behave that way. Have you considered things like truncation error?
Floating point is riddled with these problems. Anyone who deals with them will hit them on any platform.
And yes, OpenCL's precision on a GPU is not identical to the precision on a CPU.
The OpenCL specification defines the minimum precision required of implementations; it is up to you to verify that your maths is stable at that precision, or to find ways to account for it, such as restructuring the maths or replacing some of the functions with your own more accurate versions.
First of all, for the basic operations (+, -, /, *), AMD GPUs give exactly the same results as a CPU (with the exception of native double division). For fused multiply-add, the accuracy is even higher than what a plain CPU multiply-then-add can deliver, because the intermediate product is not rounded.
Most people simply forget that the x87 FPU uses 80-bit precision for its internal registers and all operations. Only when float/double values are stored to memory are they rounded to their proper 32- or 64-bit representation.
The difference is not due to some magical GPU inaccuracy, but because you are comparing results from 80-bit maths with results from 32- or 64-bit maths.
There are two ways to get the same results on the CPU. You can force every basic operation to store its result to memory before it is reused (for example by overloading the operators in C++). Or you can switch to SSE, which doesn't use the archaic 80-bit x87 mode (you can force gcc to use SSE instead of the FPU with -mfpmath=sse).
What type of computations are you performing?
If you are performing linear algebra computations like LU decomposition, a very high condition number (an ill-conditioned matrix) will magnify the GPU's rounding errors many times over and make the end results significantly different. I faced this problem, and it was resolved when I switched to doubles. Please see my most recent post.