
thejascr
Adept II

CPU and GPU outputs don't match even for a single-threaded GPU execution

I am trying to run the following piece of LU decomposition code in an OpenCL kernel.

'A' below is a single-precision floating-point array.

    for (k = 0; k < N; k++)
    {
        /* divide the elements of row k to the right of the pivot by the pivot */
        for (j = k + 1; j < N; j++) {
            A[k * N + j] = A[k * N + j] / A[k * N + k];
        }

        /* update the trailing submatrix */
        for (i = k + 1; i < N; i++) {
            for (j = k + 1; j < N; j++) {
                A[i * N + j] = A[i * N + j] - (A[i * N + k] * A[k * N + j]);
            }
        }
    }

I am running this code on the GPU on just a single GPU thread (completely sequential). So I have the global and local work sizes for the kernel set as follows.

    globalthread[0] = 1;
    globalthread[1] = 1;
    localthread[0] = 1;
    localthread[1] = 1;
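
For reference, a single work-item launch of this kind would be enqueued roughly as follows. This is only a sketch; the variable names (queue, kernel) and the omitted error handling are assumptions, not the original host code.

    /* Minimal sketch of a 2D launch with a 1x1 global and local size,
       assuming the command queue and kernel have already been created. */
    size_t globalthread[2] = { 1, 1 };
    size_t localthread[2]  = { 1, 1 };

    cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                        2,            /* work_dim */
                                        NULL,         /* global offset */
                                        globalthread, /* global work size */
                                        localthread,  /* local work size */
                                        0, NULL, NULL);
    /* With a 1x1 range the kernel runs as a single work-item,
       i.e. fully sequentially on the device. */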

But when I compare the GPU output to the output of the same function run on the CPU (directly, not as an OpenCL device), I am seeing that the outputs don't match.

I have been unable to explain this despite my best efforts. While trying to narrow down the problem, I found that it arises from the second statement, specifically the subtraction operation, and when the value of A goes negative.

I have made sure that both the CPU and GPU are working on the same inputs. Such strange behavior for such a simple computation seems weird. Can anyone help explain why the outputs might be differing?

I also ran it on both an AMD device and an NVIDIA device and I see the same behavior on both (to rule out any platform-specific issue).

Here is an example output:

platform name is NVIDIA CUDA
platform version is OpenCL 1.1 CUDA 4.2.1
number of devices is 2
device name is Tesla C2050 / C2070
GPU Runtime: 0.023669s
CPU Runtime: 0.000123s
Values differ at index (45, 40): cpu_val=0.946256, gpu_val=0.963078
Values differ at index (60, 52): cpu_val=-9.348129, gpu_val=-9.483719
Values differ at index (61, 52): cpu_val=11.343384, gpu_val=11.093756
Non-Matching CPU-GPU Outputs Beyond Error Threshold of 1.05 Percent: 3
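
The report above comes from an element-wise comparison against a relative error threshold. A minimal sketch of such a check (the helper name and the way the threshold is applied are my assumptions, not the original harness):

    #include <math.h>
    #include <stdio.h>

    /* Hypothetical comparison of CPU and GPU results, flagging entries whose
       relative difference exceeds a percentage threshold (1.05% above). */
    static int compare_results(const float *cpu, const float *gpu, int N,
                               float threshold_percent)
    {
        int mismatches = 0;
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                float c = cpu[i * N + j];
                float g = gpu[i * N + j];
                float rel = fabsf(c - g) / fmaxf(fabsf(c), 1e-30f);
                if (rel * 100.0f > threshold_percent) {
                    printf("Values differ at index (%d, %d): cpu_val=%f, gpu_val=%f\n",
                           i, j, c, g);
                    mismatches++;
                }
            }
        }
        return mismatches;
    }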

kbentley57
Adept I

While not an answer to your question in the OpenCL sense, have you tried altering the program to use double-precision values with the same algorithm, just to see whether the strange behavior is the result of rounding?
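
For what it's worth, switching the kernel to double precision usually just requires enabling the fp64 extension and changing the element type, roughly like this. This is a sketch only, assuming the device exposes cl_khr_fp64; the kernel name here is made up.

    // Sketch only: enable 64-bit floats and change the array type.
    #pragma OPENCL EXTENSION cl_khr_fp64 : enable

    __kernel void lu_decompose_fp64(__global double *A, const int N)
    {
        /* ...same loop nest as in the original post, operating on doubles... */
    }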

Thank you!

Yes, that did solve the problem!

But I was really surprised at the level of mismatch between the CPU and GPU for single-precision floating point.

But now I guess I have an explanation (please correct me if I am wrong):

1. Using single-precision floats caused small differences in the intermediate floating-point values computed on the CPU and the GPU.

2. My matrix 'A' has a 1-norm condition number of 7.087e+3. Because of this I am losing up to 3 digits of accuracy on top of the rounding errors, so the ill-conditioning amplified those small intermediate differences into significantly large differences in the final result.

Whereas when I used double there were no visible rounding differences between the CPU and GPU, so even with an ill-conditioned matrix I was able to get accurate results.
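
As a rough sanity check on point 2: the usual rule of thumb is that relative error can be amplified by about the condition number times the machine epsilon. A tiny calculation along those lines (this bound is only a heuristic, not something measured in this thread):

    #include <float.h>
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* Rule-of-thumb error amplification: kappa(A) * machine epsilon. */
        double kappa   = 7.087e3;       /* 1-norm condition number quoted above */
        double eps_f32 = FLT_EPSILON;   /* ~1.19e-7  (single precision) */
        double eps_f64 = DBL_EPSILON;   /* ~2.22e-16 (double precision) */

        printf("single precision: relative error up to ~%.2e (~%.1f digits lost)\n",
               kappa * eps_f32, log10(kappa));
        printf("double precision: relative error up to ~%.2e\n",
               kappa * eps_f64);
        /* ~8e-4 per operation in single precision; accumulated over many
           elimination steps, differences beyond a 1.05% threshold become
           plausible, while ~1.6e-12 in double precision stays far below it. */
        return 0;
    }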

d_a_a_
Adept II

The OpenCL compiler might be using a fused multiply-add (FMA) operation on the GPUs in order to speed things up and increase precision (the intermediate rounding is avoided). Since most current CPUs still don't implement FMA, the statement cannot be optimized there, so the operation must be done in two steps, losing some precision.
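
To make the rounding difference concrete, here is a small host-side illustration comparing a separate multiply-and-add against the fused version. The input values are just examples chosen so that the single rounding matters, and it assumes the compiler does not itself contract the expression (e.g. build with -ffp-contract=off).

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* float(1/3) * 3 is slightly above 1, so whether the product is
           rounded before the addition changes the result. */
        float a = 1.0f / 3.0f;
        float b = 3.0f;
        float c = -1.0f;

        float separate = a * b + c;      /* product rounded to float first */
        float fused    = fmaf(a, b, c);  /* single rounding at the very end */

        printf("separate = %.10e\n", separate);   /* 0.0 */
        printf("fused    = %.10e\n", fused);      /* ~2.98e-8 */
        /* The separate version rounds a*b back to exactly 1.0f and cancels to
           zero, while fmaf keeps the exact product and returns a small
           non-zero value. Across many LU update steps such differences can be
           amplified further by an ill-conditioned matrix. */
        return 0;
    }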

Thank you for the pointer!

To verify whether this is the case, I built the OpenCL kernel with "-cl-opt-disable". That should have disabled all the math optimizations, including FMA, right? But even with this, only a few of the mismatches went away and the majority remained. So I guess the reason for the inaccuracy was the rounding error combined with the ill-conditioned matrix I was using.

Also, I saw the option "-cl-mad-enable" that OpenCL specifies for clBuildProgram. Is this the FMA option you were referring to? OpenCL says this is disabled by default and has to be explicitly enabled. Is that how the AMD runtime works, or does AMD's OpenCL compiler enable this by default? If so, how do I disable it?
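
For what it's worth, both flags are just strings passed as the options argument of clBuildProgram, so toggling them looks roughly like this (a sketch; the program and device variables are assumed to exist already):

    /* Build options are plain strings; try each in turn and compare the
       GPU output against the CPU reference. */
    const char *opts_no_opt = "-cl-opt-disable";
    const char *opts_mad    = "-cl-mad-enable";

    cl_int err = clBuildProgram(program, 1, &device,
                                opts_no_opt,   /* or opts_mad, or "" */
                                NULL, NULL);
    if (err != CL_SUCCESS) {
        /* query the build log with clGetProgramBuildInfo(...,
           CL_PROGRAM_BUILD_LOG, ...) to see what went wrong */
    }

Note that, per the OpenCL specification, -cl-mad-enable permits a * b + c to be replaced by a reduced-accuracy mad; that is related to, but not the same thing as, a correctly rounded FMA.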
