4 Replies Latest reply on Nov 4, 2012 7:35 AM by thejascr

    CPU and GPU outputs don't match even for a single-threaded GPU execution

    thejascr

      I am trying to run the following piece of LU decomposition code in an OpenCL kernel.

      'A' below is a single-precision floating-point array.

       

          for (k = 0; k < N; k++) {
             for (j = k + 1; j < N; j++) {
                A[k * N + j] = A[k * N + j] / A[k * N + k];
             }
             for (i = k + 1; i < N; i++) {
                for (j = k + 1; j < N; j++) {
                   A[i * N + j] = A[i * N + j] - (A[i * N + k] * A[k * N + j]);
                }
             }
          }

       

       

      I am running this code on the GPU in just a single GPU thread (completely sequential), so the global and local work sizes for the kernel are set as follows.

       

          globalthread[0] = 1;
          globalthread[1] = 1;
          localthread[0] = 1;
          localthread[1] = 1;

       

      But when I compare the GPU output to the output of the same function run on the CPU (directly, not as an OpenCL device), I see that the outputs don't match. I found this inexplicable in spite of my best efforts. While trying to narrow down the problem, I found that it arises from the second assignment, specifically from the subtraction, when the value of A[i][j] goes negative. I have made sure that both the CPU and GPU are working on the same inputs, but such strange behavior for such a simple computation seems weird. Can anyone help explain why the outputs might be differing?

      I also ran it on both an AMD device and an NVIDIA device and I see the same behavior on both (to rule out any platform-specific issue).

       

      Here is an example output:

       

      platform name is NVIDIA CUDA
      platform version is OpenCL 1.1 CUDA 4.2.1
      number of devices is 2
      device name is Tesla C2050 / C2070
      GPU Runtime: 0.023669s
      CPU Runtime: 0.000123s
      Values differ at index (45, 40): cpu_val=0.946256, gpu_val=0.963078
      Values differ at index (60, 52): cpu_val=-9.348129, gpu_val=-9.483719
      Values differ at index (61, 52): cpu_val=11.343384, gpu_val=11.093756
      Non-Matching CPU-GPU Outputs Beyond Error Threshold of 1.05 Percent: 3

        • Re: CPU and GPU outputs don't match even for a single-threaded GPU execution
          kbentley57

          While not an answer to your question in the OpenCL sense, have you tried altering the program to use double-precision values with the same algorithm, just to see whether the strange behavior is the result of rounding?

          1 of 1 people found this helpful
            • Re: CPU and GPU outputs don't match even for a single-threaded GPU execution
              thejascr

              Thank you!

              Yes that did solve the problem!!

              But I was really surprised at the level of inaccuracy of single-precision floating point between the CPU and GPU.

              But now I guess I have an explanation (please correct me if I am wrong):

               

              1. Using single-precision floats caused small differences in the intermediate floating-point values computed.

              2. But my matrix 'A' has a 1-norm condition number of 7.087e+3. Because of this I am losing up to 3 digits of accuracy on top of the rounding errors. This ill-conditioning caused the final result differences to be significantly high.

              Whereas when I used double there were no rounding differences between the CPU and GPU, so even with an ill-conditioned matrix I was able to get accurate results.

            • Re: CPU and GPU outputs don't match even for a single-threaded GPU execution
              d.a.a.

              The OpenCL compiler might be using a fused multiply-add (FMA) operation on the GPUs in order to speed things up and increase precision (the intermediate rounding is avoided). Since most current CPUs still don't implement FMA, the statement cannot be optimized there, so the operation must be done in two steps, losing some precision.

              1 of 1 people found this helpful
                • Re: CPU and GPU outputs don't match even for a single-threaded GPU execution
                  thejascr

                  Thank you for the pointer!

                  To verify whether this is the case I built the OpenCL kernel with "-cl-opt-disable". This should have disabled all the math optimizations, including FMA, right? But even with this disabled, only a few of the mismatches went away and the majority of them remained. So I guess the reason for the inaccuracy was the rounding error combined with the ill-conditioned matrix I was using.

                   

                  Also, I saw the option "-cl-mad-enable" that OpenCL specifies for clBuildProgram. Is this the FMA option you were referring to? The OpenCL spec says it is disabled by default and has to be explicitly enabled. Is that how the AMD runtime works, or does AMD's OpenCL compiler enable it by default? If so, how do I disable it?