9 Replies Latest reply on Mar 15, 2010 7:18 PM by fir3ball

    Benchmark CPU/GPU

    fir3ball
      N-Body implementations

      How does OpenCL "CPU" code typically compare to standard compiled CPU code (from C++/Fortran compilers)?

      At first glance, looking at the n-body sample, the CPU/GPU speedup is impressive:

      • OpenCL 1 CPU: 1.4 Gflops
      • OpenCL 2 CPU: 2.8 Gflops
      • OpenCL 4 CPU: 6 Gflops
      • OpenCL GPU: 250 Gflops

      On a single CPU, a native compiler still seems to have an edge over OpenCL:

      • OpenCL 1 CPU: 1.4 Gflops
      • g++: 1.5 Gflops
      • Intel C: 2.15 Gflops
      • Intel Fortran: 2.20 Gflops

      With 4 CPUs:

      • OpenCL 4 CPU: 6.2 Gflops
      • Fortran OpenMP: 9.7 Gflops

      My main goal here is to evaluate the viability of OpenCL's CPU mode, and so far it still looks worth keeping a natively compiled CPU branch.

       

      Any comment on this? Am I looking at the worst test case for this?

      (Yes, I know that in the end it's really algorithm-dependent... and that any speedup might not translate to every problem.  Also, the CPU OpenCL driver is quite new and subject to improvement.)

        • Benchmark CPU/GPU
          eduardoschardong

          In the OpenMP case, was the compiler able to use SSE? Could you share the source?

           

            • Benchmark CPU/GPU
              fir3ball

              No, I don't think this is suitable for SSE3 vectorization, at least not with this implementation.

              (Code is attached)

              But I think the compiler did the right things given the situation:

              oclNbodyGold_f_inlined.f90(11): (col. 7) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
              oclNbodyGold_f_inlined.f90(9): (col. 7) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
              oclNbodyGold_f_inlined.f90(7): (col. 3) remark: LOOP WAS VECTORIZED.

              The output assembly has its share of xmm* register accesses, but it's not that kind of vectorization.  Here are additional compiler diagnostics:

              oclNbodyGold_f_inlined.f90(12): (col. 3) remark: loop was not vectorized: not inner loop.
              oclNbodyGold_f_inlined.f90(13): (col. 5) remark: loop was not vectorized: low trip count.
              oclNbodyGold_f_inlined.f90(15): (col. 5) remark: loop was not vectorized: existence of vector dependence.
              oclNbodyGold_f_inlined.f90(30): (col. 5) remark: loop was not vectorized: low trip count.

              OpenMP parallelization roughly means the loop is cut in 4 and executed in 4 threads; I'm pretty sure ATI's CPU OpenCL does the same thing, just a tad slower than fully native code.

              Edit: Single-threaded, this code produces the same output as the OpenCL kernels used for benchmarking.  The speedup is not due to a dumb error that would allow overzealous optimization by the compiler :-)

               

              SUBROUTINE computeGold( force, pos, numBodies, softeningSquared, delT )
                real*4 force(4,numBodies), pos(4,numBodies), softeningSquared, f(3)
                real*4 r(3), invDist, invDistCube, delT, acc(3)
                integer*4 numBodies, i, j, k
                force = 0
              !$OMP PARALLEL
              !$OMP DO PRIVATE(f,r,s,distSqr,invDist,invDistCube)
                do i = 1, numBodies
                  f = 0
                  do j = 1, numBodies
                    r(1) = pos(1,j) - pos(1,i)
                    r(2) = pos(2,j) - pos(2,i)
                    r(3) = pos(3,j) - pos(3,i)
                    distSqr = r(1)*r(1) + r(2)*r(2) + r(3)*r(3)
                    distSqr = distSqr + softeningSquared
                    invDist = 1/sqrt(distSqr)
                    invDistCube = invDist*invDist*invDist
                    s = pos(4,j)*invDistCube
                    f = f + r*s
                  end do
                  do k = 1, 3
                    pos(k,i) = pos(k,i) + force(k,i)*delT + 0.5*f(k)*delT*delT
                    force(k,i) = force(k,i)*delT
                  end do
                end do
              !$OMP END PARALLEL
              END

            • Benchmark CPU/GPU
              _Big_Mac_

              I'm also interested in whether this is compared to a vectorized (SSE) native implementation.

              • Benchmark CPU/GPU
                vignyan

                Hi fir3ball, 

                Do you have numbers for matrix multiply from a similar experiment? I got much better performance with OpenCL on the CPU compared to a plain CPU function. However, I did not thread my single-CPU function, in which case, on my dual-core computer, it should run faster using OpenCL. 

                Did you use multi-threading in your program while running on CPU alone? 

                  • Benchmark CPU/GPU
                    lagacep

                    (fir3ball, wrong account logged in)

                    > Do you have numbers for the matrix multiply for similar experiment?

                    Not yet, but this first experiment piqued my interest. 

                    > Did you use multi-threading in your program while running on CPU alone?

                    Yes, the OpenMP program is equivalent to 4-CPU multithreading, but the compiler did the job automatically.

                     

                    For an early implementation, the ATI OpenCL CPU implementation is already pretty good, but there is room for improvement to match native code.  The cool thing would be to write a single OpenCL code and run it everywhere.

                  • Benchmark CPU/GPU
                    genaganna

                     

                    Originally posted by: fir3ball How does OpenCL "CPU" code compare typically to standard compiled CPU code (with C++/Fortran compilers)? [...]


                    Fir3ball,

                               Is it possible to share the complete source code?  How are you calculating Gflops in the OpenCL-CPU case?

                      • Benchmark CPU/GPU
                        fir3ball

                         

                        Originally posted by: genaganna

                                   Is it possible to share the complete source code?  How are you calculating Gflops in the OpenCL-CPU case?

                        The OpenCL code is simply the one from the n-body SDK sample.
                        To run it in CPU mode, pass "--device cpu".


                        The GFlops computation is the one from the SDK sample.  It combines the number of bodies and the execution time.  I'm not sure of its absolute accuracy, but it's more of a comparative measure, as the C++/Fortran kernel is doing the same work.
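                        As a sketch of how such samples typically turn timing into a Gflops figure (hedged: the ~20 flops-per-interaction constant is the usual n-body demo convention, not necessarily the exact constant this SDK uses):

```c
/* Assumed convention: ~20 floating-point ops per body-body
   interaction, N^2 interactions per step.  Treat the constant as
   illustrative, not as the SDK's exact figure. */
#define FLOPS_PER_INTERACTION 20.0

double nbody_gflops(int numBodies, int steps, double seconds)
{
    double flops = FLOPS_PER_INTERACTION
                 * (double)numBodies * (double)numBodies
                 * (double)steps;
    return flops / (seconds * 1e9);   /* Gflop/s */
}
```

                        Under that convention, 1000 bodies stepped once in 20 ms works out to 1 Gflop/s, so the figure is only as accurate as the flop count and the timer.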

                          • Benchmark CPU/GPU
                            ryta1203

                            So the OpenCL version is from the SDK but the fortran version is yours?

                            I'd have to say that's not an accurate comparison.

                              • Benchmark CPU/GPU
                                fir3ball

                                 

                                Originally posted by: ryta1203 So the OpenCL version is from the SDK but the fortran version is yours?

                                 

                                I'd have to say that's not an accurate comparison.

                                 

                                 Well, the end result is the same: the output can be verified against the exact same data, and the unit of measure here is based on pure timing.

                                 To be fair to CL-on-CPU, the CL kernel would need to be optimized for best performance on a CPU within the OpenCL framework, just as kernels might be tweaked for ATI/NVIDIA.  But that would defeat the "write once in CL, run everywhere" scenario.

                                 Bottom line (for me): if I need top-notch performance on a CPU (in the absence of a GPU), I might still need to compile a C/Fortran version of the function.