
fir3ball
Journeyman III

Benchmark CPU/GPU

N-Body implementations

How does OpenCL "CPU" code compare typically to standard compiled CPU code (with C++/Fortran compilers)?

At first glance, looking at the n-body sample, the CPU/GPU speedup is impressive:

  • OpenCL 1 CPU: 1.4 Gflops
  • OpenCL 2 CPU: 2.8 Gflops
  • OpenCL 4 CPU: 6 Gflops
  • OpenCL GPU: 250 Gflops

Single-CPU, a compiler still seems to have an edge over OpenCL CPU:

  • OpenCL 1 CPU: 1.4 Gflops
  • g++ : 1.5 Gflops
  • intel C : 2.15 Gflops
  • intel fortran: 2.20 Gflops

In 4 CPU:

  • OpenCL 4 CPU: 6.2 Gflops
  • fortran openMP: 9.7 Gflops
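To put the figures above side by side, the short sketch below divides each native result by the OpenCL CPU result at the same thread count. The numbers are the ones quoted in this post; the helper name `advantage` is only for illustration.

```python
# Throughput figures quoted in the post above, in Gflops.
opencl_1cpu = 1.4
ifort_1cpu = 2.20
opencl_4cpu = 6.2
ifort_openmp_4cpu = 9.7


def advantage(native_gflops, opencl_gflops):
    """Return how many times faster the native build is than OpenCL CPU."""
    return native_gflops / opencl_gflops


print(f"1 CPU: native is {advantage(ifort_1cpu, opencl_1cpu):.2f}x OpenCL CPU")
print(f"4 CPU: native is {advantage(ifort_openmp_4cpu, opencl_4cpu):.2f}x OpenCL CPU")
```

Both ratios come out at roughly 1.6, so on these numbers the gap to native code does not widen when going from one to four cores.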

My main goal here is to evaluate the viability of the CPU mode for OpenCL code, and so far it looks well worth keeping a native CPU compilation branch.

 

Any comment on this? Am I looking at the worst test case for this?

(Yes, I know that in the end it's really algorithm dependent, and that a speedup on one problem might not carry over to another. Also, the CPU OpenCL driver is quite new and subject to improvement.)

9 Replies
eduardoschardong
Journeyman III

Benchmark CPU/GPU

In the OpenMP case, was the compiler able to use SSE? Could you share the source?

 

_Big_Mac_
Journeyman III

Benchmark CPU/GPU

I'm also interested in whether this is compared to a vectorized (SSE) native implementation.

fir3ball
Journeyman III

Benchmark CPU/GPU

No, I don't think this is suitable for SSE3 vectorization, at least not with this implementation.

(Code is attached)

But I think the compiler did the right things given the situation:

oclNbodyGold_f_inlined.f90(11): (col. 7) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
oclNbodyGold_f_inlined.f90(9): (col. 7) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
oclNbodyGold_f_inlined.f90(7): (col. 3) remark: LOOP WAS VECTORIZED.

The output assembly has its share of xmm* register accesses, but it's not that kind of vectorization. Here are additional compiler diagnostics:

oclNbodyGold_f_inlined.f90(12): (col. 3) remark: loop was not vectorized: not inner loop.
oclNbodyGold_f_inlined.f90(13): (col. 5) remark: loop was not vectorized: low trip count.
oclNbodyGold_f_inlined.f90(15): (col. 5) remark: loop was not vectorized: existence of vector dependence.
oclNbodyGold_f_inlined.f90(30): (col. 5) remark: loop was not vectorized: low trip count.

OpenMP parallelization roughly means that the loop is cut into 4 pieces and executed in 4 threads. I'm pretty sure ATI's CPU OpenCL does the same thing, only a tad slower than the fully native code.
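As a rough illustration of that loop splitting, the sketch below computes the index ranges a static schedule could hand to each thread. This is an assumption for illustration only: OpenMP leaves the default schedule up to the implementation, and `static_chunks` is a made-up helper, not an OpenMP or OpenCL API.

```python
def static_chunks(n, threads):
    """Split range(n) into `threads` contiguous, near-equal half-open ranges."""
    base, extra = divmod(n, threads)
    chunks = []
    start = 0
    for t in range(threads):
        # The first `extra` threads take one additional iteration.
        size = base + (1 if t < extra else 0)
        chunks.append((start, start + size))
        start += size
    return chunks


# Hypothetical body count, split over 4 threads as in the OpenMP run.
for thread_id, (lo, hi) in enumerate(static_chunks(1024, 4)):
    print(f"thread {thread_id}: bodies {lo}..{hi - 1}")
```

Each thread then runs the full inner loop over all bodies for its own slice of the outer loop, which is why the work divides so evenly for an all-pairs n-body step.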

Edit: Single-threaded, this code produces the same output as the OpenCL kernels used for benchmarking. The speedup is not a dumb error that would lead to overzealous optimization by the compiler 🙂

 

SUBROUTINE computeGold(force, pos, numBodies, softeningSquared, delT)
  integer*4 numBodies, i, j, k
  real*4 force(4,numBodies), pos(4,numBodies), softeningSquared, delT
  real*4 f(3), r(3), acc(3), distSqr, invDist, invDistCube, s
  force = 0
  !$OMP PARALLEL
  ! r and s are written by every thread, so they are listed as private too
  !$OMP DO PRIVATE(f, r, s, distSqr, invDist, invDistCube)
  do i = 1, numBodies
    f = 0
    ! accumulate the softened contribution of every body j on body i
    do j = 1, numBodies
      r(1) = pos(1,j) - pos(1,i)
      r(2) = pos(2,j) - pos(2,i)
      r(3) = pos(3,j) - pos(3,i)
      distSqr = r(1)*r(1) + r(2)*r(2) + r(3)*r(3)
      distSqr = distSqr + softeningSquared
      invDist = 1/sqrt(distSqr)
      invDistCube = invDist*invDist*invDist
      s = pos(4,j)*invDistCube
      f = f + r*s
    end do
    do k = 1, 3
      pos(k,i) = pos(k,i) + force(k,i) * delT + 0.5 * f(k) * delT * delT
      force(k,i) = force(k,i) * delT
    end do
  end do
  !$OMP END PARALLEL
END
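For checking results on a handful of bodies, the acceleration sum from the subroutine above can be transliterated into plain Python. This is only a sketch: `accelerations` is a name chosen here, positions are assumed to be `(x, y, z, mass)` tuples mirroring `pos(4,numBodies)`, and the position update step is left out.

```python
import math


def accelerations(pos, softening_squared):
    """Return the softened gravitational acceleration [ax, ay, az] per body.

    `pos` is a list of (x, y, z, mass) tuples. softening_squared must be > 0,
    otherwise the i == j term divides by zero.
    """
    result = []
    for xi, yi, zi, _ in pos:
        f = [0.0, 0.0, 0.0]
        for xj, yj, zj, mj in pos:
            r = (xj - xi, yj - yi, zj - zi)
            dist_sqr = r[0] * r[0] + r[1] * r[1] + r[2] * r[2] + softening_squared
            inv_dist = 1.0 / math.sqrt(dist_sqr)
            s = mj * inv_dist ** 3
            for k in range(3):
                f[k] += r[k] * s
        result.append(f)
    return result


# Two unit masses one unit apart; the softening value is arbitrary.
print(accelerations([(0.0, 0.0, 0.0, 1.0), (1.0, 0.0, 0.0, 1.0)], 3.0))
```

Like the Fortran loop, this includes the `i == j` term; it contributes nothing because the separation vector is zero while the softening keeps the denominator finite.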

vignyan
Journeyman III

Benchmark CPU/GPU

Hi fir3ball, 

Do you have numbers for matrix multiply in a similar experiment? I got much better performance with OpenCL on the CPU than with a normal CPU function. However, I did not thread my plain CPU function, so on my dual-core computer the OpenCL version is expected to run faster.

Did you use multi-threading in your program while running on CPU alone? 

lagacep
Journeyman III

Benchmark CPU/GPU

(fir3ball, wrong account logged in)

> Do you have numbers for the matrix multiply for similar experiment?

Not yet, but this first experiment piqued my interest. 

> Did you use multi-threading in your program while running on CPU alone?

Yes, the OpenMP program is the equivalent of running 4-cpu multithreading, but the compiler did the job automatically.

 

For an early implementation, the ATI OpenCL CPU implementation is already pretty good, but there is room for improvement to match native code. The cool thing would be to have a single OpenCL code base that runs everywhere.

genaganna
Journeyman III

Benchmark CPU/GPU


Fir3ball,

Is it possible to give complete source code? How are you calculating Gflops in the OpenCL-CPU case?

fir3ball
Journeyman III

Benchmark CPU/GPU

Originally posted by: genaganna

> Is it possible to give complete source code? How are you calculating Gflops in the OpenCL-CPU case?

The OpenCL code is simply the one from the n-body SDK sample.
To run it in CPU mode, pass "--device cpu".


The GFlops computation is the one from the SDK sample. It combines the number of bodies with the execution time. I'm not sure how accurate it really is, but it is meant more as a comparative measure, since the C++/FORTRAN kernel is doing the same work.
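As a hedged sketch of how such a figure can be derived: count the pairwise interactions per step, multiply by an assumed cost per interaction, and divide by the elapsed time. The 20 flops per interaction used below is a figure often used for n-body benchmarks and is an assumption here, not necessarily what the SDK sample uses; `nbody_gflops` is a hypothetical helper name.

```python
def nbody_gflops(num_bodies, seconds, steps=1, flops_per_interaction=20):
    """Estimate Gflops for an all-pairs n-body run.

    flops_per_interaction=20 is an assumed convention, not a value taken
    from the SDK sample.
    """
    interactions = num_bodies * num_bodies * steps
    return interactions * flops_per_interaction / seconds / 1e9


# Hypothetical run: 1024 bodies, 100 steps, 1.5 seconds of kernel time.
print(f"{nbody_gflops(1024, 1.5, steps=100):.2f} Gflops")
```

Because the same formula is applied to every implementation, the assumed constant cancels out when comparing them; only the timings decide the ranking.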

ryta1203
Journeyman III

Benchmark CPU/GPU

So the OpenCL version is from the SDK but the fortran version is yours?

I'd have to say that's not an accurate comparison.

fir3ball
Journeyman III

Benchmark CPU/GPU

Originally posted by: ryta1203

> So the OpenCL version is from the SDK but the fortran version is yours?
>
> I'd have to say that's not an accurate comparison.

 

Well, the end result is the same: the output can be verified against the exact same data, and the unit of measure here is based purely on timing.

To be fair to CL+CPU, the CL kernel would need to be optimized for best performance on a CPU within the OpenCL framework, since kernels might be tweaked for ATI/nVidia. But that would defeat the "write once in CL, run everywhere" scenario.

Bottom line (for me), if I need top-notch performance on a CPU (in the absence of a GPU), I might still need to compile a C/FORTRAN version of the function.

 
