How does OpenCL "CPU" code typically compare to standard compiled CPU code (with C++/Fortran compilers)?
At first glance, looking at the n-body sample, the CPU/GPU speedup is impressive:
- OpenCL 1 CPU: 1.4 Gflops
- OpenCL 2 CPU: 2.8 Gflops
- OpenCL 4 CPU: 6 Gflops
- OpenCL GPU: 250 Gflops
On a single CPU, a native compiler still seems to have an edge over OpenCL CPU:
- OpenCL 1 CPU: 1.4 Gflops
- g++: 1.5 Gflops
- Intel C: 2.15 Gflops
- Intel Fortran: 2.20 Gflops
On 4 CPUs:
- OpenCL 4 CPU: 6.2 Gflops
- Fortran OpenMP: 9.7 Gflops
My main goal here is to evaluate the viability of OpenCL's CPU mode, and so far it is really worth keeping a native CPU compilation branch.
Any comments on this? Am I looking at the worst test case for this?
(Yes, I know that in the end it's really algorithm dependent... and that any speedup might not translate well to every problem. Also, the CPU OpenCL driver is quite new and subject to improvement.)
In the OpenMP case, was the compiler able to use SSE? Could you share the source?
No, I would think this is not suitable for SSE3 vectorization, at least not with this implementation.
(Code is attached)
But I think that the compiler did the right things given the situation:
oclNbodyGold_f_inlined.f90(11): (col. 7) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
oclNbodyGold_f_inlined.f90(9): (col. 7) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
oclNbodyGold_f_inlined.f90(7): (col. 3) remark: LOOP WAS VECTORIZED.
The output assembly has its share of xmm* register accesses, but it's not that kind of vectorization. Here are additional compiler diagnostics:
oclNbodyGold_f_inlined.f90(12): (col. 3) remark: loop was not vectorized: not inner loop.
oclNbodyGold_f_inlined.f90(13): (col. 5) remark: loop was not vectorized: low trip count.
oclNbodyGold_f_inlined.f90(15): (col. 5) remark: loop was not vectorized: existence of vector dependence.
oclNbodyGold_f_inlined.f90(30): (col. 5) remark: loop was not vectorized: low trip count.
OpenMP parallelization roughly means the loop is cut in 4 and executed on 4 threads; I'm pretty sure ATI's CPU OpenCL does the same thing, just a tad slower than the fully native code.
Edit: Single-threaded, this code has the same output as the OpenCL kernels used for benchmarking. The speedup is not a dumb error that would lead to overzealous optimization by the compiler 🙂
SUBROUTINE computeGold( force, pos, numBodies, softeningSquared, delT )
  real*4 force(4,numBodies), pos(4,numBodies), softeningSquared, f(3)
  real*4 r(3), invDist, invDistCube, delT, acc(3)
  real*4 distSqr, s
  integer*4 numBodies, i, j, k

  force = 0
!$OMP PARALLEL
!$OMP DO PRIVATE(f,r,s,distSqr,invDist,invDistCube)
  do i = 1, numBodies
    f = 0
    do j = 1, numBodies
      r(1) = pos(1,j) - pos(1,i)
      r(2) = pos(2,j) - pos(2,i)
      r(3) = pos(3,j) - pos(3,i)
      distSqr = r(1)*r(1) + r(2)*r(2) + r(3)*r(3)
      distSqr = distSqr + softeningSquared
      invDist = 1/sqrt(distSqr)
      invDistCube = invDist*invDist*invDist
      s = pos(4,j)*invDistCube
      f = f + r*s
    end do
    do k = 1, 3
      pos(k,i) = pos(k,i) + force(k,i) * delT + 0.5 * f(k) * delT * delT
      force(k,i) = force(k,i) * delT
    end do
  end do
!$OMP END PARALLEL
END
I'm also interested in whether this is compared to a vectorized (SSE) native implementation.
Hi fir3ball,
Do you have numbers for a similar experiment with matrix multiply? I got much better performance on my OpenCL CPU compared to a normal CPU function. However, I did not thread my single-CPU function, in which case, on my dual-core computer, it should run faster using OpenCL.
Did you use multi-threading in your program while running on CPU alone?
(fir3ball, wrong account logged in)
> Do you have numbers for the matrix multiply for similar experiment?
Not yet, but this first experiment piqued my interest.
> Did you use multi-threading in your program while running on CPU alone?
Yes, the OpenMP program is the equivalent of 4-CPU multithreading, but the compiler did the job automatically.
For an early implementation, ATI's OpenCL CPU implementation is already pretty good, but there is room for improvement to match native code. The cool thing would be to have a single OpenCL code base and run it everywhere.
Originally posted by: fir3ball How does OpenCL "CPU" code typically compare to standard compiled CPU code (with C++/Fortran compilers)? At first glance, looking at the n-body sample, the CPU/GPU speedup is impressive:
- OpenCL 1 CPU: 1.4 Gflops
- OpenCL 2 CPU: 2.8 Gflops
- OpenCL 4 CPU: 6 Gflops
- OpenCL GPU: 250 Gflops
On a single CPU, a native compiler still seems to have an edge over OpenCL CPU:
- OpenCL 1 CPU: 1.4 Gflops
- g++: 1.5 Gflops
- Intel C: 2.15 Gflops
- Intel Fortran: 2.20 Gflops
On 4 CPUs:
- OpenCL 4 CPU: 6.2 Gflops
- Fortran OpenMP: 9.7 Gflops
My main goal here is to evaluate the viability of OpenCL's CPU mode, and so far it is really worth keeping a native CPU compilation branch.
Any comments on this? Am I looking at the worst test case for this?
(Yes, I know that in the end it's really algorithm dependent... and that any speedup might not translate well to every problem. Also, the CPU OpenCL driver is quite new and subject to improvement.)
Fir3ball,
Is it possible to give the complete source code? How are you calculating Gflops in the OpenCL-CPU case?
Originally posted by: genaganna
Is it possible to give the complete source code? How are you calculating Gflops in the OpenCL-CPU case?
The OpenCL code is simply the one from the n-body SDK sample.
To run it in CPU mode, it's invoked with "--device cpu".
The GFlops computation is the one from the SDK sample; it's a combination of the body count and the execution time. I'm not sure of its real accuracy, but it's more a comparative measure, as the C++/FORTRAN kernel is doing the same work.
So the OpenCL version is from the SDK but the fortran version is yours?
I'd have to say that's not an accurate comparison.
Originally posted by: ryta1203 So the OpenCL version is from the SDK but the fortran version is yours?
I'd have to say that's not an accurate comparison.
Well, the end result is the same: the outputs can be verified to be the exact same data, and the unit of measure here is based on pure timing.
To be fair to CL+CPU, the CL kernel would need to be optimized for best performance on a CPU within the OpenCL framework, just as kernels might be tweaked for ATI/nVidia. But that would defeat the "write once in CL, run everywhere" scenario.
Bottom line (for me): if I need top-notch performance on a CPU (in the absence of a GPU), I might still need to compile a C/FORTRAN version of the function.