

joofa
Journeyman III

Interesting result regarding AMD APP CPU Kernel Vs. Apple OpenCL

Hi All,

I took a standard NVIDIA OpenCL sample (one that simply blurs an input image) and ran it on my MacBook Air (MBA), which has an NVIDIA 320M GPU, using the Apple-supplied OpenCL 1.0 framework. I wanted to compare it with the AMD APP SDK, so I ran a Windows XP SP3 session on the MBA using VMware Fusion 3.1.3. Apparently, I could not access the GPU under the virtual XP environment. Fortunately, the APP SDK allowed me to run the OpenCL kernel on the CPU. I also tried equivalent code on the CPU that does the same thing as the OpenCL kernel. The equivalent code was compiled using GCC on the Mac and Visual Studio 2008 on XP. The equivalent code ran at more or less the same speed on both Mac and XP, and much slower than the GPU code on the Mac, as expected. I was expecting the OpenCL kernel run on the CPU under XP through the APP SDK to show similarly slow performance. However, I was pleasantly surprised to see that it ran much faster, and just a tad slower than the Mac GPU code. The following link shows a video of all four configurations.

http://djjoofa.com/data/videos/osx_xp_opencl.mov

At this stage I'm not sure why the AMD APP SDK OpenCL kernel, when run on the CPU, is much faster than the equivalent GCC/VS 2008 compiled code.

Sincerely,

Joofa

 

0 Likes
8 Replies

You can find out more information on our CPU implementation here:
http://dl.acm.org/citation.cfm?id=1854302
0 Likes

Thanks Micah, I shall check it out.

 

0 Likes

You should use Boot Camp for testing. There will be a major performance hit when using virtualization.

0 Likes

notzed
Challenger

Originally posted by: joofa Hi All,

 

 I also tried equivalent code on the CPU that does the same thing as the OpenCL kernel. The equivalent code was compiled using GCC on the Mac and Visual Studio 2008 on XP.

..

 

At this stage I'm not sure why the AMD APP SDK OpenCL kernel, when run on the CPU, is much faster than the equivalent GCC/VS 2008 compiled code.

 

Without knowing what that 'equivalent code' is, there's no way to make any useful comment.  Is it using both CPU cores, for example?  Cache coherency?

FWIW, the few times I've tried it, my 'equivalent code' in Java has run at about the same speed as the AMD OpenCL CPU driver (when using the same number of CPU threads), so I suspect it has more to do with the implementation and algorithm than anything else.

OpenCL C does have a tighter memory model which allows for some optimisations more easily compared to C, but for the most part it is basically the same as using the 'restrict' keyword.  And work-group sizes provide a potentially cache-friendly blocking factor to most algorithms.
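To make those two points concrete, here is a minimal C sketch; it is not code from the NVIDIA sample. It assumes a hypothetical 3x3 box blur over a single-channel float image, and the names `TILE`, `blur_at`, `blur_rows` and `blur_tiled` are made up for illustration. The `restrict` qualifiers give a C compiler the no-aliasing promise mentioned above, and the tiled loop walks the image in 16x16 blocks, roughly the way a work-group-sized traversal might.

```c
#include <stddef.h>

#define TILE 16  /* assumed block edge, analogous to a 16x16 work-group */

/* Clamp an index into [0, max]. */
static int clampi(int v, int max) { return v < 0 ? 0 : (v > max ? max : v); }

/* 3x3 box blur of one pixel, with edge pixels clamped. */
static float blur_at(const float *src, int w, int h, int x, int y)
{
    float sum = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            sum += src[(size_t)clampi(y + dy, h - 1) * w + clampi(x + dx, w - 1)];
    return sum / 9.0f;
}

/* Plain row-major traversal; 'restrict' promises dst and src do not overlap. */
void blur_rows(float *restrict dst, const float *restrict src, int w, int h)
{
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            dst[(size_t)y * w + x] = blur_at(src, w, h, x, y);
}

/* Same result, but visited tile by tile so neighbouring rows stay in cache. */
void blur_tiled(float *restrict dst, const float *restrict src, int w, int h)
{
    for (int ty = 0; ty < h; ty += TILE)
        for (int tx = 0; tx < w; tx += TILE) {
            int y_end = ty + TILE < h ? ty + TILE : h;
            int x_end = tx + TILE < w ? tx + TILE : w;
            for (int y = ty; y < y_end; ++y)
                for (int x = tx; x < x_end; ++x)
                    dst[(size_t)y * w + x] = blur_at(src, w, h, x, y);
        }
}
```

Both functions produce identical output; whether the tiled version is actually faster depends on the image size and the cache, so it would need to be measured.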

 

0 Likes

Originally posted by: notzed

Without knowing what that 'equivalent code' is, there's no way to make any useful comment.  Is it using both CPU cores, for example?  Cache coherency?

 

Hi,

I did mention that I used a standard NVIDIA OpenCL sample from their SDK. The actual sample is called "oclPostprocessGL". The "equivalent code" is part of that sample, and includes a C language CPU-side implementation of the blurring code that otherwise runs on the GPU.

Sincerely,

Joofa

0 Likes

Right.  Well a couple of ideas:

a) the 'equivalent code' is executed on only one thread: OpenCL will use all CPU cores.  2x just there.

b) if you have the 'local mem' option turned on (as appears to be the default), the OpenCL implementation might be more cache friendly.

c) maybe the compiler is auto-vectorising some stuff/just working better with this type of problem (but really you wouldn't expect such a performance gap from this).

d) maybe it has a more efficient pathway to the GL interop (could be a big impact)

e) OpenCL might be implementing divide as a *(1/N), whereas GCC is using /.

f) I don't see why the simpler memory model would help much here, but you never know.
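Idea (e) can be sketched in plain C. This is only an illustration with hypothetical function names (`normalize_div`, `normalize_mul`), not what either compiler was confirmed to emit. GCC does not normally replace a division with a multiply by the reciprocal unless asked to (for example with `-ffast-math`), because the rounding can differ in the last bit; OpenCL has an optional `-cl-fast-relaxed-math` build option that permits this kind of shortcut.

```c
#include <stddef.h>

/* One division per element: exact, but division is slow on most CPUs. */
void normalize_div(float *out, const float *sums, size_t count, float n)
{
    for (size_t i = 0; i < count; ++i)
        out[i] = sums[i] / n;
}

/* Reciprocal computed once, then one multiply per element.
   Results may differ from normalize_div in the last bit when 1/n
   is not exactly representable as a float. */
void normalize_mul(float *out, const float *sums, size_t count, float n)
{
    const float inv = 1.0f / n;
    for (size_t i = 0; i < count; ++i)
        out[i] = sums[i] * inv;
}
```

For a blur the divisor is the number of samples in the window, which is constant across the whole image, so the reciprocal only has to be computed once.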

Apart from the 2x from the CPU threads (dual core = 2 h/w threads?), it does seem somewhat out of proportion though. 

The CPU output looks really slow for a modern machine, so my guess is that much of it is something beyond the compiled code itself.
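For idea (a), one way to check how much of the gap is just threading would be to spread the rows of the C version across cores. The sketch below is an assumption-laden example: it uses OpenMP (enabled in GCC with `-fopenmp`), a hypothetical function name `blur_rows_parallel`, and a simple radius-1 horizontal blur rather than the sample's actual filter. It says nothing about how the AMD runtime schedules work-groups internally.

```c
#include <stddef.h>

/* Hypothetical row-wise blur (radius 1, clamped edges).
   Each row is independent, so rows can be split across CPU threads.
   Built without OpenMP support, the pragma is ignored and the loop
   simply runs on one thread. */
void blur_rows_parallel(float *restrict dst, const float *restrict src,
                        int w, int h)
{
    #pragma omp parallel for
    for (int y = 0; y < h; ++y) {
        const float *in = src + (size_t)y * w;
        float *out = dst + (size_t)y * w;
        for (int x = 0; x < w; ++x) {
            float left = in[x > 0 ? x - 1 : 0];
            float right = in[x < w - 1 ? x + 1 : w - 1];
            out[x] = (left + in[x] + right) / 3.0f;
        }
    }
}
```

Timing the loop with and without OpenMP enabled would show whether threading alone accounts for roughly 2x on a dual-core machine, or whether something else explains the rest.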

 

0 Likes

Originally posted by: notzed Right.  Well a couple of ideas:

 

a) the 'equivalent code' is executed on only one thread: OpenCL will use all CPU cores.  2x just there.

Yes. However, in the video I posted the speed difference seems much bigger than 2x.

 

d) maybe it has a more efficient pathway to the GL interop (could be a big impact)

Not sure if the GL interop is the culprit here.

Apart from the 2x from the CPU threads (dual core = 2 h/w threads?), it does seem somewhat out of proportion though. 

Yes.

c) maybe the compiler is auto-vectorising some stuff/just working better with this type of problem (but really you wouldn't expect such a performance gap from this).

e) OpenCL might be implementing divide as a *(1/N), whereas GCC is using /.

f) I don't see why the simpler memory model would help much here, but you never know.

Here are the assembly language outputs from the various compilers:

NVIDIA 320M (sm12) on Mac GPU:

http://djjoofa.com/data/code/postprocessGL_cl_asm_NVIDIA_320M_s12.html

GCC 4.2.1 on Mac CPU:

http://djjoofa.com/data/code/postprocessGL_Host_asm_gcc_osx.html

AMD APP SDK on XP CPU:

http://djjoofa.com/data/code/postprocessGL_cl_asm_AMD_APP_CPU_xp.html

VS 2008 on XP CPU:

http://djjoofa.com/data/code/postprocessGL_Host_asm_vs2008_xp.html

MinGW GCC on XP CPU:

http://djjoofa.com/data/code/postprocessGL_Host_asm_mingw_xp.html

Joofa

0 Likes