I took a standard NVIDIA OpenCL sample (one that simply blurs an input image) and ran it on my MacBook Air (MBA), which has an NVIDIA 320M GPU, using the Apple-supplied OpenCL 1.0 framework. I wanted to compare it with the AMD APP SDK, so I ran a Windows XP SP3 session on the MBA using VMware Fusion 3.1.3. I could not access the GPU from the virtual XP environment, but fortunately the APP SDK allowed me to run the OpenCL kernel on the CPU. I also tried equivalent plain CPU code that does the same thing as the OpenCL kernel, compiled with GCC on the Mac and Visual Studio 2008 on XP. The equivalent code ran at more or less the same speed on both Mac and XP, and much slower than the GPU code on the Mac, as expected. I was expecting the OpenCL kernel run on the CPU under XP through the APP SDK to show similarly slow performance. However, I was pleasantly surprised to see that it ran much faster, just a tad slower than the Mac GPU code. The following link shows a video of all four configurations.
At this stage I'm not sure why the AMD APP SDK OpenCL kernel, when run on the CPU, is much faster than the equivalent code compiled with GCC or VS 2008.
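
For context, the "equivalent code" in a comparison like this is usually a plain single-threaded loop. Below is a minimal sketch of what such a loop might look like, assuming a single-channel 8-bit image and a simple box filter. The function name `box_blur_u8`, the `radius` parameter, and the edge handling are all hypothetical; this is not the code that was actually benchmarked, and the NVIDIA sample may use a different filter.

```c
#include <stddef.h>

/* Hypothetical scalar box blur over a single-channel 8-bit image.
 * Assumptions (not from the original post): the name, the radius
 * parameter, and the edge policy. Pixels that fall outside the image
 * are skipped, so border pixels average over fewer neighbours. */
void box_blur_u8(const unsigned char *src, unsigned char *dst,
                 int width, int height, int radius)
{
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            unsigned int sum = 0;   /* running total of in-bounds neighbours */
            unsigned int count = 0; /* how many neighbours were in bounds */

            for (int dy = -radius; dy <= radius; ++dy) {
                int yy = y + dy;
                if (yy < 0 || yy >= height)
                    continue;
                for (int dx = -radius; dx <= radius; ++dx) {
                    int xx = x + dx;
                    if (xx < 0 || xx >= width)
                        continue;
                    sum += src[(size_t)yy * width + xx];
                    ++count;
                }
            }

            /* count is at least 1 because the pixel itself is in bounds */
            dst[(size_t)y * width + x] = (unsigned char)(sum / count);
        }
    }
}
```

A loop like this runs on a single core and depends on the compiler for any vectorization, whereas an OpenCL CPU runtime can spread work-items across all available cores. That may account for part of the gap, though I have not verified it for this particular setup.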