My reply is slightly off topic, but the short version is: have you run your code through CodeAnalyst? It's free, so if you're going to complain about optimized code running slow, you'll probably want to.
It should show you if you're consistently generating cache misses, mispredicting branches, or not taking full advantage of out-of-order execution (getting your loads issued as early as possible, and not creating dependencies on stores), etc. That should make it pretty easy to speed things up, especially if you're working in ASM. It will most certainly help C/C++ code as well; you just don't get 100% control over the hardware, so you have to be a little more creative in getting the compiler to do the right thing. 🙂
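To make the cache-miss point concrete, here's a minimal sketch in C. The function names and the flat row-major layout are my own assumptions for illustration, not anything from your code. Both functions compute the same sum, but the second one strides through memory, which is the kind of access pattern a profiler would typically flag as a cache-miss hotspot on a large array.

```c
#include <stddef.h>

/* Hypothetical example: `m` is a rows x cols matrix stored row-major
   in one flat array, so element (r, c) lives at m[r * cols + c]. */

/* Walks memory in address order: consecutive iterations touch
   neighbouring ints, so every cache line that gets fetched is fully used. */
long sum_row_major(const int *m, size_t rows, size_t cols)
{
    long total = 0;
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];
    return total;
}

/* Same result, but the inner loop jumps `cols` ints between accesses.
   Once the matrix is bigger than the cache, this tends to miss far
   more often than the row-major walk. */
long sum_col_major(const int *m, size_t rows, size_t cols)
{
    long total = 0;
    for (size_t c = 0; c < cols; ++c)
        for (size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];
    return total;
}
```

How big the gap is depends on the matrix size and the chip, so measure it rather than taking my word for it.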
I have no idea how much time either major x86 chip vendor is spending on OpenCL for the CPU, but if I had to guess, I'd say Intel is probably doing more, because they're betting their GPU market on x86 with Knights Ferry. Then again, those x86 cores are stripped down, so the same optimization techniques may not apply.
Yeah, I'd say if I wanted something fast on the CPU, I'd look at CodeAnalyst and VTune (err, "Parallel Studio XE", so you've got their optimizing compiler as well), and generate specific versions of the code targeting specific microarchitectures. Although I'd probably not waste my time with NetBurst. It's just all-around trash. Can you tell I dislike NetBurst? 🙂
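For what "specific versions of the code" could look like in practice, here's a rough sketch assuming GCC or Clang on x86. It uses their `target` function attribute and the `__builtin_cpu_supports` builtin; other compilers need a different mechanism, so the whole thing is guarded and falls back to the generic path. The `scale` routine, the `HAVE_X86_DISPATCH` macro and the choice of AVX2 are all hypothetical, just something to hang the dispatch on.

```c
#include <stddef.h>

/* Assumption: only GCC/Clang on x86 provide the builtins used below. */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_DISPATCH 1
#else
#define HAVE_X86_DISPATCH 0
#endif

/* Baseline version: built with whatever flags the rest of the file uses. */
static void scale_generic(float *dst, const float *src, float k, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}

#if HAVE_X86_DISPATCH
/* Same source, but the compiler is allowed to use AVX2 instructions
   inside this one function only. */
__attribute__((target("avx2")))
static void scale_avx2(float *dst, const float *src, float k, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}
#endif

typedef void (*scale_fn)(float *, const float *, float, size_t);

/* Decide which version to run, based on the CPU we are actually on. */
static scale_fn pick_scale(void)
{
#if HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return scale_avx2;
#endif
    return scale_generic;
}

/* Public entry point. The lazy init is not synchronised; that is
   harmless here because every thread would pick the same pointer. */
void scale(float *dst, const float *src, float k, size_t n)
{
    static scale_fn impl = NULL;
    if (!impl)
        impl = pick_scale();
    impl(dst, src, k, n);
}
```

Callers just use `scale(dst, src, 2.0f, n)` and never see which variant ran. Whether the specialised version actually wins is something to check in the profiler, not assume.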
Edit: Or I could be wrong. Micah posted in a different thread comparing CPU and GPU performance on a Mac, with this link: http://dl.acm.org/citation.cfm?id=1854302 Seems AMD might be doing more to optimize OpenCL on the CPU 🙂