I can find plenty of help and examples for writing and tuning kernel code for GPUs, but I haven't found much on performance tuning for kernels that run on the CPU, in particular on structuring code and logic to get the most benefit from SSE. How can one structure loops and other code to maximize the likelihood that SSE will be used?
Also, it seems to me that a kernel written and tuned for a GPU would not be optimal on a CPU. Clearly, if one has ordinary multi-threaded programming experience, the basic concepts for structuring the kernel are fairly easy, and I'm not looking for something that basic. I'm looking for something that ties in how the kernel code is compiled and how to leverage that.