I can find plenty of help and examples on writing and tuning kernel code for GPUs, but I haven't been able to find much on performance tuning kernels for CPUs, in particular on structuring code and logic to get the most benefit from SSE. How can one structure loops and other code to maximize the likelihood that SSE will be used?
Also, it seems to me that a kernel written and tuned for a GPU would not be optimal on a CPU. Clearly, if one has ordinary multi-threaded programming experience, the basic concepts for structuring the kernel are fairly easy. I'm not looking for something that basic, but rather something that ties in how the kernel code is compiled and how to leverage that.
Using float4/int4 definitely seems to be a huge part of it, for SSE anyway, which I think is all OpenCL supports. But it seems that only single-line operations on 4-element vectors get converted. I suppose that makes sense, but it's going to be hard for me to figure out how to do that successfully with image data.
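To make the float4 idea concrete for image data, here is a minimal sketch in plain C rather than OpenCL C, so it can be compiled and checked on its own. The `pixel4` type and the `scale_bias_rgba` function are hypothetical names made up for illustration; `pixel4` just stands in for OpenCL's `float4` holding one RGBA pixel. The assumption is that each pixel is four packed floats and that the same operation applies to every channel, which is the shape of work that maps onto a 4-wide SSE register. Whether a given compiler actually emits SSE for it is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for OpenCL's float4: one RGBA pixel as four packed floats. */
typedef struct { float v[4]; } pixel4;

/*
 * Apply out = in * gain + bias to every channel of every pixel.
 * The inner loop has a fixed trip count of 4, no branches, and contiguous
 * loads/stores; 'restrict' promises the buffers do not overlap. These are
 * the properties that give a compiler a reasonable chance of using one
 * 4-wide multiply and add per pixel.
 */
void scale_bias_rgba(const pixel4 *restrict in, pixel4 *restrict out,
                     size_t n_pixels, float gain, float bias)
{
    assert(in != NULL && out != NULL);

    for (size_t i = 0; i < n_pixels; ++i) {
        for (int c = 0; c < 4; ++c) {
            out[i].v[c] = in[i].v[c] * gain + bias;
        }
    }
}
```

In an OpenCL kernel the inner loop would collapse to a single statement on `float4` values, something like `out[i] = in[i] * gain + bias;`, which is the kind of single-line vector operation described above.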
I also found these presentations by Intel. If one ignores the AVX stuff, the SSE and surrounding information is still valuable:
Thanks for the pointer to the Parallel Min() doc. It points out some interesting ways to calculate optimal access patterns at runtime.