I'm porting a prefix scan algorithm from the GPU to the CPU. The GPU version doesn't work as-is on the CPU, because the CPU has an effective wavefront size of 1, and the wavefront is critical to the algorithm's performance on the GPU. Is it possible to write similar OpenCL code for both devices and still get optimal performance on both? I don't know enough about AVX/SSE at this point to know whether this question even makes sense :>(.
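
To make the question concrete, here is a rough sketch in plain C (not OpenCL, and not the actual kernel) of the structure in question. It assumes the GPU version does a Hillis-Steele style scan across the lanes of one wavefront and then carries the block total forward. The names `scan_blocks`, `MAX_LANES`, and the `width` parameter are made up for illustration. With `width` set to the wavefront size, the inner steps model what the lanes would do in lockstep; with `width` equal to 1 the same code collapses into an ordinary serial running sum.

```c
#include <stddef.h>

#define MAX_LANES 64  /* hypothetical upper bound on the lane count */

/* Inclusive prefix sum of in[0..n) into out[0..n), processed in blocks of
 * `width` elements. Each block is scanned with Hillis-Steele steps, which is
 * what a group of lanes running in lockstep could do in parallel; here the
 * lanes are simply emulated with loops. */
void scan_blocks(const int *in, int *out, int n, int width)
{
    int carry = 0;  /* running total of all previous blocks */

    if (width < 1) width = 1;
    if (width > MAX_LANES) width = MAX_LANES;

    for (int base = 0; base < n; base += width) {
        int lanes = (n - base < width) ? (n - base) : width;
        int cur[MAX_LANES], next[MAX_LANES];

        for (int i = 0; i < lanes; i++)
            cur[i] = in[base + i];

        /* log2(lanes) steps: every lane adds the value `offset` lanes to its left. */
        for (int offset = 1; offset < lanes; offset *= 2) {
            for (int i = 0; i < lanes; i++)
                next[i] = (i >= offset) ? cur[i] + cur[i - offset] : cur[i];
            for (int i = 0; i < lanes; i++)
                cur[i] = next[i];
        }

        /* Add the carry from earlier blocks and remember the new total. */
        for (int i = 0; i < lanes; i++)
            out[base + i] = cur[i] + carry;
        carry = out[base + lanes - 1];
    }
}
```

Calling `scan_blocks(in, out, n, 1)` never enters the offset loop, so only the carry chain remains. That is the "effective wavefront size of 1" case, and the open question is whether a CPU OpenCL implementation can instead map a wider `width` onto SSE/AVX lanes.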