We've got a very simple kernel here that copies one cl_float4 array into another, and yet it crashes when run on the CPU. The compiled x86 kernel consists of little more than some address computation and a movaps instruction. movaps requires a 16-byte-aligned memory operand, but the address it receives is only 8-byte aligned, so it faults. The same kernel works fine on the GPU.
Is this a known bug? Is there anything we can do to work around it?
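
To make the misalignment described above concrete, here is a minimal host-side sketch in C. It assumes the buffers are backed by host memory (for example, created with CL_MEM_USE_HOST_PTR), which the description above does not actually state, so this may not be where the bad address comes from. The helper names `is_aligned` and `alloc_float4_array` are hypothetical and only for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns 1 if ptr is a multiple of `alignment`
 * bytes. `alignment` must be a power of two. */
static int is_aligned(const void *ptr, size_t alignment) {
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}

/* Hypothetical helper: allocates storage for `count` float4 elements
 * (16 bytes each) on a 16-byte boundary, which is what movaps needs.
 * Uses C11 aligned_alloc; the size is always a multiple of 16.
 * Pass count > 0 and release the result with free(). */
static void *alloc_float4_array(size_t count) {
    return aligned_alloc(16, count * 16);
}
```

If a host pointer handed to the runtime fails `is_aligned(ptr, 16)`, allocating it with something like `alloc_float4_array` would be one thing to try. If the pointers are already 16-byte aligned, the bad address is presumably produced elsewhere.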