How would you normally do it on the CPU? I would assume an unrolled SSE loop on each core. So do the same thing in OpenCL, but for convenience you can use OpenCL's built-in vector types (such as float4) instead of SSE intrinsics.
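To make the comparison concrete, here is a rough sketch of both versions. The operation (element-wise addition of two float arrays) and the names `add_arrays` and `add_kernel_src` are my own assumptions for illustration, since the question doesn't say what is actually being computed. The C function is the unrolled SSE loop you'd run on each core; the string holds what I'd expect the equivalent OpenCL C kernel to look like, using `float4` in place of the intrinsics.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical operation, assumed for illustration: out[i] = a[i] + b[i]. */
void add_arrays(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    /* Unrolled by two: 2 SSE registers (8 floats) per iteration.
       Unaligned loads/stores so the caller need not align the buffers. */
    for (; i + 8 <= n; i += 8) {
        __m128 lo = _mm_add_ps(_mm_loadu_ps(a + i),     _mm_loadu_ps(b + i));
        __m128 hi = _mm_add_ps(_mm_loadu_ps(a + i + 4), _mm_loadu_ps(b + i + 4));
        _mm_storeu_ps(out + i,     lo);
        _mm_storeu_ps(out + i + 4, hi);
    }
#endif
    /* Scalar tail; also the whole loop when SSE is not available. */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}

/* The same operation as an OpenCL C kernel, stored as a source string the
   way a host program would pass it to clCreateProgramWithSource.
   Assumes n is a multiple of 4 and a global work size of n / 4. */
const char *add_kernel_src =
    "__kernel void add_arrays(__global const float4 *a,\n"
    "                         __global const float4 *b,\n"
    "                         __global float4 *out)\n"
    "{\n"
    "    size_t i = get_global_id(0);\n"
    "    out[i] = a[i] + b[i]; /* one float4 per work-item */\n"
    "}\n";
```

In the kernel the runtime spreads the work-items across cores for you, so the explicit per-core loop goes away. Whether the `float4` arithmetic actually ends up as SSE instructions depends on the OpenCL implementation's compiler.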
It depends on what you're trying to achieve, though. If you're just doing a massive memcpy, you might be better off creating a native kernel (enqueued with clEnqueueNativeKernel, if your device supports native kernels) that calls an external library with a parallel memcpy routine.