As I see it, the only reasonable implementation hostside is to use native endianness for vector operations.
My reason for this article of faith is the portability appendix in the standard: it says OpenCL handles endianness differences between host and GPU automatically as long as you work with full vector types. That would imply element swapping on the OpenCL side if the endianness differs.
It would also imply that nothing should be swapped if you were using a CPU device, because the host and the device presumably share endianness in that case.
So I guess it boils down to this: host order isn't implementation-defined, it is simply defined by the endianness of the host. If you use intrinsics and vector operations consistently hostside there should not be any issues, but if parts of your code make assumptions about endianness you probably need separate code paths for big and little endian. Assuming you care about that kind of portability, that is. You're not likely to find implementation differences within the PC world.
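To make the "separate code paths" point concrete, here is a minimal sketch in plain C, not anything taken from the standard. The helper names `host_is_little_endian` and `load_le32` are made up for illustration. The first one detects host byte order at run time; the second shows the usual way to sidestep the question by assembling a value from bytes with shifts instead of reinterpreting memory.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 on a little-endian host, 0 on a big-endian one.
   It inspects the first byte in memory of a 32-bit value holding 1. */
static int host_is_little_endian(void)
{
    const uint32_t probe = 1u;
    unsigned char first_byte;
    memcpy(&first_byte, &probe, 1);
    return first_byte == 1;
}

/* Hypothetical helper: reads a 32-bit value stored in little-endian byte order.
   Shifts operate on values, not on memory layout, so this gives the same
   result on big- and little-endian hosts and needs no separate code path. */
static uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

Code that only touches whole vectors or whole elements never needs either helper. They only matter where bytes of an element are picked apart by hand. For the device side, `clGetDeviceInfo` with `CL_DEVICE_ENDIAN_LITTLE` reports the device's byte order.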
Yeah, my understanding is also that it should be safe as long as I make no assumptions about endianness on my part. Also, I just read in the standard about the .s[<index>] way of accessing vector elements. Based on some other posts I've read, I actually thought the standard didn't specify a way of accessing vector elements in host code, but it seems I was wrong.
So it seems the standard mandates what I need and more, so everything is peachy.
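For anyone finding this later, a rough sketch of what the .s[<index>] access looks like in host code. So that it compiles without the OpenCL headers, it uses a stand-in union named `int4_standin`, which is my own invention and only mimics the array member of `cl_int4`. In real host code you would include `<CL/cl.h>` and use `cl_int4` directly. The function `sum_elements` is likewise just an illustration.

```c
#include <stdint.h>

/* Hypothetical stand-in for cl_int4 so this compiles without <CL/cl.h>.
   The real type also carries alignment requirements that are omitted here. */
typedef union {
    int32_t s[4];
} int4_standin;

/* Sums the four elements by index. Only whole elements are read,
   so the code makes no assumption about the host's byte order. */
static int32_t sum_elements(const int4_standin *v)
{
    int32_t total = 0;
    for (int i = 0; i < 4; ++i)
        total += v->s[i];
    return total;
}
```

Element access through `.s[i]` works on whole elements, so it stays within the "no assumptions about endianness" rule discussed above.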