I made this suggestion on the Khronos OpenCL forum, but I'd really like to see AMD implement it since you're my target platform (and hopefully sooner than it would take Khronos to add it to the OpenCL specification).
I ran into a case where it would be really useful to reverse the components of a vector. To create fully compliant BLAS functions (specifically BLAS1, since AMD's clBlas doesn't offer those yet), one must account for the case where the step between elements (e.g., incx) is negative. For AMD hardware in general, elements should be explicitly packed into vector types to get the best performance (using SSE and whatnot).
To handle negative and non-unit increments, I copy the relevant elements from low-to-high global memory addresses into low-to-high local memory. Then I do a vload from local memory into private memory, where I currently shuffle (if the increment is negative), compute, and shuffle back (again only if needed) before doing a vstore to local memory, and finally copy the result back into global memory.
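To make the copy step concrete, here is a minimal host-side sketch in plain C, not my actual kernel. It assumes the reference BLAS convention that, for a negative incx, logical element i lives at offset (n-1-i)*|incx|, and it assumes incx is nonzero. The name pack_strided is made up for this post.

```c
#include <stddef.h>

/* Hypothetical helper: copy the n logical elements of a BLAS-style strided
   vector into a contiguous buffer in ascending address order (the
   "low-to-high" copy described above). Assumes incx != 0.
   Returns 1 when incx < 0, meaning packed[] holds the elements in reversed
   logical order and a reversal is still pending; returns 0 otherwise. */
static int pack_strided(const float *x, size_t n, int incx, float *packed)
{
    /* Widen before negating so the most negative int cannot overflow. */
    long long wide = incx;
    size_t step = (size_t)(wide < 0 ? -wide : wide);

    for (size_t k = 0; k < n; ++k)
        packed[k] = x[k * step];   /* k-th element by address, not by logical index */

    return incx < 0;
}
```

With a negative increment, packed[k] holds logical element n-1-k, which is why the vector components still have to be reversed before the compute step.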
Since the OpenCL specification already includes .hi, .lo, .even, and .odd, I think .rev would be a natural addition. Of course I can continue to use the built-in shuffle function, but then I need to create a separate reverse mask for each vector length (2, 4, 8, and 16).
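For reference, this is roughly what the per-length masks amount to, written as a scalar model in plain C rather than as kernel code. The names make_reverse_mask and shuffle_model are made up for illustration; in OpenCL C itself the float4 case would be shuffle(v, (uint4)(3, 2, 1, 0)), with a different literal mask for each of the other widths.

```c
#include <stddef.h>

/* Hypothetical helper: fill mask[0..n-1] with the reversing permutation
   n-1, n-2, ..., 0. Here n stands for an OpenCL vector width (2, 4, 8, 16). */
static void make_reverse_mask(unsigned *mask, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        mask[i] = (unsigned)(n - 1 - i);
}

/* Scalar model of the one-input shuffle: out[i] = in[mask[i]].
   Like the built-in, only the low bits of each mask entry select a
   component, so n must be a power of two. in and out must not overlap. */
static void shuffle_model(const float *in, const unsigned *mask,
                          float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = in[mask[i] & (n - 1)];
}
```

A .rev suffix would replace all of these masks with a single spelling that works for every width, the same way .hi and .lo already do.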
Just a request; please let me know what you think about this.