IIRC, there are basically three ways to get reads and writes faster than plain global memory access:
1. Use cached buffers.
2. Use images.
3. Use the constant cache.
Refer to the constant memory bandwidth and Buffer Bandwidth samples for details. Also see chapter 4 of the OpenCL Programming Guide (Memory Transfer Optimizations).
If you use the constant address space, the data is cached. Images and read-only global buffers declared with restrict are also cached.
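To make the three options concrete, here is a rough sketch of what the kernel signatures might look like. It is written the way host code usually carries kernels, as a C string you would hand to clCreateProgramWithSource. The kernel names (scale_buffer, scale_image, scale_constant), the argument names, and the helper kernel_src_len are made up for illustration and are not taken from the SDK samples. Whether a given load actually hits a cache depends on the device and compiler, so treat this as a starting point rather than a guarantee.

```c
#include <stddef.h>
#include <string.h>

/* OpenCL C source for three hypothetical kernels, one per cached read path.
   All kernel and argument names are illustrative assumptions. */
static const char *kernel_src =
    /* 1. Read-only global buffer: const plus restrict tells the compiler
          the input is neither written nor aliased, which lets it consider
          the cached read path. */
    "__kernel void scale_buffer(__global const float * restrict in,\n"
    "                           __global float *out)\n"
    "{\n"
    "    size_t i = get_global_id(0);\n"
    "    out[i] = 2.0f * in[i];\n"
    "}\n"
    "\n"
    /* 2. Image: reads go through read_imagef with a sampler. */
    "__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |\n"
    "                           CLK_ADDRESS_CLAMP_TO_EDGE |\n"
    "                           CLK_FILTER_NEAREST;\n"
    "\n"
    "__kernel void scale_image(__read_only image2d_t in,\n"
    "                          __global float4 *out)\n"
    "{\n"
    "    int2 p = (int2)((int)get_global_id(0), (int)get_global_id(1));\n"
    "    out[p.y * get_image_width(in) + p.x] = 2.0f * read_imagef(in, smp, p);\n"
    "}\n"
    "\n"
    /* 3. Constant address space: the argument lives in __constant memory. */
    "__kernel void scale_constant(__constant float *coeff,\n"
    "                             __global float *out)\n"
    "{\n"
    "    size_t i = get_global_id(0);\n"
    "    out[i] = 2.0f * coeff[i];\n"
    "}\n";

/* Length of the source string, e.g. for the lengths argument of
   clCreateProgramWithSource. */
static size_t kernel_src_len(void)
{
    return strlen(kernel_src);
}
```

On the host side the only difference between the three is how the input is created and bound: a regular cl_mem buffer for the first and third kernels, and an image object for the second. Keep in mind that constant memory is limited in size (query CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE), so it only suits small lookup data.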