I am not familiar with the algorithm. But for accessing an indexed array fast you can use any of these alternatives depending on the access pattern and data properties:
Use constant cache if the data is known before compile time. You need to provide static indexing in the kernel in order to store data inside constant cache.The data limit is quite low and may vary from gpu to gpu.
Use LDS, if you need to fetch the same data elements many times or you need to fetch a particular set of data from a particular workgroup.LDS is common to all work items in a compute unit.
You can have the benifit of texture cache by using images.
Refer to openCL Programming Guide for details
May I ask which cipher you are implemneting ?
If the S-box is 8 bits in and 8 bits out, packing it into 64 register, and using indexed registers (IL feature) should give good speed.