How well do algorithms that use look-up tables perform when you implement the table as an image for the parallel version? For instance, how well would the "Two Plus Two" Poker evaluator be expected to perform on a GPU if the 133MB look-up table were encoded as an image and read using samplers?
Do image reads need to be coalesced, like buffer reads, to get maximum performance?
See the section in the programming guide about memory tiling.
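In short: for maximum performance, image accesses should be 2D-coherent (neighbouring work-items reading texels that are close together in both x and y), and the cache is so small that the accesses have to be pretty closely coherent.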
For a random access pattern, a simple array might be better, unless the 8-bit 'float' access that images offer is useful.
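To give a rough feel for why the access pattern matters so much, here is a toy model in Python. It is only a sketch, not a measurement of any real GPU: the 4x4 tile size, the 8-tile LRU cache, and the names `hit_rate`, `coherent` and `scattered` are all made up for illustration. It counts how often a read lands in a tile that is already cached, first for reads that walk the image tile by tile, then for reads scattered at random the way look-ups in a large table tend to be.

```python
import random
from collections import OrderedDict

def hit_rate(coords, tile=4, cache_tiles=8):
    """Fraction of reads served by a toy LRU cache of tile x tile blocks.

    tile and cache_tiles are illustrative assumptions, not real hardware values.
    """
    cache = OrderedDict()  # tile id -> None, least recently used first
    hits = 0
    for x, y in coords:
        key = (x // tile, y // tile)       # which tile this texel lives in
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = None              # miss: load the tile
            if len(cache) > cache_tiles:
                cache.popitem(last=False)  # evict the least recently used tile
    return hits / len(coords)

WIDTH = HEIGHT = 1024  # hypothetical image size

# 2D-coherent pattern: visit a 64x64 region one 4x4 tile at a time.
coherent = [(tx + dx, ty + dy)
            for ty in range(0, 64, 4)
            for tx in range(0, 64, 4)
            for dy in range(4)
            for dx in range(4)]

# Scattered pattern: the same number of reads at random positions.
rng = random.Random(0)
scattered = [(rng.randrange(WIDTH), rng.randrange(HEIGHT))
             for _ in range(len(coherent))]

print(hit_rate(coherent))   # 0.9375: only the first read of each tile misses
print(hit_rate(scattered))  # close to zero: almost every read misses
```

In this model the coherent walk hits on 15 of every 16 reads, while the scattered reads almost never find their tile in the cache, which is the situation a large randomly indexed look-up table would be in.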