Hi NURBS,
Here are a few general suggestions I can give:
1. Do not do it on the GPU unless the data is already on the GPU because of some previous processing, or is required there afterwards for more processing.
2. If you are still interested, then try to make use of the caches. You have L1 and L2 caches. If the access pattern is tile-based, try using images to benefit from the 2-D L2 cache. Otherwise, stick to sequential access patterns.
3. Make sure that consecutive work-items, especially those inside the same wavefront, access consecutive memory locations. Also try to write the code so that each wavefront uses only a single channel for its global accesses, so that many wavefronts can run together.
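To make tip 3 a bit more concrete, here is a rough host-side sketch in plain C (not a kernel) that counts how many distinct memory segments one wavefront touches for a given access stride. The wavefront size of 64, the 64-byte segment size, and the function name segments_touched are my own assumptions for illustration; check the numbers for your actual hardware.

```c
#include <stddef.h>

#define WAVEFRONT_SIZE 64  /* assumption: 64 work-items per wavefront */
#define SEGMENT_BYTES  64  /* assumption: 64-byte memory segment */

/* Hypothetical helper: counts the distinct SEGMENT_BYTES-sized segments
   touched when work-item `lane` of one wavefront reads element
   (lane * stride_elems) of a float array that starts at address 0. */
static int segments_touched(size_t stride_elems)
{
    size_t seen[WAVEFRONT_SIZE];
    int count = 0;

    for (int lane = 0; lane < WAVEFRONT_SIZE; ++lane) {
        size_t byte_addr = (size_t)lane * stride_elems * sizeof(float);
        size_t seg = byte_addr / SEGMENT_BYTES;
        int found = 0;

        /* Linear search is fine here: at most 64 entries. */
        for (int i = 0; i < count; ++i) {
            if (seen[i] == seg) {
                found = 1;
                break;
            }
        }
        if (!found)
            seen[count++] = seg;
    }
    return count;
}
```

With 4-byte floats and the assumptions above, segments_touched(1) gives 4 while segments_touched(16) gives 64, i.e. the strided pattern touches 16 times as many segments for the same amount of useful data. This is only a simplified model of the idea, not a description of how any particular memory controller works.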
I hope these tips help, but the final decision is in your hands and depends on your problem.