It could be something like:
- load each thread's own cell from global memory into local memory, using all threads in the workgroup.
- load the safety padding (halo cells) from global memory into local memory using the surface threads (for a 2D thread group, the 4 1D edges), so the stencil doesn't go out of bounds on the tile's edges.
- synchronize the threads with a barrier on local memory.
- do whatever your kernel does, but read from local memory this time instead of global memory.
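As a rough sketch of those steps, here is a plain C emulation of one workgroup on the CPU (it is not an actual OpenCL kernel). The 8x8 grid, the 4x4 tile, the 5-point sum stencil, the clamped halo loads and all function names are assumptions for illustration. In a real kernel the tile would be a `__local` array, the loops over `lx`/`ly` would be the work-items (`get_local_id`), and the gap between the two loops would be `barrier(CLK_LOCAL_MEM_FENCE)`.

```c
#define W 8      /* global grid width (assumed) */
#define H 8      /* global grid height (assumed) */
#define TILE 4   /* workgroup is TILE x TILE work-items (assumed) */
#define PAD 1    /* padding width needed by a 5-point stencil */

/* Clamp v into [lo, hi] so padding loads never leave the grid. */
static int clampi(int v, int lo, int hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Read the global grid with clamped coordinates (one possible edge policy). */
static float load_global(const float *g, int x, int y) {
    return g[clampi(y, 0, H - 1) * W + clampi(x, 0, W - 1)];
}

/* Emulates the workgroup whose first work-item sits at global (gx0, gy0). */
static void run_workgroup(const float *in, float *out, int gx0, int gy0) {
    float tile[TILE + 2 * PAD][TILE + 2 * PAD]; /* stands in for __local */

    /* Phase 1: every work-item loads its own cell; surface work-items
       also load the padding cell next to them (4 edges, no corners). */
    for (int ly = 0; ly < TILE; ly++) {
        for (int lx = 0; lx < TILE; lx++) {
            int gx = gx0 + lx, gy = gy0 + ly;
            tile[ly + PAD][lx + PAD] = load_global(in, gx, gy);
            if (lx == 0)        tile[ly + PAD][0]          = load_global(in, gx - 1, gy);
            if (lx == TILE - 1) tile[ly + PAD][TILE + PAD] = load_global(in, gx + 1, gy);
            if (ly == 0)        tile[0][lx + PAD]          = load_global(in, gx, gy - 1);
            if (ly == TILE - 1) tile[TILE + PAD][lx + PAD] = load_global(in, gx, gy + 1);
        }
    }

    /* A real kernel needs barrier(CLK_LOCAL_MEM_FENCE) here. */

    /* Phase 2: the stencil reads only from the tile, never from global memory. */
    for (int ly = 0; ly < TILE; ly++) {
        for (int lx = 0; lx < TILE; lx++) {
            int x = lx + PAD, y = ly + PAD;
            out[(gy0 + ly) * W + (gx0 + lx)] =
                tile[y][x] + tile[y][x - 1] + tile[y][x + 1] +
                tile[y - 1][x] + tile[y + 1][x];
        }
    }
}

/* Runs every workgroup and counts cells that differ from a direct
   global-memory version of the same stencil. */
int count_mismatches(void) {
    float in[W * H], tiled[W * H];
    for (int i = 0; i < W * H; i++) in[i] = (float)i;

    for (int gy0 = 0; gy0 < H; gy0 += TILE)
        for (int gx0 = 0; gx0 < W; gx0 += TILE)
            run_workgroup(in, tiled, gx0, gy0);

    int bad = 0;
    for (int y = 0; y < H; y++) {
        for (int x = 0; x < W; x++) {
            float ref = load_global(in, x, y) + load_global(in, x - 1, y) +
                        load_global(in, x + 1, y) + load_global(in, x, y - 1) +
                        load_global(in, x, y + 1);
            if (ref != tiled[y * W + x]) bad++;
        }
    }
    return bad;
}
```

Calling `count_mismatches()` from a `main` compares the tiled result against the direct one. The corners of the tile are never written, which is fine here only because a 5-point stencil never reads them; a 9-point stencil would need them loaded too.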
What you need here is:
- each thread's local thread id
- a predefined (statically sized) local array
- a barrier (on local memory)
- an access pattern that causes few conflicts on the local memory banks
A simple additional offset should be added when accessing local memory, to skip over the padding.
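For a flattened local array, that offset could look something like the helper below. This is only a sketch: the 16x16 tile, the padding width of 1 and the name `local_index` are assumptions, and `lx`/`ly` stand for the local thread ids.

```c
#include <stddef.h>

#define TILE_W 16                       /* workgroup width (assumed) */
#define TILE_H 16                       /* workgroup height (assumed) */
#define PAD 1                           /* padding cells on each side */
#define LOCAL_STRIDE (TILE_W + 2 * PAD) /* row length of the padded tile */

/* Maps a local thread id to its slot in the padded local array.
   lx or ly may be -1 or TILE_W / TILE_H to address the padding itself. */
static size_t local_index(int lx, int ly) {
    return (size_t)(ly + PAD) * LOCAL_STRIDE + (size_t)(lx + PAD);
}
```

With these values the thread at local id (0, 0) lands on slot 19 rather than slot 0, because one full padded row and one padding cell come before it.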