I am currently in a situation where I practically have to choose between parallelizing more or using local memory. I am passing in a couple of vectors that are several orders of magnitude larger than the optimal work group size. I want to perform one set of calculations per element in each vector, then repeat for many iterations (so many that the kernel launch overhead would be a problem if I re-ran the kernel between each iteration). The problem with using several work groups is that the calculations overlap between groups, so I would need to synchronize across them. The ideal would be to have everything in a single work group, but the data is larger than what fits in local memory.
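For concreteness, a minimal sketch of the kind of kernel I mean (the stencil computation, names, and coefficients are just placeholders). Each element's update reads its neighbours, which is where the overlap between work groups comes from; with multiple work groups the only global synchronization point I know of is a new kernel launch:

```c
// OpenCL C sketch, illustrative only. One work-item per element; each
// update reads neighbouring elements, so each iteration currently needs
// a separate kernel launch to synchronize across work groups.
__kernel void step(__global const float *in,
                   __global float *out,
                   const int n)
{
    int i = get_global_id(0);
    if (i >= n) return;

    // Placeholder stencil: the dependency on in[i-1] and in[i+1]
    // is the cross-work-group overlap described above.
    float left  = (i > 0)     ? in[i - 1] : in[i];
    float right = (i < n - 1) ? in[i + 1] : in[i];
    out[i] = 0.5f * in[i] + 0.25f * (left + right);
}
```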
1. Is it even worth staging data into local memory before using it, if each element is only read a few times?
2. Is it worth having "full" parallelization if that means working from global memory all the time, versus much less parallelism but being able to use local memory?
3. Is it possible for data stored in local memory to "live on" there between kernel launches, or is local memory tied to each individual run of the kernel?
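To clarify what I mean in question 1, the staging pattern I have in mind is roughly the following (the tile size and the halo handling are illustrative assumptions). Note that the `barrier` here only synchronizes within one work group, which is why this doesn't solve the cross-group problem by itself:

```c
#define TILE 256  // assumed work-group size, illustrative

__kernel void step_local(__global const float *in,
                         __global float *out,
                         const int n)
{
    __local float tile[TILE + 2];          // +2 for halo elements
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    // Stage this work-group's slice (plus one halo element on each
    // side) from global into local memory.
    tile[lid + 1] = (gid < n) ? in[gid] : 0.0f;
    if (lid == 0)
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (lid == get_local_size(0) - 1)
        tile[TILE + 1] = (gid < n - 1) ? in[gid + 1] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);          // intra-group sync only

    // Same placeholder stencil, now reading from local memory.
    if (gid < n)
        out[gid] = 0.5f * tile[lid + 1]
                 + 0.25f * (tile[lid] + tile[lid + 2]);
}
```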