I need some help with ideas on how to efficiently parallelize an algorithm and synchronize threads. The algorithm I want to parallelize is based on Metropolis/Markov Chain Monte Carlo. I am quite new to OpenCL programing and have up to this point only done "naive" parallelizations where each thread runs the kernel independently without synchronization etc.
The problem consist of an array of vectors arranged according this image: http://img204.imageshack.us/img204/7905/problemlayout.png
N (points per vector) is typically in the interval 8 - 2048
E (ensemlbes) is varying, but typically 10^2, 10^3 or something similair
D (vectors per ensemble) is typically in the interval 1-3, but might be increased to > 60 in the future
I want to perform an algorithm on each node in every vector, typically10^3 iterations per point. Each iteration of a point is a permanent (random) modification. The problem with parallelizing is then:
1. Each iteration is dependent on neighbouring points in the same vector, and the same index in the other vectors in the same ensemble. Image: http://img137.imageshack.us/img137/6598/dependence.png
2. The problem requires me to do only one iteration in succession on the same point, and then move on to other points.
My idea is that each thread is assigned to one index between 1 and N in an ensemble, ordered so that the threads are spread out in every second index. The thread then loops through that index of every vector in its ensemble. This is illustrated in this image: http://img863.imageshack.us/img863/3449/loop.png
My question are:
1. Should I try to have one work group per ensemble and synchronize between threads, or should I have multiple work groups per ensemble and be forced to synchronize between work groups?
2. How do I get a thread to change to an "available" node in a good way? By available, I mean a node that has no neighbours that other threads are working on.
If anyone have better ideas, they are more than welcome to present them.