7 Replies Latest reply on Jun 22, 2011 1:43 PM by maximmoroz

    Help with effective parallelization/synchronization

    omgi

      I need some help with ideas on how to efficiently parallelize an algorithm and synchronize threads. The algorithm I want to parallelize is based on Metropolis/Markov Chain Monte Carlo. I am quite new to OpenCL programing and have up to this point only done "naive" parallelizations where each thread runs the kernel independently without synchronization etc.

      The problem consist of an array of vectors arranged according this image: http://img204.imageshack.us/img204/7905/problemlayout.png
      N (points per vector) is typically in the interval 8 - 2048
      E (ensemlbes) is varying, but typically 10^2, 10^3 or something similair
      D (vectors per ensemble) is typically in the interval 1-3, but might be increased to > 60 in the future

      I want to perform an algorithm on each node in every vector, typically10^3 iterations per point. Each iteration of a point is a permanent (random) modification. The problem with parallelizing is then:
      1. Each iteration is dependent on neighbouring points in the same vector, and the same index in the other vectors in the same ensemble. Image: http://img137.imageshack.us/img137/6598/dependence.png
      2. The problem requires me to do only one iteration in succession on the same point, and then move on to other points.

      My idea is that each thread is assigned to one index between 1 and N in an ensemble, ordered so that the threads are spread out in every second index. The thread then loops through that index of every vector in its ensemble. This is illustrated in this image: http://img863.imageshack.us/img863/3449/loop.png

      My question are:
      1. Should I try to have one work group per ensemble and synchronize between threads, or should I have multiple work groups per ensemble and be forced to synchronize between work groups?
      2. How do I get a thread to change to an "available" node in a good way? By available, I mean a node that has no neighbours that other threads are working on.


      If anyone have better ideas, they are more than welcome to present them.

        • Help with effective parallelization/synchronization
          maximmoroz

          Hello omgi,

          I have a question:

          > Each iteration is dependent on neighbouring points in the same vector, and the same index in the other vectors in the same ensemble

          Could you be more specific here, please? Does the (k+1)-iteration value of the point depend on k-iteration values of itself and points within the same ensemble according to http://img137.imageshack.us/img137/6598/dependence.png ? Or it depends on (k+1)-iteration values of some near-adjusted points and on k-iteration values of others?

            • Help with effective parallelization/synchronization
              omgi

              Hi maximmoroz,

              I'm sorry for the confusing description. The algorithm is run multiple times for every point in every vector. Each run modifies a point randomly, so the next time we get to the same point will be different. In the algorithm, the value of the current point is modified based on the value of the neighbouring points. That is, when we are at point 100 in vector 1 in an ensemble, the algorithm is dependent on point 99 and 101 in vector 1, as well as dependent of point 100 in vectors 2, 3, 4 etc. If you want a physical interpretation, you can imagine that each vector in D corresponds to a particle, and each point in the vector is a spatial coordinate at a certain time. We want to calculate the energy of the particle along this path. The energy is based on the kinetic energy (derivative of neighbouring points for the same particle) + the potential energy (dependent of the other particles location at the same time).

               

              I dont know if some code might make it less confusing... :

              // N: Points per vector // D: Vectors per ensemble // E: Total amount of ensembles // Kernel input: __global float vectors[N*D*E]; unsigned int e = 10; // Choose the 10th ensemble __local float local_vectors[N*D]; // All the vector points in an ensemble // Save vectors to local for(int i = 0; i < (N*D); i++) { local_vector[i] = vectors[N*D*(e-1) + i]; } unsigned int current_point; unsigned int left_point; unsigned int right_point; float new_value; float old_value; for(int d = 0; d < D; d++) // Loop through vectors { for(int n = 0; n < N; n++) // Loop through points in vectors { current_point = d*N + n; // Current point in the current vector left_point = current_point - 1; right_point = current_point + 1; // Pretend that we have code here that sets right_point = 0 if current_point = (N-1), etc... old_value = local_vectors[current_point]; new_value = old_value + randomFloat(); // Pretend that such a function exists new_value += local_vectors[left_point] + local_vectors[right_point]; new_value += /* Values of same point in other vectors */ // UPDATE MODIFICATION! local_vectors[current_point] = new_value; } }

                • Help with effective parallelization/synchronization
                  maximmoroz

                  > I'm sorry for the confusing description.

                  omgi, the description is fine.

                  > The algorithm is run multiple times for every point in every vector.

                  I got it.

                  > Each run modifies a point randomly, so the next time we get to the same point will be different

                  It is clear. My question was: Is it Ok to base (k+1) modification of the single point on previous state (that is k) of some adjustant points? If the answer is yes then the most straightforward and most likely the most efficient way to organize calculations is to have 2 copies of buffers: one with result of the previous iteration (used in read-only mode by the kernel), the other - with result of the current operation (used in write-only mode by the kernel). Then you initialize the 1st buffer with data setArg for the kernel accordingly and enqeue the 1st iteration, then swap arguments for the same kernel and enqeue the 2nd iteration, then swap arguments again and enqeue the 3rd iteration and so on.

                  Thus you are free to organize the kernel the most efficient way (minimizing global memory reads by using local memory).