
Help with effective parallelization/synchronization
maximmoroz Jun 19, 2011 7:44 AM (in response to omgi)
Hello omgi,
I have a question:
> Each iteration is dependent on neighbouring points in the same vector, and the same index in the other vectors in the same ensemble
Could you be more specific here, please? Does the (k+1)-iteration value of a point depend on the k-iteration values of itself and of points within the same ensemble, as in http://img137.imageshack.us/img137/6598/dependence.png ? Or does it depend on the (k+1)-iteration values of some nearby points and on the k-iteration values of others?

omgi Jun 19, 2011 9:32 AM (in response to maximmoroz)
Hi maximmoroz,
I'm sorry for the confusing description. The algorithm is run multiple times for every point in every vector. Each run modifies a point randomly, so the point will be different the next time we get to it. In the algorithm, the value of the current point is modified based on the values of the neighbouring points. That is, when we are at point 100 in vector 1 in an ensemble, the algorithm depends on points 99 and 101 in vector 1, as well as on point 100 in vectors 2, 3, 4 etc. If you want a physical interpretation, you can imagine that each vector in D corresponds to a particle, and each point in the vector is a spatial coordinate at a certain time. We want to calculate the energy of the particle along this path. The energy is the kinetic energy (derivative over neighbouring points for the same particle) + the potential energy (dependent on the other particles' locations at the same time).
I don't know if some code might make it less confusing...:

// N: Points per vector
// D: Vectors per ensemble
// E: Total amount of ensembles
// Kernel input: __global float vectors[N*D*E];

unsigned int e = 10; // Choose the 10th ensemble
__local float local_vectors[N*D]; // All the vector points in an ensemble

// Save vectors to local
for(int i = 0; i < (N*D); i++)
{
    local_vectors[i] = vectors[N*D*(e-1) + i];
}

unsigned int current_point;
unsigned int left_point;
unsigned int right_point;
float new_value;
float old_value;

for(int d = 0; d < D; d++) // Loop through vectors
{
    for(int n = 0; n < N; n++) // Loop through points in vectors
    {
        current_point = d*N + n; // Current point in the current vector
        left_point = current_point - 1;
        right_point = current_point + 1;
        // Pretend that we have code here that wraps the indices,
        // e.g. sets right_point = 0 if current_point == (N-1), etc...

        old_value = local_vectors[current_point];
        new_value = old_value + randomFloat(); // Pretend that such a function exists
        new_value += local_vectors[left_point] + local_vectors[right_point];
        new_value += /* Values of same point in other vectors */;

        // UPDATE MODIFICATION!
        local_vectors[current_point] = new_value;
    }
}

maximmoroz Jun 19, 2011 10:07 AM (in response to omgi)
> I'm sorry for the confusing description.
omgi, the description is fine.
> The algorithm is run multiple times for every point in every vector.
I got it.
> Each run modifies a point randomly, so the next time we get to the same point will be different
It is clear. My question was: is it OK to base the (k+1) modification of a single point on the previous state (that is, k) of some adjacent points? If the answer is yes, then the most straightforward and most likely the most efficient way to organize the calculations is to have 2 copies of the buffers: one with the result of the previous iteration (used in read-only mode by the kernel), the other with the result of the current iteration (used in write-only mode by the kernel). Then you initialize the 1st buffer with data, set the arguments for the kernel accordingly (setArg) and enqueue the 1st iteration, then swap the arguments for the same kernel and enqueue the 2nd iteration, then swap the arguments again and enqueue the 3rd iteration, and so on.
Thus you are free to organize the kernel the most efficient way (minimizing global memory reads by using local memory).

omgi Jun 22, 2011 5:11 AM (in response to maximmoroz)
Hm, so what you are suggesting is basically: instead of doing a loop through work items in the kernel, I do a loop on the host and enqueue the kernel with different arguments every run (through setArg)? Have I understood it correctly? If so, won't I have the kernel launch overhead for every call?

maximmoroz Jun 22, 2011 9:58 AM (in response to omgi)
My idea is that you will have a single enqueueNDRange call for each iteration ("I want to perform an algorithm on each node in every vector, typically 10^3 iterations per point"). Thus you will have 10^3 enqueueNDRange calls. Not a big deal, from my point of view.

omgi Jun 22, 2011 12:43 PM (in response to maximmoroz)
Hm, the actual number of enqueueNDRange calls will be several orders of magnitude larger, so it might be an issue, but I will test it. Thank you for the help!
Is there any estimate of how much faster access to read_only/write_only memory is, compared with normal local memory?

maximmoroz Jun 22, 2011 1:43 PM (in response to omgi)
Yep, test it. The overhead for running 10^3 kernels might be just several milliseconds. Of course, you'd better enqueue all 10^3 iterations and flush the command queue afterwards.
