Hello.
I have a small problem with the accuracy of calculations. The kernel (code below) returns different results for the same input data. The difference is minimal.
For example (the program's output):
GRMS: 70.764435 (first run), 70.583441 (second run), 70.563168, ...
The results are stored in dx[N], dy[N], dz[N] and R[N]. GRMS is the average value of dx[N], dy[N], dz[N].
X[N], Y[N] and Z[N] are the atom coordinates and C[N] holds the atom charges (input data).
Does anyone have an idea what could be wrong? Why is there such a small difference between runs?
__kernel void vector_add(__global float *X, __global float *Y, __global float *Z,
                         __global float *C, __global float *R, const int limit,
                         __global float *dx, __global float *dy, __global float *dz)
{
    int i = get_global_id(0);
    int n = 0;
    float distance = 0.0;
    float e, tx, ty, tz, stx, sty, stz;
    float cutoff = 10.0;
    float cutx, f, cg, tf, rr;

    R[i] = 0;
    dx[i] = 0; dy[i] = 0; dz[i] = 0;
    dx[i+1] = 0; dy[i+1] = 0; dz[i+1] = 0;
    cutx = cutoff*cutoff;

    for (n = i+1; n < limit; n++) {
        stx = 0.0, sty = 0.0, stz = 0.0;
        distance = sqrt((pow(X[i]-X[n],2) + pow(Y[i]-Y[n],2) + pow(Z[i]-Z[n],2)));
        if (distance <= cutoff) {
            rr = distance*distance;
            f = 1.0 - rr/cutx;
            cg = (C[i]*C[n])/distance;
            e = cg*f*f;
            R[i] += e;
            tf = -(e/rr) - (4.0*cg*f)/cutx;
            tx = tf*(X[i]-X[n]);
            ty = tf*(Y[i]-Y[n]);
            tz = tf*(Z[i]-Z[n]);
            dx[n] = dx[n] + tx;
            dy[n] = dy[n] + ty;
            dz[n] = dz[n] + tz;
            stx = stx + tx;
            sty = sty + ty;
            stz = stz + tz;
        }
        dx[i] = dx[i] - stx;
        dy[i] = dy[i] - sty;
        dz[i] = dz[i] - stz;
    }
}
Why do you write dx[i+1] = 0 ...?
It is a design problem.
The work-item with global id i reads and writes buffers dx, dy and dz at indexes [i, limit] without any synchronization, so several work-items update the same elements concurrently. It is a complete mess.
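To make the overlap concrete, here is a small host-side sketch in plain C (not OpenCL). It only counts how many work-items may touch each output index, taking the [i, limit] range stated above as given. The function name count_writers is made up for this illustration, and the loop variable i stands in for get_global_id(0).

```c
#include <stddef.h>

/* Hypothetical helper: for each output index k, count how many work-items
 * may write to it, assuming work-item i touches every index in [i, limit).
 * This is an upper bound, since writes inside the cutoff test are conditional. */
static void count_writers(int limit, int *writers)
{
    for (int k = 0; k < limit; k++)
        writers[k] = 0;

    for (int i = 0; i < limit; i++)        /* i stands in for get_global_id(0) */
        for (int k = i; k < limit; k++)    /* indices touched by work-item i */
            writers[k]++;
}
```

Under that assumption index k can be written by k + 1 different work-items, so only index 0 has a single writer. Every other element is a potential race, and the result depends on the order in which the work-items happen to run.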
Thanks. I tried barriers (CLK_GLOBAL_MEM_FENCE), but that does not help. Do you have any advice on what I can use besides the variable "i" to synchronize the implementation?
You cannot synchronize between work groups. Synchronization only happens among work items within a work group. You need to redesign your algorithm so that it doesn't need synchronization and so that work items can execute in any order and you still get correct results.
Parallelize the algorithm. Make each work-item independent of all the others.