Hello.

I am trying to adapt the source code of an NBody example to fit my needs (forces, energy and distances between atoms). X, Y and Z are the coordinates, C is the charge, R (and also dx, dy and dz) hold the results, and "limit" is the number of atoms. Coordinates and results are stored in 1D arrays.

The following kernel code works correctly on the CPU (it uses all 8 cores), but somehow it doesn't work correctly on the GPU. The results are either large numbers or "-nan".

For example:

GPU (incorrect):

dx=-17935.105469, dy=4237.721680, dz=-23556.548828

dx=-nan, dy=-nan, dz=-nan

CPU (correct):

dx=1.314520, dy=-4.063989, dz=1.077236

dx=-1.314520, dy=4.063989, dz=-1.077236

The only difference between the CPU and GPU runs is that I change the device type passed to "clGetDeviceIDs" from "CL_DEVICE_TYPE_CPU" to "CL_DEVICE_TYPE_GPU".

Does anyone have any idea why this code doesn't work on the GPU? I have a Radeon HD6850 1GB GPU, an Intel i7 2600K and Gentoo Linux.

Here is my kernel code:

__kernel void vector_add(__global float *X, __global float *Y, __global float *Z,
                         __global float *C, __global float *R, const int limit,
                         __global float *dx, __global float *dy, __global float *dz)
{
    int i = get_global_id(0);
    int n;
    float distance;
    float e, tx, ty, tz;
    float stx=0, sty=0, stz=0;
    float cutoff = 2.0;
    float cutx, f, cg, tf, grms=0.0;

    cutx = cutoff*cutoff;

    for(n=i+1; n<limit; n++){
        distance=(pow(X[n]-X[i],2)+pow(Y[n]-Y[i],2)+pow(Z[n]-Z[i],2));
        if (distance <= cutoff){
            f=(1.0-distance)/cutx;
            cg=(C[i]*C[n])/sqrt(distance);
            e= cg*f*f;
            R[i]=distance;
            tf = -e/(distance-4.0*cg*f/cutx);
            tx = tf*(X[n]-X[i]);
            ty = tf*(Y[n]-Y[i]);
            tz = tf*(Z[n]-Z[i]);
            dx[n] = dx[n] - tx;
            dy[n] = dy[n] - ty;
            dz[n] = dz[n] - tz;
            stx = stx + tx;
            sty = sty + ty;
            stz = stz + tz;
        }
        dx[i] = dx[i] + stx;
        dy[i] = dy[i] + sty;
        dz[i] = dz[i] + stz;
        grms = grms + pow(dx[i],2)+pow(dy[i],2)+pow(dz[i],2);
    }
}