3 Replies Latest reply on Aug 3, 2011 5:40 AM by MatAle

    NBody GPU problem




      I am trying to adapt the source code of the NBody problem to fit my needs (forces, energy and distances between atoms). X, Y and Z are the coordinates, C is the charge, R (along with dx, dy and dz) holds the results, and "limit" is the number of atoms. Coordinates and results are stored in 1D arrays.

      The following kernel code works correctly on the CPU (it uses all 8 cores), but it does not work correctly on the GPU: the results are either large numbers or "-nan".

      For example:

      GPU (incorrect):

      dx=-17935.105469, dy=4237.721680, dz=-23556.548828

      dx=-nan, dy=-nan, dz=-nan

      CPU (correct):

      dx=1.314520, dy=-4.063989, dz=1.077236

      dx=-1.314520, dy=4.063989, dz=-1.077236


      The only difference between the CPU and GPU runs is that I change the device type passed to "clGetDeviceIDs" from "CL_DEVICE_TYPE_CPU" to "CL_DEVICE_TYPE_GPU".

      Does anyone have any idea why this code doesn't work on the GPU? I have a Radeon HD6850 1GB GPU, an Intel i7 2600K and Gentoo Linux.

      Here is my kernel code:


      __kernel void vector_add(__global float *X, __global float *Y, __global float *Z,
                               __global float *C, __global float *R, const int limit,
                               __global float *dx, __global float *dy, __global float *dz)
      {
          int i = get_global_id(0);
          int n;
          float distance;
          float e, tx, ty, tz;
          float stx=0, sty=0, stz=0;
          float cutoff = 2.0;
          float cutx, f, cg, tf, grms=0.0;

          cutx = cutoff*cutoff;

          for(n=i+1; n<limit; n++){
              distance=(pow(X[n]-X[i],2)+pow(Y[n]-Y[i],2)+pow(Z[n]-Z[i],2));
              if (distance <= cutoff){
                  f=(1.0-distance)/cutx;
                  cg=(C[i]*C[n])/sqrt(distance);
                  e= cg*f*f;
                  R[i]=distance;
                  tf = -e/(distance-4.0*cg*f/cutx);
                  tx = tf*(X[n]-X[i]);
                  ty = tf*(Y[n]-Y[i]);
                  tz = tf*(Z[n]-Z[i]);
                  dx[n] = dx[n] - tx;
                  dy[n] = dy[n] - ty;
                  dz[n] = dz[n] - tz;
                  stx = stx + tx;
                  sty = sty + ty;
                  stz = stz + tz;
              }
              dx[i] = dx[i] + stx;
              dy[i] = dy[i] + sty;
              dz[i] = dz[i] + stz;
              grms = grms + pow(dx[i],2)+pow(dy[i],2)+pow(dz[i],2);
          }
      }

        • NBody GPU problem

          The numbers look uninitialized. Did you define the memory buffers (in/out/in-out) according to their use? Do you copy all the buffers you need to the GPU and back?
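          To illustrate what "initialized" means for this kernel: it only ever adds to or subtracts from dx, dy and dz, so those arrays need to hold zeros before the kernel runs. The following is a minimal host-side sketch in plain C, not the poster's actual host code. The helper name `alloc_zeroed` is hypothetical. In an OpenCL host program, the zeroed array would then still have to be copied to the device buffer (for example with `clEnqueueWriteBuffer`) before the kernel is launched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper (not from the original post): allocate an accumulator
 * array of `limit` floats and set every element to 0.0f explicitly, so that
 * later "dx[i] = dx[i] + ..." updates start from a defined value. */
static float *alloc_zeroed(size_t limit)
{
    assert(limit > 0);
    float *a = malloc(limit * sizeof *a);
    if (a == NULL)
        return NULL;          /* caller must check for allocation failure */
    for (size_t i = 0; i < limit; i++)
        a[i] = 0.0f;
    return a;
}
```

          A caller would create dx, dy and dz this way (one call each), upload them, run the kernel, read them back, and finally release them with `free`.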

            • NBody GPU problem


              Thank you for your reply. 

              I have already found the solution: the 1D "dx", "dy" and "dz" arrays were not initialized correctly. Now the code works on both the GPU and the CPU. But I have a new problem: the GPU is no faster than the CPU (and is sometimes slower). I don't think the problem is in transferring the arrays to the GPU and back, but rather in the "distribution" of threads. At the moment I am a bit lost as to how to optimize the code so that it runs faster on the GPU.

              Does anyone have an idea?
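              One thing that may be worth checking, offered as an assumption rather than something confirmed in this thread: in the posted kernel, work-item i also updates dx[n], dy[n] and dz[n] for every n > i, so several work-items can write to the same element at once. The work is also uneven: work-item 0 loops over limit-1 atoms and the last work-item over none. A common alternative is a "gather" layout, in which each work-item loops over all other atoms and writes only to its own element. The sketch below emulates that in plain C, with one function call standing in for one work-item. The names `gather_one` and `gather_all` are hypothetical, and the pair factor `C[i]*C[n]/d2` is a placeholder, not the formula from the post; the poster's expression for `tf` would go in its place.

```c
#include <assert.h>
#include <stddef.h>

/* Emulates one work-item; `i` plays the role of get_global_id(0).
 * Hypothetical sketch: reads all atoms, writes only element i. */
static void gather_one(size_t i, size_t limit,
                       const float *X, const float *Y, const float *Z,
                       const float *C,
                       float *dx, float *dy, float *dz)
{
    const float cutoff = 2.0f;
    float sx = 0.0f, sy = 0.0f, sz = 0.0f;

    assert(i < limit);
    for (size_t n = 0; n < limit; n++) {
        if (n == i)
            continue;                       /* skip self-interaction */
        float rx = X[n] - X[i];
        float ry = Y[n] - Y[i];
        float rz = Z[n] - Z[i];
        float d2 = rx * rx + ry * ry + rz * rz;   /* squared distance */
        if (d2 <= cutoff) {                 /* same test as the posted kernel */
            float tf = C[i] * C[n] / d2;    /* PLACEHOLDER pair factor */
            sx += tf * rx;
            sy += tf * ry;
            sz += tf * rz;
        }
    }
    /* Single write per output, and only to this work-item's own element,
     * so no two work-items ever touch the same location. */
    dx[i] = sx;
    dy[i] = sy;
    dz[i] = sz;
}

/* Serial stand-in for launching `limit` work-items. */
static void gather_all(size_t limit,
                       const float *X, const float *Y, const float *Z,
                       const float *C,
                       float *dx, float *dy, float *dz)
{
    for (size_t i = 0; i < limit; i++)
        gather_one(i, limit, X, Y, Z, C, dx, dy, dz);
}
```

              The trade-off is that every pair is evaluated twice, once from each side. In exchange, every work-item does the same amount of work, and the outputs are assigned rather than accumulated, so they no longer depend on the arrays having been zeroed first. Whether this is actually faster on a given GPU would need to be measured.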