6 Replies Latest reply on Jan 26, 2010 7:06 AM by genaganna

    Kernel does not run when outside profiler

    ljbade

      I am trying to port sailfish (http://gitorious.org/sailfish) to Windows (http://gitorious.org/~ljbade/sailfish/ljbades-sailfish

      But when I run the example scripts I get only black pixel output, and the simulation frame counter goes crazy fast.

      It seems the kernel is not being run, despite the enqueued kerned and the enqueued read completing without generating any errors.

      But if I run the script via the profiler (clprofile.exe), it works fine!

      I am running this on the CPU (Phenom II 720 BE).

      Is there any sort of debug log I can look at?

      Any ideas what the issue might be?

        • Kernel does not run when outside profiler
          genaganna

           

          Originally posted by: ljbade I am trying to port sailfish (http://gitorious.org/sailfish) to Windows (http://gitorious.org/~ljbade/sailfish/ljbades-sailfish

           

          But when I run the example scripts I get only black pixel output, and the simulation frame counter goes crazy fast.

           

          It seems the kernel is not being run, despite the enqueued kerned and the enqueued read completing without generating any errors.

           

          But if I run the script via the profiler (clprofile.exe), it works fine!

           

          I am running this on the CPU (Phenom II 720 BE).

           

          Is there any sort of debug log I can look at?

           

          Any ideas what the issue might be?

           

          Ljbade,

                   clprofile is vaild only for GPU.  Are you sure that runs on CPU?

            • Kernel does not run when outside profiler
              ljbade

              Yes it is running on the CPU (all 4 cores go to 90% load).

              And considering I don't even have a OpenCL compatible GPU, it is impossible for it to use my GPU.

              I have also modified pyopencl to use the pfn_notfy on clCreateContext, and that never gets called either...

                • Kernel does not run when outside profiler
                  genaganna

                   

                  Originally posted by: ljbade Yes it is running on the CPU (all 4 cores go to 90% load).

                   

                  And considering I don't even have a OpenCL compatible GPU, it is impossible for it to use my GPU.

                   

                  I have also modified pyopencl to use the pfn_notfy on clCreateContext, and that never gets called either...

                   

                  It is very difficult for me to look .py code. It would be good if you give code in C/C++.

                  What do you mean by running script via clprofile.exe?  Handling error code returned by API is only the way to know what is going wrong.

                    • Kernel does not run when outside profiler
                      ljbade

                      By clprofile.exe, mean the one in C:\Program Files (x86)\ATI Stream\Tools\ATI Stream Profiler 1.0\x86

                       

                      I simply run it like this from command prompt: clprofile.ex python.exe C:\sailfish\examples\lbm_ldc.py

                      That works fine (but slower due to profile overhead), but if I run

                      python.exe C:\sailfish\examples\lbm_ldc.py

                      I get black pixels and the kernel never runs (no load on CPU)

                      The code\command doesn't change, only the clprofile hosting enviroment. So there is something clprofile is doing that causes kernel to run...

                      The API never seems to return an error code either... thats why I was wondering if ATI Stream can generate a debug log.

                      I will try to create some pseudo-C code for you to look at.

                        • Kernel does not run when outside profiler
                          ljbade

                          OK here is the autogenerated opencl code.

                          This project appears to use a template generating system to create custom kernels on the fly depending on the dimensions etc. that can target both cuda and opencl with the same code.

                           

                          #define BLOCK_SIZE 64 #define DIST_SIZE 9216 #define GEO_FLUID 0 #define GEO_BCV 2 #define GEO_BCP 2 #define DT 1.0f __constant float tau = 0.503f; // relaxation time __constant float visc = 0.001f; // viscosity __constant float geo_params[1] = { 0}; // geometry parameters typedef struct Dist { float fC; float fE; float fN; float fW; float fS; float fNE; float fNW; float fSW; float fSE; } Dist; // // Copy the idx-th distribution from din into dout. // inline void getDist(Dist *dout, __global float *din, int idx) { dout->fC = din[idx + DIST_SIZE*0]; dout->fE = din[idx + DIST_SIZE*1]; dout->fN = din[idx + DIST_SIZE*2]; dout->fW = din[idx + DIST_SIZE*3]; dout->fS = din[idx + DIST_SIZE*4]; dout->fNE = din[idx + DIST_SIZE*5]; dout->fNW = din[idx + DIST_SIZE*6]; dout->fSW = din[idx + DIST_SIZE*7]; dout->fSE = din[idx + DIST_SIZE*8]; } inline bool isFluidNode(int type) { return type == 0; } inline bool isWallNode(int type) { return type == 1; } inline bool isFluidOrWallNode(int type) { return type <= 1; } // This assumes we're dealing with a wall node. inline bool isVelocityNode(int type) { return (type >= 2) && (type < GEO_BCP); } inline bool isVelocityOrPressureNode(int type) { return (type >= 2); } inline bool isPressureNode(int type) { return (type >= 2); } inline void decodeNodeType(int nodetype, int *orientation, int *type) { *orientation = nodetype & 7; *type = nodetype >> 3; } inline void bounce_back(Dist *fi) { float t; t = fi->fE; fi->fE = fi->fW; fi->fW = t; t = fi->fN; fi->fN = fi->fS; fi->fS = t; t = fi->fNE; fi->fNE = fi->fSW; fi->fSW = t; t = fi->fNW; fi->fNW = fi->fSE; fi->fSE = t; } inline void compute_macro_quant(Dist *fi, float *rho, float *v) { *rho = fi->fC + fi->fE + fi->fN + fi->fNE + fi->fNW + fi->fS + fi->fSE + fi->fSW + fi->fW; v[0] = (fi->fE + fi->fNE + fi->fSE - fi->fNW - fi->fSW - fi->fW)/(*rho) ; v[1] = (fi->fN + fi->fNE + fi->fNW - fi->fS - fi->fSE - fi->fSW)/(*rho) ; } // // Get macroscopic density rho and velocity v given a distribution fi, and // the node class node_type. // inline void getMacro(Dist *fi, int node_type, int orientation, float *rho, float *v) { if (isFluidOrWallNode(node_type) || orientation == 0) { compute_macro_quant(fi, rho, v); if (isWallNode(node_type)) { v[0] = 0.0f; v[1] = 0.0f; } } else if (isVelocityNode(node_type)) { compute_macro_quant(fi, rho, v); } else if (isPressureNode(node_type)) { compute_macro_quant(fi, rho, v); } if (!isWallNode(node_type)) { v[0] += 0.00000018903591682420f; } } inline void boundaryConditions(Dist *fi, int node_type, int orientation, float *rho, float *v) { if (isWallNode(node_type)) { bounce_back(fi); } } // // A kernel to update the position of tracer particles. // // Each thread updates the position of a single particle using Euler's algorithm. // __kernel void LBMUpdateTracerParticles(__global float *dist, __global int *map, __global float *x, __global float *y ) { float rho, v[2]; int gi = get_global_id(0); float cx = x[gi]; float cy = y[gi]; int ix = (int)(cx); int iy = (int)(cy); // Sanity checks. if (iy < 0) iy = 0; if (ix < 0) ix = 0; if (ix > 191) ix = 191; if (iy > 47) iy = 47; int idx = ix + 192*iy; Dist fc; // getDist(&fc, dist, idx); fc.fC = dist[idx + DIST_SIZE*0]; fc.fE = dist[idx + DIST_SIZE*1]; fc.fN = dist[idx + DIST_SIZE*2]; fc.fW = dist[idx + DIST_SIZE*3]; fc.fS = dist[idx + DIST_SIZE*4]; fc.fNE = dist[idx + DIST_SIZE*5]; fc.fNW = dist[idx + DIST_SIZE*6]; fc.fSW = dist[idx + DIST_SIZE*7]; fc.fSE = dist[idx + DIST_SIZE*8]; int type, orientation; decodeNodeType(map[idx], &orientation, &type); getMacro(&fc, type, orientation, &rho, v); cx = cx + v[0] * DT; cy = cy + v[1] * DT; // Periodic boundary conditions. if (cx > 192) cx = 0.0f; if (cy > 48) cy = 0.0f; if (cx < 0.0f) cx = (float)192; if (cy < 0.0f) cy = (float)48; x[gi] = cx; y[gi] = cy; } // // Performs the relaxation step in the BGK model given the density rho, // the velocity v and the distribution fi. // void BGK_relaxate(float rho, float *v, Dist *fi, int node_type) { Dist feq; #define vx v[0] #define vy v[1] #define vz v[2] feq.fC = 4*rho/9 + 4*rho*(-3*vx*vx/2 - 3*vy*vy/2)/9 ; feq.fE = rho/9 + rho*(3*vx*(1 + vx) - 3*vy*vy/2)/9 ; feq.fN = rho/9 + rho*(3*vy*(1 + vy) - 3*vx*vx/2)/9 ; feq.fW = rho/9 + rho*(-3*vx*(1 - vx) - 3*vy*vy/2)/9 ; feq.fS = rho/9 + rho*(-3*vy*(1 - vy) - 3*vx*vx/2)/9 ; feq.fNE = rho/36 + rho*(3*vx*(1 + vx) + 3*vy*(1 + vy + 3*vx))/36 ; feq.fNW = rho/36 + rho*(-3*vx*(1 - vx) + 3*vy*(1 + vy - 3*vx))/36 ; feq.fSW = rho/36 + rho*(-3*vx*(1 - vx) - 3*vy*(1 - vy - 3*vx))/36 ; feq.fSE = rho/36 + rho*(-3*vy*(1 - vy + 3*vx) + 3*vx*(1 + vx))/36 ; fi->fC += (feq.fC - fi->fC) / tau; fi->fE += (feq.fE - fi->fE) / tau; fi->fN += (feq.fN - fi->fN) / tau; fi->fW += (feq.fW - fi->fW) / tau; fi->fS += (feq.fS - fi->fS) / tau; fi->fNE += (feq.fNE - fi->fNE) / tau; fi->fNW += (feq.fNW - fi->fNW) / tau; fi->fSW += (feq.fSW - fi->fSW) / tau; fi->fSE += (feq.fSE - fi->fSE) / tau; if (!isWallNode(node_type)) { // External acceleration. #define eax 0.00000037807183364839f #define eay 0.00000000000000000000f #define eaz 0.00000000000000000000f float pref = rho * (3.0f - 3.0f/(2.0f * tau)); fi->fC += 4*pref*(-eax*vx - eay*vy)/9 ; fi->fE += pref*(eax*(1 + 2*vx) - eay*vy)/9 ; fi->fN += pref*(eay - eax*vx + 2*eay*vy)/9 ; fi->fW += pref*(-eax - eay*vy + 2*eax*vx)/9 ; fi->fS += pref*(-eay - eax*vx + 2*eay*vy)/9 ; fi->fNE += pref*(eax*(1 + 2*vx + 3*vy) + eay*(1 + 2*vy + 3*vx))/36 ; fi->fNW += pref*(eay*(1 - 3*vx + 2*vy) - eax*(1 - 2*vx + 3*vy))/36 ; fi->fSW += pref*(-eax - eay + vx*(2*eax + 3*eay) + vy*(2*eay + 3*eax))/36 ; fi->fSE += pref*(eax*(1 - 3*vy + 2*vx) - eay*(1 - 2*vy + 3*vx))/36 ; } #undef vx #undef vy #undef vz } __kernel void LBMCollideAndPropagate(__global int *map, __global float *dist_in, __global float *dist_out, __global float *orho, __global float *ovx, __global float *ovy, int save_macro) { int lx = get_local_id(0); // ID inside the current block int gx = get_global_id(0); int gy = get_group_id(1); int gi = gx + 192*gy; // shared variables for in-block propagation __local float prop_fE[BLOCK_SIZE]; __local float prop_fNE[BLOCK_SIZE]; __local float prop_fSE[BLOCK_SIZE]; __local float prop_fW[BLOCK_SIZE]; __local float prop_fNW[BLOCK_SIZE]; __local float prop_fSW[BLOCK_SIZE]; // cache the distributions in local variables Dist fi; getDist(&fi, dist_in, gi); int type, orientation; decodeNodeType(map[gi], &orientation, &type); // macroscopic quantities for the current cell float rho, v[2]; getMacro(&fi, type, orientation, &rho, v); boundaryConditions(&fi, type, orientation, &rho, v); barrier(CLK_LOCAL_MEM_FENCE); // only save the macroscopic quantities if requested to do so if (save_macro == 1) { orho[gi] = rho; ovx[gi] = v[0]; ovy[gi] = v[1]; } if (isFluidNode(type)) { BGK_relaxate(rho, v, &fi, type); } // update the 0-th direction distribution dist_out[gi] = fi.fC; // E propagation in shared memory if (lx < 63) { prop_fE[lx+1] = fi.fE; prop_fNE[lx+1] = fi.fNE; prop_fSE[lx+1] = fi.fSE; // E propagation in global memory (at right block boundary) } else if (gx < 191) { dist_out[gi + 9217] = fi.fE; if (gy < 47) { dist_out[gi + 46273] = fi.fNE; } if (gy > 0) { dist_out[gi + 73537] = fi.fSE; } } // periodic boundary conditions in the X direction else { dist_out[gi + 9025] = fi.fE; if (gy < 47) { dist_out[gi + 46081] = fi.fNE; } if (gy > 0) { dist_out[gi + 73345] = fi.fSE; } } // W propagation in shared memory if (lx > 0) { prop_fW[lx-1] = fi.fW; prop_fNW[lx-1] = fi.fNW; prop_fSW[lx-1] = fi.fSW; // W propagation in global memory (at left block boundary) } else if (gx > 0) { dist_out[gi + 27647] = fi.fW; if (gy < 47) { dist_out[gi + 55487] = fi.fNW; } if (gy > 0) { dist_out[gi + 64319] = fi.fSW; } } // periodic boundary conditions in the X direction else { dist_out[gi + 27839] = fi.fW; if (gy < 47) { dist_out[gi + 55679] = fi.fNW; } if (gy > 0) { dist_out[gi + 64511] = fi.fSW; } } barrier(CLK_LOCAL_MEM_FENCE); // Save locally propagated distributions into global memory. // The leftmost thread is not updated in this block. if (lx > 0) { dist_out[gi + 9216] = prop_fE[lx]; if (gy < 47) { dist_out[gi + 46272] = prop_fNE[lx]; } if (gy > 0) { dist_out[gi + 73536] = prop_fSE[lx]; } } // Propagation in directions orthogonal to the X axis (global memory) if (gy < 47) { dist_out[gi + 18624] = fi.fN; } if (gy > 0) { dist_out[gi + 36672] = fi.fS; } // The rightmost thread is not updated in this block. if (lx < 63) { dist_out[gi + 27648] = prop_fW[lx]; if (gy < 47) { dist_out[gi + 55488] = prop_fNW[lx]; } if (gy > 0) { dist_out[gi + 64320] = prop_fSW[lx]; } } }

                            • Kernel does not run when outside profiler
                              genaganna

                               

                              Originally posted by: ljbade OK here is the autogenerated opencl code.

                               

                              This project appears to use a template generating system to create custom kernels on the fly depending on the dimensions etc. that can target both cuda and opencl with the same code.

                               

                               



                              Just giving kernel code would be help me.  I installed python on mysystem and executed lbm_ldc.py. 

                              I am getting following

                               

                              C:\Documents and Settings\Administrator\Desktop\sailfish-ljbades-sailfish\examples>python.exe lbm_ldc.py
                              Traceback (most recent call last):
                                File "lbm_ldc.py", line 3, in <module>
                                  import numpy
                              ImportError: No module named numpy

                               

                              Could you please give me steps required to run your lbm_ldc.py samples?