blelump

Running kernel in a loop; how does CL_MEM_USE_HOST_PTR flag work

Discussion created by blelump on Apr 8, 2010
Latest reply on Apr 16, 2010 by blelump

Hi,

I'm currently working on a __kernel that executes within a loop, say 50,000 times. The kernel code looks like this:

#pragma OPENCL EXTENSION cl_khr_fp64 : enable  /* double precision, see note below */

__kernel void forward_euler3(
      __global double2 *p0,
      __global double2 *p1,
      __global double2 *p2,
      __global const double2 *v,
      __constant double *dx,
      __constant double2 *tw,
      __constant double *sm1,
      __constant double *sm2) {

  uint x = get_global_id(0);
  uint y = get_global_id(1);

  uint width = get_global_size(0);
  uint height = get_global_size(1);

  uint id = x + y * width;
 
  double2 thp;
 
  double2 p10;
  double2 p11;
  double2 p12;
  double2 p13;
  double2 p14;

  if( x >= 1 && x < (width-1) && y >= 1 && y < height - 1) {
   
    p10 = p1[id];
    p11 = p1[id+width];
    p12 = p1[id-width];
    p13 = p1[id+1];
    p14 = p1[id-1];
 
    thp.x = v[id].x*p10.x - v[id].y*p10.y
          - (-2.0*p10.x + p11.x + p12.x) / (*dx) / (*dx) / 2.0 / (*sm1)
          - (-2.0*p10.x + p14.x + p13.x) / (*dx) / (*dx) / 2.0 / (*sm2);
    thp.y = v[id].x*p10.y + v[id].y*p10.x
          - (-2.0*p10.y + p11.y + p12.y) / (*dx) / (*dx) / 2.0 / (*sm1)
          - (-2.0*p10.y + p14.y + p13.y) / (*dx) / (*dx) / 2.0 / (*sm2);

    p2[id].x = p0[id].x - 2.0* (*tw).y*thp.y;
    p2[id].y = p0[id].y + 2.0* (*tw).y*thp.x;

  } else {
    p2[id].x = 0;
    p2[id].y = 0;
    p0[id].x=p1[id].x;
    p0[id].y=p1[id].y;
  }
}

The kernel operates on double2 inputs, which represent complex numbers in this case. The cl_khr_fp64 extension is, of course, enabled.

It's an iterative function related to quantum mechanics, but that doesn't really matter here. As written above, the __kernel runs in a loop, and every iteration takes its input arguments in a different order (actually just the p* ones). I accomplish this by binding a different buffer to each argument for a given iteration, like this:

if (i % 3 == 1) {
  _kernel.setArg(0, b3);
  _kernel.setArg(1, b1);
  _kernel.setArg(2, b2);
} else if (i % 3 == 0) {
  _kernel.setArg(0, b2);
  _kernel.setArg(1, b3);
  _kernel.setArg(2, b1);
} else if (i % 3 == 2) {
  _kernel.setArg(0, b1);
  _kernel.setArg(1, b2);
  _kernel.setArg(2, b3);
}

Switching the inputs this way makes the kernel behave like an ordinary iterative method run in a loop on the CPU: when the state at time t is known, after the computations inside the loop we know the state at t+2dt.
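
For context, here is a minimal sketch of how I drive the kernel from the host, using the C++ bindings; queue, b1-b3, width and height are placeholders for my actual objects, so treat it as a sketch rather than the exact code:

for (int i = 0; i < 50000; ++i) {
  // rotate the roles of the three buffers, as shown above
  if (i % 3 == 1) {
    _kernel.setArg(0, b3); _kernel.setArg(1, b1); _kernel.setArg(2, b2);
  } else if (i % 3 == 0) {
    _kernel.setArg(0, b2); _kernel.setArg(1, b3); _kernel.setArg(2, b1);
  } else {
    _kernel.setArg(0, b1); _kernel.setArg(1, b2); _kernel.setArg(2, b3);
  }

  // the remaining arguments (v, dx, tw, sm1, sm2) stay the same every
  // iteration, so they are set once before the loop
  queue.enqueueNDRangeKernel(_kernel, cl::NullRange,
                             cl::NDRange(width, height), cl::NullRange);
}
queue.finish();  // results are read back only after the whole loop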

 

Ok, that's a short draft what I'm doing. Now, let me ask about a few things.

Let me start with the host side. I have a dual-core CPU, which the OpenCL runtime exposes as two compute units. Running the above code on the CPU always gives correct results, even without the else branch, which is there just for the GPU. Actually, I have no idea what it is for when running the kernel on the GPU, but without it the kernel usually simply crashes.

  • That's my first question: why is the else branch needed at all? It looks like without it the kernel doesn't know what to do when the if evaluates to false [undefined behaviour?]. If you have ever developed in VHDL, you will get my point quickly. For those who haven't: when describing hardware, every conditional must cover all cases, so an 'if' cannot exist without an 'else', otherwise you run into flip-flop/latch problems - I don't remember the details. Yes, in that case it's strictly a hardware 'issue'.

"Nulling" p2 is done before kernel even starts. The next two instructions represents copying data from p1 to p0, which might be translated as "give me t+dt while as current one - that's done for next iteration". Yes, but these two instructions are done, when I'm switching the order of input arguments [see second quotation]. I manage buffers such that they are in the same order every third iteration, which should be enough in this case and instructions

p0[id].*=p1[id].*;
shouldn't even exist. But as I said, without them the kernel simply crashes - does anyone have an explanation for that?
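
To make the rotation explicit, unrolling the if/else above gives the following mapping of buffers to kernel arguments over consecutive iterations:

// i % 3 == 0 :  p0 = b2,  p1 = b3,  p2 = b1
// i % 3 == 1 :  p0 = b3,  p1 = b1,  p2 = b2
// i % 3 == 2 :  p0 = b1,  p1 = b2,  p2 = b3
//
// The buffer written as p2 in one iteration is read as p1 in the next, and the
// old p1 becomes p0, so every third iteration the buffers return to the same
// roles - which is why the copy in the else branch should be redundant.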

 

  • And the next one, regarding this topic: http://www.khronos.org/message_boards/viewtopic.php?f=28&t=2440 [which is really helpful because it fully describes the memory flags for buffers - something the OpenCL documentation doesn't do, at least for me]. How do ATI cards handle these flags, for instance CL_MEM_USE_HOST_PTR? Does the ATI OpenCL implementation ensure that the kernel runs against memory allocated on the GPU, or does it simply access the data over the PCI Express bus? I assume GPU memory is much faster to access than host memory over PCI Express. And if the implementation does use GPU memory, does it copy the data itself, or is that the user's duty [given the CL_MEM_USE_HOST_PTR flag]? A sketch of the two variants I'm comparing is below.
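
For reference, this is roughly what I mean by the two buffer variants, using the C++ bindings; ctx and N are placeholders from the surrounding application, so it's only a sketch of the flags, not a statement about what the ATI runtime actually does internally:

#include <vector>
#include <CL/cl.hpp>   // OpenCL C++ bindings

// Sketch only: ctx and N come from the rest of the application.
void create_buffers(cl::Context &ctx, size_t N)
{
    std::vector<cl_double2> host(N);
    const size_t bytes = N * sizeof(cl_double2);

    // Variant A: the buffer is backed by my host allocation (CL_MEM_USE_HOST_PTR);
    // whether the implementation caches it in GPU memory or reads it over PCIe
    // is exactly what I'm asking about.
    cl::Buffer bufA(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, bytes, &host[0]);

    // Variant B: the runtime allocates its own buffer and copies my data once
    // at creation time (CL_MEM_COPY_HOST_PTR).
    cl::Buffer bufB(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, bytes, &host[0]);
}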

 

  • Last one: can double precision computations lose precision at all over this many iterations? So far that's my best explanation for why the result drifts away from the correct one during the iterations. A toy example of what I mean follows below.
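
A toy example of the kind of drift I mean, plain C++ on the host, nothing OpenCL-specific (dt = 0.001 is just an arbitrary step):

#include <cstdio>
#include <cmath>

int main()
{
    const double dt = 0.001;           // not exactly representable in binary
    double t = 0.0;
    for (int i = 0; i < 50000; ++i)    // same order of magnitude as my kernel loop
        t += dt;

    // Each addition rounds to 53 bits, so the accumulated sum drifts slightly
    // from the exact value 50000 * dt = 50.0.
    std::printf("t = %.17g, error = %g\n", t, std::fabs(t - 50.0));
    return 0;
}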

 

I would appreciate any feedback. I know the problems described above may be quite involved, but they give me a headache every day.
