Hi,
I'm currently working on a __kernel which executes within a loop, say 50000 times. Kernel code looks like below:
__kernel void forward_euler3(
__global double2 *p0,
__global double2 *p1,
__global double2 *p2,
__global const double2 *v,
__constant double *dx,
__constant double2 *tw,
__constant double *sm1,
__constant double *sm2) {
uint x = get_global_id(0);
uint y = get_global_id(1);
uint width = get_global_size(0);
uint height = get_global_size(1);
uint id = x + y * width;
double2 thp;
double2 p10;
double2 p11;
double2 p12;
double2 p13;
double2 p14;
if( x >= 1 && x < (width-1) && y >= 1 && y < height - 1) {
p10 = p1[id];
p11 = p1[id+width];
p12 = p1[id-width];
p13 = p1[id+1];
p14 = p1[id-1];
thp.x = v[id].x*p10.x - v[id].y*p10.y - (-2.0*p10.x + p11.x + p12.x)/ *dx/ *dx/2.0/ *sm1 - (-2.0*p10.x + p14.x + p13.x)/ *dx/ *dx/2.0/ *sm2;
thp.y = v[id].x*p10.y + v[id].y*p10.x - (-2.0*p10.y + p11.y + p12.y)/ *dx/ *dx/2.0/ *sm1 - (-2.0*p10.y + p14.y + p13.y)/ *dx/ *dx/2.0/ *sm2;
p2[id].x = p0[id].x - 2.0* (*tw).y*thp.y;
p2[id].y = p0[id].y + 2.0* (*tw).y*thp.x;
} else {
p2[id].x = 0;
p2[id].y = 0;
p0[id].x=p1[id].x;
p0[id].y=p1[id].y;
}
}
Kernel fires with double2 input, which represents a complex number in this case. Of course
extenstion is set.
cl_khr_fp64
That's an iterative function related to quantum mechanics, but actually it doesnt matter in this case. As written above, such __kernel runs in a loop, but every iteration has different input arguments order (actually just the p* ones). I accomplish it by setting different buffer for given iteration, which runs like below:
if(i%3==1) {
_kernel.setArg(0,b3);
_kernel.setArg(1,b1);
_kernel.setArg(2,b2);
} else if(i%3==0) {
_kernel.setArg(0,b2);
_kernel.setArg(1,b3);
_kernel.setArg(2,b1);
} else if(i%3==2) {
_kernel.setArg(0,b1);
_kernel.setArg(1,b2);
_kernel.setArg(2,b3);
}
Such method with switching input gives me a kernel behaviour like a normal iterative method fired on CPU [within a loop, when a t while is known, after computations inside loop we know while t+2dt].
Ok, that's a short draft what I'm doing. Now, let me ask about a few things.
Let me start from host one. I have a dual core CPU, which gives 2 work-items. Running above code on CPU gives always correct results. Even without such else condition, which is just for GPU. Actually, I have no idea what is this for while running kernel on GPU, but without it kernel usually simply crash.
"Nulling" p2 is done before kernel even starts. The next two instructions represents copying data from p1 to p0, which might be translated as "give me t+dt while as current one - that's done for next iteration". Yes, but these two instructions are done, when I'm switching the order of input arguments [see second quotation]. I manage buffers such that they are in the same order every third iteration, which should be enough in this case and instructions
shouldn't even exist. But as I said, without it kernel simply crash - does anyone have an explanation for that ?
p0[id].*=p1[id].*;
I would appreciate any feedback. I know that things describes above may be quite difficult, but such problems get me a headache every day.
I have a dual core CPU, which gives 2 work-items. Running above code on CPU gives always correct results. Even without such else condition, which is just for GPU. Actually, I have no idea what is this for while running kernel on GPU, but without it kernel usually simply crash.
The difference in results between GPU and CPU is expected as CPU got better precision, but you should not encounter a crash. Could you post the source code? or a testcase which produces the crash?
http://www.khronos.org/message_boards/viewtopic.php?f=28&t=2440 [which is really helpful cause fully describes memory flags for buffers - and OpenCL doc doesn't, anyway for me]. How ATI cards deal with it? I mean for instance CL_MEM_USE_HOST_PTR flag. Does the ATI OpenCL implementation ensures that kernel issues on the GPU with allocated GPU memory or simply uses PCI Express bus? I assume that GPU memory accessibility is much faster than PciExpress bus+host memory. And if implementation uses GPU memory, does it copy the data or it's user duty [regarding to CL_MEM_USE_HOST_PTR flag]?
Currently the flags does not make any difference.
Hi!
I have a dual core CPU, which gives 2 work-items. Running above code on CPU gives always correct results. Even without such else condition, which is just for GPU. Actually, I have no idea what is this for while running kernel on GPU, but without it kernel usually simply crash.
The difference in results between GPU and CPU is expected as CPU got better precision, but you should not encounter a crash. Could you post the source code? or a testcase which produces the crash?
Could you explain why GPU precision is not as good as CPU one? I suppose it might be due to the architecture [which probably hasn't built for double precision computations]. And would be best if you could point some existing docs or something. I could cite it in my doc.
Well, develop a simple testcase is quite difficult due to my program complexity. The base code is written in Fortran and just a chunk bases on C++ [of course it is the most 'time-consuming' part, which GPU issues much faster].
Anyway, I tried to write a simple testcase to reproduce the problem and it was almost done. However I found a very nasty bug, which ruined the whole computations so far. It seems that problem is solved. Sorry for spamming such big and pointless thread.
http://www.khronos.org/message_boards/viewtopic.php?f=28&t=2440 [which is really helpful cause fully describes memory flags for buffers - and OpenCL doc doesn't, anyway for me]. How ATI cards deal with it? I mean for instance CL_MEM_USE_HOST_PTR flag. Does the ATI OpenCL implementation ensures that kernel issues on the GPU with allocated GPU memory or simply uses PCI Express bus? I assume that GPU memory accessibility is much faster than PciExpress bus+host memory. And if implementation uses GPU memory, does it copy the data or it's user duty [regarding to CL_MEM_USE_HOST_PTR flag]?
Currently the flags does not make any difference.
That's quite unexpected answer 🙂 Could you explain from your point of view how ATI implements it? Best would be to point where ATI implementation varies OpenCL specification.
thanks for your help!
bump?