Archives Discussions

abhisrn · ‎10-15-2012

I am actually looping an openCL call to kernel several times.

In my openCL kernel the current value at a particular location in a given workgroup is updated according to the neighboring values from the previous iteration in the loop, but when the neighbor is from a previous workgroup then that value is not considered at all while updating the current location value in the current workgroup.

However the values from previous iteration from the neighbours which are in the current workgroup or in the next workgroup are considered.

As far as I understand, the values from previous workgroup have been updated before the current workgroup and hence these values cannot be considered as they are not the previous iteration values but current iteration values(as I use memory barriers which is necessary for synchronization).

I tried to use another buffer for storing the previous iteration values but the same issue is still present.

Is there anyway to solve this issue in openCL?

I have also attached the openCL kernel for reference.

LeeHowes · ‎10-15-2012

There are no guarantees on launch order of workgroups. Even though the AMD runtime does launch in the obvious order, there is no guarantee of completion order given the affects of memory operations and the like. Fences only ensure an ordering within the work item (in effect within the wavefront) not between workgroups. Barriers only synchronize within a given group, not between groups.

Officially there is no way to communicate between workgroups short of kernel completion (ie a global barrier). You can make it work with care, but without real portability.

abhisrn · ‎10-15-2012

Thanks a lot Lee Howes, that was very helpful but still I have some confusion.

As far as I know we could control the order of the commands executed, here in our case the kernel command is called within a loop in the host program and in each iteration we wait for the commands to complete(using clFinish), assign the buffer values to an another buffer(consider it as previous values buffer) and then continue to the next iteration, where the kernel is again called as shown below.

loop n times

{

1.call kernel(uses the previous buffer values for updating the current buffer values, which includes the 4 neighbours if we consider a 2d grid)

2.wait for the kernel command to finish.

3.copy the current buffer values to the previous buffer.

4.wait for all commands to finish.

}

The values from the previous values buffer could be used to update the current buffer values , however the current buffer values are updated based on the neighboring values(in the corresponding previous values buffer from previous iteration) within the same work-group(wave front) or the next work-groups but the previous work-group values in the previous values buffer are completely neglected by OpenCL. Theoretically all the neighbors including corresponding values in the previous work-groups if present should also be considered. After each clFinish all the values in current buffer are updated then only we copy these values to the previous values buffer, so theses values are available in the next kernel call. My point is why its not working as expected even tough, previous values buffer is declared global and also a read only buffer so we cannot assign values to previous values buffer within the kernel?

Sorry for such a long explanation , I wanted to make my problem clear. The problem becomes clearer when attached kernel code is seen.

LeeHowes · ‎10-15-2012

Ah. Using clFinish in that way won't be very efficient, the queue should be in-order anyway so you should just be able to loop.

However, that doesn't answer your question. The previous values buffer should have all the values whether you use clFinish or you just put the kernel enqueue after the previous copy operation in the queue.

Presumably Set_Bnd_Projection doesn't update p at all? You do need the second buffer because if you try to update the same buffer you *may* write the values out early, it's just you can't guarantee whether they will or will not be written out earlier and hence read by neighbouring groups. As you use separate buffers in this case it really should work.

Can you try reading out the buffers at each stage to check that it is getting updated as you expect, and then you can narrow the problem down a little?

abhisrn · ‎10-19-2012

Hello Lee Howes,

Thanks for replying,I have removed the clFinish in order to make the code more effecient.

I have also removed the Set_Bnd_Projection and created a new kernel which is called from inside the loop in the host program instead of calling it in the Projection_Solver kernel, in order to remove side effects which might be caused by the Set_Bnd_Projection. I also read out the buffers at each stage(using AMD gDEBugger and by reading out in the host program), the same issue as before still exists eventough I am using two different buffers exactly like before. Also I have attached the Set_Bnd_Projection kernel in order to see if this might be causing the problem.

notzed · ‎10-15-2012

Are you missing any code in that example - why do you have local barriers if you don't use any local memory? Is set_bnd_projection() updating or accessing *any* value outside of p[index] or obstacles[index]? From that code this is your likely problem. If it has to access data outside that generated in the workgroup, you need to run it as a separate pass.

The general model you're explaining works, although if the buffers are the same size just swap the buffers and don't bother copying.

abhisrn · ‎10-19-2012

Hello notzed,

Thanks for the suggestion. I did run the set_bnd_projection() in a seperate pass but the same problem still exists and also the method only needs the global data from p and obstacles buffer.

notzed · ‎10-19-2012

As I suspected - you're accessing items outside of get_global_id() and then writing to get_global_id().

This will not work because you don't know when the kernel will run, and they can't all run synchronously and in-step. i.e. a later kernel may see a different number in x[left/right/etc] than a previous one did.

Assuming the algorithm will work this way, you need to add another buffer, say 'xnext', and write the new value from set_bnd_projection to 'xnext[index]'. Then feed that in to the next loop as 'x'.

i.e. something like

1. solver(x, obstacles, ...)

2. set_bnd_projection(x, obstacles, xnext)

(x is only ever read, obstacles is only read, xnext is only ever written to)

3. swap x and xnext

(can just be done on the host by swapping the memory objects)

abhisrn · ‎10-19-2012

Thanks for the reply notzed.

Pardon, I didnt get the sentence "accessing items outside of get_global_id() and then writing to get_global_id()"

I might be worng, as far as I know the commands are executed in order in a queue(so each kernel is executed in order, when the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE is not set, so by default the commands execute in order).

I did use one more buffer as you suggested but it turns out the same issue is still present.

notzed · ‎10-19-2012

"accssing items outside get_global_id()"

I mean accessing array elements other than "i", where "i=get_global_id(0)"

Since you write to array in the kernel, you can't read from anything other than array in the same kernel - otherwise other work-items might see a different value.

Hence 'accessing items out side of'.

Each kernel itself runs on many cpu's concurrently - on each of those you do not have control of the order. I should have said work-groups or work-items,

abhisrn · ‎10-22-2012

Ah, now I get it. Anyway, I did follow the steps provided by you. I have attached the new Set_Bnd_Projection() kernel, where I call it in a seperate pass rather than within the kernel Projection_Solver() and also I did use a new buffer for only reading the values. I have attached new Set_Bnd_Projection kernel , where you can see I did use a buffer for only reading the values and another buffer for writng values. However, the same issue still exists , i.e the values from work groups which are previous to the current work group in the new buffer (used only for reading in the kernel) are not visible to the work items of the current workgroup.

notzed · ‎10-22-2012

Well, unless the code relies on some 'cascade' effect of changing inputs depending on previous work-items within the same 'outer loop', those changes should work.

Probably best to include the host code too, incase you missed something there.

abhisrn · ‎10-22-2012

Thanks for replying notzed.

For the host code I am using Cloo which is C# wrapper.

The host code is given below:

public void Line_Solve_Projection (Cloo.ComputeBuffer<int> xGridSize, Cloo.ComputeBuffer<int> yGridSize,

Cloo.ComputeBuffer<int> zGridSize, Cloo.ComputeBuffer<float> p, Cloo.ComputeBuffer<float> p0,

Cloo.ComputeBuffer<float> div,Cloo.ComputeBuffer<int> obstacles, float a, float c)

{

int l;

for (l=0; l<20; l++)

{

//Projection_Solver kernel call

projectionSolverKernel.SetMemoryArgument(0, xGridSize);

projectionSolverKernel.SetMemoryArgument(1, yGridSize);

projectionSolverKernel.SetMemoryArgument(2, zGridSize);

projectionSolverKernel.SetMemoryArgument(3, p);

projectionSolverKernel.SetMemoryArgument(4, p0);

projectionSolverKernel.SetMemoryArgument(5,div);

projectionSolverKernel.SetMemoryArgument(6, obstacles);

projectionSolverKernel.SetValueArgument(7,a);

projectionSolverKernel.SetValueArgument(8,c);

commands.Execute(projectionSolverKernel, null,mSize,mWorkGroup3DSize,null);

//copy the the values from p(which used only for writing in the kernels) to p0(which is used only for reading)

commands.CopyBuffer(p,p0,null);

//Set_Bnd_Projection kernel call

setBoundaryPressuresKernel.SetMemoryArgument(0,p);

setBoundaryPressuresKernel.SetMemoryArgument(1,p0);

setBoundaryPressuresKernel.SetMemoryArgument(2,obstacles);

commands.Execute(setBoundaryPressuresKernel,null,mSize,mWorkGroup3DSize,null);

//copy the the values from p(which used only for writing in the kernels) to p0(which is used only for reading)

commands.CopyBuffer(p,p0,null);

;

}

abhisrn · ‎10-22-2012

Finally , found that the problem was that I was using the get_local_id() in the if conditions(for the boundaries of the grid in Set_bnd_Projection and Projection_Solver Kernels) instead I should have used the get_global_id().

Thanks a lot everyone for your replies.

Archives Discussions

OpenCL Synchronization between workgroups.