Archives Discussions

lupescu_grigore · ‎03-27-2011

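Given the following kernel:

__kernel void testKernel(__global float* input, __global float* output, int nrElements)
{
    uint x = get_global_id(0);

    /* Shift left by one element; the last slot wraps around to input[0]. */
    if (x < (nrElements - 1))
        output[x] = input[x + 1];
    else
        output[x] = input[0];

    mem_fence(CLK_LOCAL_MEM_FENCE);

    /* Add 1 to this work-item's own element. */
    output[x]++;
}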

And the input 0 1 0 1 0 1 0 1 0 1

One would expect 2 1 2 1 2 1 2 1 2 1 as output (shift + add 1).
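The expected result for the sample input can be checked on the host. The following is a minimal sketch in plain C, assuming the kernel's intent is output[x] = input[x+1] + 1 with the last element wrapping to input[0]; reference_shift_add is a hypothetical helper name, not part of any SDK.

```c
#include <assert.h>

/* Host-side reference for the kernel's intent (an assumption of this sketch):
   output[x] = input[x + 1] + 1, with the last element wrapping to input[0]. */
static void reference_shift_add(const float *input, float *output, int n)
{
    assert(n > 0);
    for (int x = 0; x < n; ++x) {
        /* Same branch as the kernel: shift, or wrap around for the last slot. */
        output[x] = (x < n - 1) ? input[x + 1] : input[0];
        /* Same increment as the kernel's final statement. */
        output[x] += 1.0f;
    }
}
```

Comparing the buffer read back from the device against this reference, element by element, shows which indices differ.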

Results are :

No mem_fence

CPU Core i3 OK

GPU 5470 OK

With mem_fence

CPU Core i3 OK

GPU 5470 ERROR - values got { 1 2 3 4 5 } ... 2 1 2 1 2 ... 3 2 3 2 .. 5 4 5 4...

Inputs need not be very long. 100-1000 will do.

Am I missing something?

Meteorhead · ‎03-27-2011

You are trying to mem_fence a variable with the wrong address space qualifier. output is a __global variable and you are using CLK_LOCAL_MEM_FENCE.

You should be using CLK_GLOBAL_MEM_FENCE.

Please read the related section of the OpenCL specification; it states clearly what the purpose of this function is (memory operations before and after the fence do not mix in the specified address space). Also read about the restrictions on using sync commands of this type.

nou · ‎03-27-2011

IMHO in this case it should translate into a NOP or have no effect. I don't see a reason why it should give a wrong result.

lupescu_grigore · ‎03-28-2011

Originally posted by: Meteorhead You are trying to mem_fence a variable with the wrong address space qualifier. output is a __global variable and you are using CLK_LOCAL_MEM_FENCE.

You should be using CLK_GLOBAL_MEM_FENCE.

Please read the related section of the OpenCL specification, it states clearly what the purpose of this function is. (All memory operations before and after the fence do not mix in the specified address space) Read also for restrictions on using sync commands of this type.

Neither fence nor barrier works in that kernel. I also tried the GLOBAL type. In both cases the problem seems to arise from their use: without the fence/barrier, the kernel works.

himanshu_gautam · ‎03-28-2011

I also agree with nou that these fences should be converted to NOPs.

Anyhow, I think your program doesn't need any fences at all.

But can you please send a test case along with information about the SDK, driver, and OS?

lupescu_grigore · ‎03-28-2011

Ubuntu 10.10 maverick, 64 bit

ATI Catalyst 11.2

ATI Stream SDK 2.3

himanshu_gautam · ‎03-28-2011

lupescu_grigore,

Ubuntu 10.10 is not a supported Operating system. See http://developer.amd.com/gpu/AMDAPPSDK/pages/DriverCompatibility.aspx

So it is not officially guaranteed to work for you. But I hope the SDK samples are working for you. Are they?

Before concluding that it is a bug, I suggest you try a supported system and share your results. You can also post the host code so that other developers can try it on their systems if they wish.

lupescu_grigore · ‎03-28-2011

The SDK samples and other kernels I wrote are working fine.

nou · ‎03-28-2011

BTW, I put that kernel into SKA, and when I commented out that fence it did have an effect.

And I don't see a reason why there should be a fence.

lupescu_grigore · ‎03-28-2011

The kernel is just an example. I wanted to make sure no problems occur before moving on to a more complex problem. I want to use a fence/barrier in a 1D FFT between iterations, so that no problems arise from multiple threads accessing the same memory.

Coming back to the posted kernel, why shouldn't I use a fence/barrier in this case? Isn't there a chance that a thread would add 1 to memory before the whole shift has taken place? (The whole point of the kernel was to test multiple memory accesses with and without fences/barriers.)

himanshu_gautam · ‎03-28-2011

There is no need for synchronization between operations of the same thread. As no thread modifies the input array, no sync is needed.

Anyhow, one is not allowed to sync at the global level; synchronization is only available at the work-group level.

EDIT: I think most of the samples use barriers/mem_fence, so it might be something else.
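For contrast, here is a case where work-group-level synchronization does matter: each work-item stores a value to a shared tile and then reads the slot written by its neighbour. The sketch below emulates this in plain C on the host; GROUP_SIZE, the function names, and the stale sentinel are assumptions made for illustration, and in a real kernel the tile would be __local memory with barrier(CLK_LOCAL_MEM_FENCE) between the two phases.

```c
#include <assert.h>

#define GROUP_SIZE 8  /* hypothetical work-group size for this sketch */

/* With a barrier: every work-item stores to the shared tile first,
   and only then does any work-item read its neighbour's slot. */
static void neighbour_with_barrier(const float *input, float *output)
{
    float tile[GROUP_SIZE];
    assert(input && output);
    for (int lid = 0; lid < GROUP_SIZE; ++lid)
        tile[lid] = input[lid];
    /* barrier(CLK_LOCAL_MEM_FENCE) would sit here in the kernel. */
    for (int lid = 0; lid < GROUP_SIZE; ++lid)
        output[lid] = tile[(lid + 1) % GROUP_SIZE] + 1.0f;
}

/* Without a barrier: one possible schedule in which each work-item runs
   both steps back to back, so most of them read a neighbour slot that
   has not been written yet (modelled here by the `stale` sentinel). */
static void neighbour_without_barrier(const float *input, float *output, float stale)
{
    float tile[GROUP_SIZE];
    assert(input && output);
    for (int lid = 0; lid < GROUP_SIZE; ++lid)
        tile[lid] = stale;
    for (int lid = 0; lid < GROUP_SIZE; ++lid) {
        tile[lid] = input[lid];
        output[lid] = tile[(lid + 1) % GROUP_SIZE] + 1.0f;
    }
}
```

In this emulated schedule only the last work-item sees valid neighbour data without the barrier. The posted kernel differs because no work-item ever reads a location written by another.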

lupescu_grigore · ‎03-28-2011

The output is the one in question. To be more precise, between output[x]=input[x+1]; and output[x]++; another thread X may step in and increment before thread Y has had a chance to shift.

I will double-check more examples (I have only tried a couple: nbody, Black-Scholes, bitonic sort). I am trying to see the effect on Windows but am having trouble after upgrading to SDK 2.3 (from 2.2): CAL version mismatch.

himanshu_gautam · ‎03-28-2011

The first instruction sets output[x] to input[x+1] and the second increments output[x] by 1. No other thread accesses output[x], so no other thread can step in between the two.

I also suggest you try with a supported configuration as Ubuntu 10.10 is not supported.

Also refer to http://forums.amd.com/forum/messageview.cfm?catid=390&threadid=145590&highlight_key=y for CAL version mismatch problem

Thanks
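The argument that no other thread can step in can be checked with a small host-side emulation. The sketch below, in plain C, splits the kernel into its two steps and runs them under two different schedules; the function names are hypothetical, and the sequential loops only model execution order, not real GPU scheduling.

```c
#include <assert.h>

/* Step 1 of the kernel for work-item x: the shift. Writes only output[x]. */
static void shift_step(const float *input, float *output, int n, int x)
{
    output[x] = (x < n - 1) ? input[x + 1] : input[0];
}

/* Step 2 of the kernel for work-item x: the increment. Touches only output[x]. */
static void increment_step(float *output, int x)
{
    output[x] += 1.0f;
}

/* Schedule A: every work-item finishes the shift before any increment runs. */
static void run_phased(const float *input, float *output, int n)
{
    assert(n > 0);
    for (int x = 0; x < n; ++x)
        shift_step(input, output, n, x);
    for (int x = 0; x < n; ++x)
        increment_step(output, x);
}

/* Schedule B: each work-item runs both steps back to back, in reverse
   order, so some items increment before others have shifted. */
static void run_unsynchronized(const float *input, float *output, int n)
{
    assert(n > 0);
    for (int x = n - 1; x >= 0; --x) {
        shift_step(input, output, n, x);
        increment_step(output, x);
    }
}
```

Both schedules produce the same output array, because each work-item reads only the unmodified input and writes only its own output[x].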

lupescu_grigore · ‎03-28-2011

You're right, my bad.

I installed the latest Catalyst version and it worked, thanks.

I will test again on both Windows and Linux, with this version plus another one regarding fences. Windows = Server 2008 R2 (I am aware it's not supported).


OpenCL mem_fence