You are trying to mem_fence a variable with the wrong address space qualifier. output is a __global variable and you are using CLK_LOCAL_MEM_FENCE.
You should be using CLK_GLOBAL_MEM_FENCE.
Please read the related section of the OpenCL specification; it states clearly what the purpose of this function is (loads and stores issued before the fence are ordered before those issued after it, for the specified address space). Also read the restrictions on using sync commands of this type.
IMHO in this case it should translate into a NOP or have no effect at all. Either way, I don't see a reason why it should give a wrong result.
Originally posted by: Meteorhead
You are trying to mem_fence a variable with the wrong address space qualifier. output is a __global variable and you are using CLK_LOCAL_MEM_FENCE.
You should be using CLK_GLOBAL_MEM_FENCE.
Please read the related section of the OpenCL specification; it states clearly what the purpose of this function is (loads and stores issued before the fence are ordered before those issued after it, for the specified address space). Also read the restrictions on using sync commands of this type.
Neither fence nor barrier works in that kernel. I also tried the GLOBAL type. In both cases the problem seems to arise from their use: without the fence/barrier the kernel works.
I also agree with nou that these fences should be converted to NOPs.
Anyhow, I think your program doesn't need any fences at all.
But can you please send a test case along with information about the SDK, driver and OS?
Ubuntu 10.10 maverick, 64 bit
ATI Catalyst 11.2
ATI Stream SDK 2.3
lupescu_grigore,
Ubuntu 10.10 is not a supported operating system. See http://developer.amd.com/gpu/AMDAPPSDK/pages/DriverCompatibility.aspx
So it is not officially guaranteed to work for you. But I hope the SDK samples are working for you. Are they?
Before concluding that it is a bug, I suggest you try a supported system and share your results. You can also post the host code so that other developers can try it on their systems if they wish.
SDK samples and other kernels I wrote are working fine.
BTW I put that kernel into SKA, and commenting out that fence does have an effect.
And I don't see a reason why there should be a fence.
The kernel is just an example. I wanted to make sure no problems occur before moving on to a more complex problem. I want to use fence/barrier in a 1D FFT between iterations, so that no problems occur because of access from multiple threads.
Coming back to the posted kernel, why shouldn't I use a fence/barrier in this case? Isn't there a chance a thread would get to add 1 to memory before the whole shift takes place? (The whole point of the kernel was to test multiple memory accesses with and without fences/barriers.)
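For the FFT case mentioned above, where work-items read values written by other work-items between iterations, a barrier does matter. A minimal sketch of that situation follows, in plain C rather than OpenCL, modelling one work-group by running its work-items one after another. The function names, the scratch buffer and the doubling step are all hypothetical and are not taken from the posted kernel. In a real kernel the scratch buffer would typically be __local and the marked point would be a barrier(CLK_LOCAL_MEM_FENCE) call.

```c
#include <stddef.h>

/* Hypothetical stage: each work-item writes its own scratch slot, then
   reads the slot of its right-hand neighbour (wrapping around). */

/* No barrier: a work-item may read its neighbour's slot before the
   neighbour has written it. In this sequential model that means every
   item except the last one reads a stale value. */
void stage_without_barrier(const int *input, int *scratch, int *output, size_t n) {
    for (size_t gid = 0; gid < n; ++gid) {
        scratch[gid] = input[gid] * 2;          /* write own slot */
        output[gid] = scratch[(gid + 1) % n];   /* neighbour may not be written yet */
    }
}

/* With a barrier: all writes complete before any work-item reads. */
void stage_with_barrier(const int *input, int *scratch, int *output, size_t n) {
    for (size_t gid = 0; gid < n; ++gid) {
        scratch[gid] = input[gid] * 2;          /* phase 1: write own slot */
    }
    /* <- the work-group barrier would sit here in a real kernel */
    for (size_t gid = 0; gid < n; ++gid) {
        output[gid] = scratch[(gid + 1) % n];   /* phase 2: read neighbour */
    }
}
```

Calling both functions on the same input with a zero-initialized scratch buffer shows the difference: only the barrier version returns the neighbour's doubled value for every element.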
There is no need for synchronization between operations of the same thread. As no thread modifies the input array, no sync is needed.
Anyhow, one is not allowed to sync at the global level; synchronization is only available at the work-group level.
EDIT: I think most of the samples would be using barriers/mem_fence. So it might be something else.
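To illustrate the point about not needing synchronization, here is a small model in plain C (not OpenCL). It assumes a kernel in which each work-item shifts only its own element and then adds 1 to that same element. The posted kernel is not shown in the thread, so the function names and the exact operations are hypothetical. Under that assumption, the order in which work-items run does not change the result.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel: every work-item touches only
   the element at its own index. */
static void shift_step(const int *input, int *output, size_t gid) {
    output[gid] = input[gid] << 1;   /* shift own element */
}

static void add_step(int *output, size_t gid) {
    output[gid] += 1;                /* add 1 to own element */
}

/* Schedule A: each work-item finishes both steps before the next starts. */
void run_item_by_item(const int *input, int *output, size_t n) {
    for (size_t gid = 0; gid < n; ++gid) {
        shift_step(input, output, gid);
        add_step(output, gid);
    }
}

/* Schedule B: all shifts run first (in reverse order), then all adds,
   as if a barrier separated the two steps. */
void run_phase_by_phase(const int *input, int *output, size_t n) {
    for (size_t gid = n; gid-- > 0; ) {
        shift_step(input, output, gid);
    }
    for (size_t gid = 0; gid < n; ++gid) {
        add_step(output, gid);
    }
}
```

Both schedules produce 2 * input[i] + 1 for every element, because no work-item ever reads or writes another work-item's slot.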
The output is the one in question, to be more precise between
output
and increase before thread Y had a chance to shift.
I will double-check more examples (I only tried a couple: nbody, black scholes, bitonic sort). I am trying to see the effect on Windows but am having trouble after upgrading to SDK 2.3 (from 2.2): CAL version mismatch.
The first instruction sets output
I also suggest you try with a supported configuration as Ubuntu 10.10 is not supported.
Also refer to http://forums.amd.com/forum/messageview.cfm?catid=390&threadid=145590&highlight_key=y for the CAL version mismatch problem.
Thanks
You're right, my bad
I installed the latest Catalyst version and it worked, thanks.
I will test again on both Windows and Linux, with this version and another one regarding fences. Windows = Server 2008 R2 (I am aware it's not supported).