Hi,
I have some doubts about the use of 'write_mem_fence' and 'barrier'.
Here is my code :
__local WorkItemData workItemData;
const size_t lid = get_local_id(0);
if (lid < 1)
{
    workItemData.sky = sky;
    workItemData.lightsCount = lightsCount;
    workItemData.emitters = emitters;
}
//---- Sync : wait until 'workItemData' is fully initialized
barrier(CLK_LOCAL_MEM_FENCE);
//write_mem_fence(CLK_LOCAL_MEM_FENCE);
The goal is to ensure that the variable 'workItemData' is fully initialized before it is used. Can I use a write_mem_fence instead? Would it be faster and more efficient?
Is there some other way to improve performance in this case? For example, if I put more operations between the initialization and the use, sometimes I don't need to synchronize (but that is dangerous)! Maybe there are other ways.
Thanks
As you point out, it's a synchronization issue, so it needs a synchronization operation. In OpenCL the only synchronization operations are barriers. A fence only guarantees that the compiler and hardware won't reorder writes across it; it won't make work items wait for each other.
Given that the hardware runs 64 work items simultaneously, if you have only 64 work items in your group you could in theory drop the barrier. However, in that case the compiler will remove it for you anyway, so in practice you might as well put it in. If you have more than 64 work items, you need the barrier to ensure that the other wavefronts wait for the one that did the writes.
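A minimal sketch of the pattern described above, put back into a full kernel. The kernel signature and the Sky/Emitter element types are assumptions for illustration; only the WorkItemData fields and the barrier come from the original post:

```c
// OpenCL C device code (sketch, assumed kernel signature and types).
typedef struct
{
    __global const Sky     *sky;         // assumed buffer element types
    __global const Emitter *emitters;
    int                     lightsCount;
} WorkItemData;

__kernel void shade(__global const Sky     *sky,
                    __global const Emitter *emitters,
                    int                     lightsCount)
{
    __local WorkItemData workItemData;

    // A single work item writes the shared local struct.
    if (get_local_id(0) == 0)
    {
        workItemData.sky         = sky;
        workItemData.emitters    = emitters;
        workItemData.lightsCount = lightsCount;
    }

    // barrier, not write_mem_fence: every work item in the group must
    // WAIT here, and CLK_LOCAL_MEM_FENCE makes the local writes visible
    // to all of them. A fence alone only orders memory operations within
    // the work item that executes it; it does not make the other work
    // items wait for the writer.
    barrier(CLK_LOCAL_MEM_FENCE);

    // ... from here on, every work item can read workItemData safely ...
}
```

Note that barrier() must be reached by all work items in the group (it cannot sit inside divergent control flow), which is why it is placed after the if block rather than inside it.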
What kind of data are sky, emitters and lightsCount? This post would be helpful, though.
Will it be faster and more efficient? -- I would compare the generated assembly for both versions with the Kernel Analyzer.
These are pointers to buffers, so my goal is to ensure they are available in all the work items.
Not sure that analyzing the binaries will help! It is a question about synchronization 😛, not instructions.
Thanks a lot, Lee, for your complete explanation.