unexpected barrier / mem_fence behavior

Discussion created by philips on Jun 17, 2010
Latest reply on Aug 14, 2010 by philips
barrier necessary even though mem_fence should be enough

 

The following is a piece of code that I use in order to batch a group of operations. Since I am porting from CUDA, I have kept the warp structure. The CUDA code, however, did not use any synchronization barriers.

 

At the start of the kernel I call fetchWorkFirst. Here it doesn't seem to matter whether I use barrier or mem_fence.

Then there is a loop, at the end of which I call fetchWorkNext (see the sketch below).
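
For context, the surrounding kernel looks roughly like the sketch below. The kernel name, the termination condition, the processing step, and the typedef for S32 are placeholders I have assumed; only the calls to fetchWorkFirst and fetchWorkNext come from the code in this post, and how sharedTemp is shared between the two 32-wide rows of the workgroup is not shown here.

typedef int S32;   // assumed: 32-bit signed int typedef carried over from the CUDA port

__kernel void processBatches(__global int* g_warpCounter, int batchSize)   // hypothetical kernel
{
    volatile __local S32 sharedTemp;   // broadcast slot written by work item 0

    int warp, batchCounter;
    fetchWorkFirst(&warp, &batchCounter, g_warpCounter, batchSize, &sharedTemp);

    while (warp < TOTAL_WORK)   // placeholder termination condition
    {
        // ... process the item identified by 'warp' ...

        fetchWorkNext(&warp, &batchCounter, g_warpCounter, batchSize, &sharedTemp);
    }
}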

 

 

The workgroup size is 32 x 2. The work items with get_local_id(0) == 0 read from a global variable to find out which data their 32 work items should work on and subsequently write that value to a local variable.

 

Now I would think that a mem_fence(CLK_LOCAL_MEM_FENCE) should be sufficient, so that the other 31 work items would have to wait to read the value in the local variable that was just written by work item 0.

However, with mem_fence the result of the kernel is wrong; with a barrier it works.

Why does mem_fence not work here?


void fetchWorkFirst(int* warp, int* batchCounter, __global int* g_warpCounter, int batchSize, volatile __local S32* sharedTemp)
{
    if (get_local_id(0) == 0)
        *sharedTemp = atom_add(g_warpCounter, batchSize);   // work item 0 reserves a batch

    mem_fence(CLK_LOCAL_MEM_FENCE);

    *warp = *sharedTemp;
    *batchCounter = batchSize;
}

void fetchWorkNext(int* warp, int* batchCounter, __global int* g_warpCounter, int batchSize, volatile __local S32* sharedTemp)
{
    (*batchCounter)--;

    if (*batchCounter > 0)
        (*warp)++;
    else
    {
        if (get_local_id(0) == 0)
            *sharedTemp = atom_add(g_warpCounter, batchSize);   // work item 0 reserves the next batch

        barrier(CLK_LOCAL_MEM_FENCE);

        *batchCounter = batchSize;
        *warp = *sharedTemp;
    }
}
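
For comparison, the variant that gives the wrong results would be fetchWorkNext with the barrier replaced by a fence. This is a reconstruction based on the description above, not code from the post:

void fetchWorkNext_fenceOnly(int* warp, int* batchCounter, __global int* g_warpCounter, int batchSize, volatile __local S32* sharedTemp)
{
    (*batchCounter)--;

    if (*batchCounter > 0)
        (*warp)++;
    else
    {
        if (get_local_id(0) == 0)
            *sharedTemp = atom_add(g_warpCounter, batchSize);

        mem_fence(CLK_LOCAL_MEM_FENCE);   // fence instead of barrier -- the version described as producing wrong results

        *batchCounter = batchSize;
        *warp = *sharedTemp;
    }
}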
