Meteorhead

implicit(?) sync

Discussion created by Meteorhead on Nov 30, 2011
Latest reply on Dec 3, 2011 by Meteorhead
threads conflict in __local

Hi!

I would like to ask something concerning __local storage. Is it possible for a thread to conflict with itself when writing to __local? I have encountered a problem where, according to the algorithm, threads should not be able to conflict while reading and writing each other's memory (I placed sync commands at every corner), yet the simulation data became corrupt. The corruption does not occur if I use atomic_xor() for the __local updates (which without it would be just ^=).

I am quite certain that the algorithm does not allow threads to conflict, but let me ask: how are __local accesses ordered within a single thread, and across threads that are outside a warp but inside the same workgroup?

Code is attached. The reason behind the three updates is that on the simulated 2D lattice, one update consists of updating the site itself and two of its neighbours. Because of the bit coding, one integer holds a 4-by-4 block of lattice sites, and since sites are picked at random, there is no telling at compile time whether self and neighbours reside in the same integer or in different ones; we therefore calculate indexes and shifts with a function. Vector elements are spaced far enough apart that they definitely do not conflict with other vector elements, so a conflict could only happen within lane x, y, z, or w.

inline void updateVector(const uint4 coordsX, const uint4 coordsY,
                         const uint4 flip, __local uint* cache)
{
    uint4 index;
    uint4 shift;
    uint4 mask;
    uint4 notOnDeadBorder;

    notOnDeadBorder = (((coordsX + (uint4)(1u)) == (uint4)(LOCAL_WIDTH_IN_LATTICES)) ||
                       ((coordsY + 1u) == (uint4)(LOCAL_HEIGHT_IN_LATTICES))) ?
                      (uint4)(0u,0u,0u,0u) : (uint4)(3u,3u,3u,3u);

    // Update self
    index = convert_uint4(floor(0.250001f * convert_float4(coordsY % LOCAL_HEIGHT_IN_LATTICES)) * LOCAL_WIDTH_IN_INTS) +
            convert_uint4(floor(0.250001f * convert_float4(coordsX % LOCAL_WIDTH_IN_LATTICES)));
    shift = 30 - (((coordsY % 4u) * 8u) + ((coordsX % 4u) * 2u));
    mask  = ((flip && notOnDeadBorder) ? (uint4)(3u,3u,3u,3u) : (uint4)(0u,0u,0u,0u)) << shift;
    atomic_xor( &cache[index.s0], mask.s0 );
    atomic_xor( &cache[index.s1], mask.s1 );
    atomic_xor( &cache[index.s2], mask.s2 );
    atomic_xor( &cache[index.s3], mask.s3 );

    // Update right neighbour
    index = convert_uint4(floor(0.250001f * convert_float4(coordsY % LOCAL_HEIGHT_IN_LATTICES)) * LOCAL_WIDTH_IN_INTS) +
            convert_uint4(floor(0.250001f * convert_float4((coordsX + 1u) % LOCAL_WIDTH_IN_LATTICES)));
    shift = 30 - (((coordsY % 4u) * 8u) + (((coordsX + 1u) % 4u) * 2u));
    mask  = ((flip && notOnDeadBorder) ? (uint4)(1u,1u,1u,1u) : (uint4)(0u,0u,0u,0u)) << shift;
    atomic_xor( &cache[index.s0], mask.s0 );
    atomic_xor( &cache[index.s1], mask.s1 );
    atomic_xor( &cache[index.s2], mask.s2 );
    atomic_xor( &cache[index.s3], mask.s3 );

    // Update bottom neighbour
    index = convert_uint4(floor(0.250001f * convert_float4((coordsY + 1u) % LOCAL_HEIGHT_IN_LATTICES)) * LOCAL_WIDTH_IN_INTS) +
            convert_uint4(floor(0.250001f * convert_float4(coordsX % LOCAL_WIDTH_IN_LATTICES)));
    shift = 30 - ((((coordsY + 1u) % 4u) * 8u) + ((coordsX % 4u) * 2u));
    mask  = ((flip && notOnDeadBorder) ? (uint4)(2u,2u,2u,2u) : (uint4)(0u,0u,0u,0u)) << shift;
    atomic_xor( &cache[index.s0], mask.s0 );
    atomic_xor( &cache[index.s1], mask.s1 );
    atomic_xor( &cache[index.s2], mask.s2 );
    atomic_xor( &cache[index.s3], mask.s3 );
}
