Bitonic sort is a fast way to sort an array on the GPU. Shared memory works well for small arrays because the threads in one thread group can be synchronized, but there is no way to synchronize across different thread groups. I tried spinning in each thread to wait until the data had been processed, but the result came back filled with zeros. I guess this is not supported on the GPU. Here is part of the code:
void CompareAndExchange(uint fence, uint Top, uint Bottom)
{
    // Spin until both flags have reached the expected fence value
    while (true)
    {
        if (Atomic[Top] == fence && Atomic[Bottom] == fence)
            break;
    }
    // Check the bound before reading SortArray[Bottom]
    if (Bottom < ArraySize && SortArray[Top] > SortArray[Bottom])
    {
        uint Temp = SortArray[Top];
        SortArray[Top] = SortArray[Bottom];
        SortArray[Bottom] = Temp;
    }
    // Increment both flags so the threads waiting on the next step can proceed
    ++Atomic[Top];
    ++Atomic[Bottom];
}
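
For comparison, a commonly used alternative to spinning is to let the dispatch boundary act as the global barrier: the host issues one dispatch per compare-and-exchange pass, so every thread group finishes a pass before the next one starts. GPUs generally do not guarantee that all thread groups are resident and making progress at the same time, which is one reason a cross-group spin-wait can stall. The sketch below is only an illustration under assumptions that are not from the post: the names PassConstants, BlockSize, Stride and BitonicPass are hypothetical, the group size of 256 is arbitrary, and ArraySize is assumed to be a power of two.

```hlsl
// Sketch only: one compare-and-exchange pass of a bitonic sort.
// The host is assumed to run this once per (BlockSize, Stride) pair:
//   for (k = 2; k <= ArraySize; k *= 2)
//     for (j = k / 2; j > 0; j /= 2)
//       set BlockSize = k, Stride = j, then dispatch ceil(ArraySize / 256) groups
// Successive dispatches must be ordered with respect to the buffer
// (for example with a UAV barrier between them in Direct3D 12).

cbuffer PassConstants : register(b0) // hypothetical constant buffer
{
    uint ArraySize; // number of elements, assumed to be a power of two
    uint BlockSize; // size of the bitonic sequences merged in this stage
    uint Stride;    // distance between the two elements being compared
};

RWStructuredBuffer<uint> SortArray : register(u0);

[numthreads(256, 1, 1)]
void BitonicPass(uint3 id : SV_DispatchThreadID)
{
    uint i = id.x;
    uint partner = i ^ Stride;

    // Only the lower index of each pair does the work, so no two
    // threads touch the same element within a pass.
    if (i >= ArraySize || partner >= ArraySize || partner <= i)
        return;

    // Blocks alternate between ascending and descending order.
    bool ascending = (i & BlockSize) == 0;

    uint a = SortArray[i];
    uint b = SortArray[partner];
    if ((a > b) == ascending)
    {
        SortArray[i] = b;
        SortArray[partner] = a;
    }
}
```

Because each pass reads and writes disjoint pairs, no flags or atomics are needed inside the shader. The cost is one dispatch per pass, which is why passes whose Stride fits inside a single thread group are often merged into one shared-memory kernel.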