Discussion created by flavius on Apr 12, 2011
Latest reply on Jul 5, 2011 by tonyo_au
How to implement them?


I have tried to use spinlocks within my kernel code. Similarly to the approach in this article, I have used this code:

void lock(__global int *the_lock)
{
    int occupied = atom_xchg(the_lock, 1);
    while (occupied) {
        occupied = atom_xchg(the_lock, 1);
    }
}

void unlock(__global int *the_lock)
{
    atom_xchg(the_lock, 0);
}

and initialized the spinlock to 0, issued a barrier, and simply locked and unlocked inside the kernel. The code works as it should when run with both global and local work size 1, but with 2 it simply hangs, as if the kernel looped forever. If I lock and unlock the spinlock first under the condition get_local_id() == 0 and then under get_local_id() == 1, it works. With the spinlock in local memory the problem repeats; adding memory fences wherever possible has not helped.
After some time I realized what the problem is: the first work-item successfully locks the spinlock, and the second finds it locked (so far everything is OK). However, because of the SIMD fashion in which instructions are executed, the loop executed by the second work-item also blocks the first one (there is no scheduling as on a CPU, and all work-items execute the same instructions). This causes the first work-item to never unlock the spinlock, and therefore the kernel hangs.
Is this explanation correct? How should I implement the spinlock correctly? I tried to rewrite it this way:

for (;;) {
    if (atom_xchg(the_lock, 1) == 0) {
        // critical section
        atom_xchg(the_lock, 0);
        break;
    }
}
so the unlock code would be inside the if, but it does not work either (God knows what the compiler does with that). What should I do?

To be honest, I have experienced these hangs on an NVIDIA GeForce GTX 580; my ATI Mobility Radeon HD 4500 has no atomics to test with, and I am currently not in possession of another ATI GPU. Do ATI GPUs experience the same problem?

Btw, I know that spinlocks would be a big performance hit, but currently I want them for debugging (to check whether the problem comes from parallelism in particular sections of the kernel). And maybe in the future I will really need them.