Archives Discussions

flavius · ‎04-12-2011

How to implement them?

Hi,

I have tried to use spinlocks within my kernel code. Similary as in this article, I have used this code:

void lock(__global int *the_lock)
{
        int occupied = atom_xchg(the_lock, 1);
        while (occupied) {
                occupied = atom_xchg(the_lock, 1);
        }
}

void unlock(__global int *the_lock)
{
        int occupied = atom_xchg(the_lock, 0);
}

and initialized the spinlock to 0, issued a barrier and simply locked and unlocked inside the kernel. The code works as it should when ran with both global and local workgroup size 1, but with 2 it simply hangs as if the kernel looped forever. If I tried to lock and unlock the spinlock inside condition get_local_id() == 0 and then get_local_id() == 1, it works. With spinlock in local memory the problem repeats, adding some memory fences where it is possible has not helped.
After some time I have realized what's the problem: The first work-item succesfully locks the spinlock, the second finds it locked (so far everything is OK). However, because of the SIMD fashion of instructions execution, the cycle executed by second work-item blocks also the first work-item (because there is no scheduling as on CPU and all work-items execute the same instructions). This causes the first work-item to never unlock the spinlock and therefore it hangs.
Is this explanation correct? How should I implement the spinlock correctly? I tried to rewrite it this way:

for(; {
    if (atom_xchg(the_lock, 1) == 0) {
        // critical section
        atom_xchg(the_lock, 0);
        break;
    }
}

so the unlock code would be inside the if but it does not work either (god knows what the compiler does with that). What should I do?

To be honest, I have experienced these hangs with NVidia GeForce GTX 580, but on my ATI Radeon Mobile 4500 HD I don't have atomics to test and currently I am not in possession of another ATI GPU. Do ATI GPUs experience the same problem?

Btw, I know that spinlocks would be a great performance dropdown, but currently I'd like them for debugging (to try if the problem comes from parallelism in particular sections of kernel). And maybe in future I could really need them.

Jawed · ‎04-12-2011

As you correctly observed, the spinlock has to be set once by a work group.

In this scenario you simply want to test if the local ID returned by get_local_id == 0. If your work group is 2- or 3-dimensional then you need to test that it == 0 in all dimensions.

Once you know you are the zeroth work item in a work group, you can safely manipulate the atomic.

Alternatively you can make each work item in a work group have a private lock. Simply use the work item's local ID as an index into a buffer of locks, i.e. the_lock + local_id. Again, if you have 2- or 3-dimensional work groups then you should linearise these into a single, scalar, offset to add to the the_lock pointer.

I've not implemented any of this kind of stuff, but I suspect this second solution is "faster" because it has less control flow.

flavius · ‎04-12-2011

Maybe we haven't understood ourselves correctly, but I don't want only the first workitem in workgroup to enter the critical section, I want that every workitem can enter it, but at one moment only single workitem can execute it.

The get_local_id() == 0 stuff should only show that the implementation is correct if the workitems are not competing for the spinlock, actually I'd like to use it in a competition (otherwise the spinlock wouldn't be required, of course).

What would be separate lock for every work item good for? For spinlocks, the ability to be forced to wait is essential. Single critical section (on single data) means I need single lock to be guarding it.

Jawed · ‎04-12-2011

Ah, sorry, I misunderstood.

Then you want something like:

http://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm

I can't give an insight into per work-item mutex I'm afraid.

flavius · ‎04-12-2011

Originally posted by: Jawed Ah, sorry, I misunderstood.

Then you want something like:

http://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm

I can't give an insight into per work-item mutex I'm afraid.

There is still one problem in the algorithm: the customers are required to WAIT. If I try to wait (actively in a loop) on GPU, others (who could execute from the logically parallel point of view, but not because of HW limitations) will be stalled. Or at least this is my explanation why the spinlock from my first post does not work, the algorithm does not evade it.

Maybe this is wrong explanation (because the second proposed algorithm from my first post does not work neither), but I haven't figured out anything better.

nou · ‎04-12-2011

just idea.

for(int i=0;i<get_global_size(0);i++) { if(i==get_global_id(0)) { //critical section } barrier(); }

flavius · ‎04-12-2011

2 nou:

That would work, of course, but it is not a good solution: The critical section would take O(N) time (where N is the number of work-items) even if I need to synchronize two of the work-items - the time is proporional to the total number of work-items, not just the work-items asking for the lock.

Imagine that I'd have generated a vector x of N indices to another vector y of length M (much larger than N). I have another vector z of locks for each element in vector y (with length M). I would like to somehow update y[x[get_local_id(0)]]. Because the M is much larger than N, the probability of collision (x[ i ] == x [ j ]) is low, so if the spinlocks worked correctly, There would be probably only little overhead (but the collision is possible, therefore we need to use locks). However, with your solution, we would have to serialize all the updates, what would be slow and we would lose all the advantages of mass-parallel GPUs.

flavius · ‎04-12-2011

Heureka! I've got it...

int waiting = 1;
while (waiting) {
    if (try_lock(the_lock)) {
    // critical section
        unlock(the_lock);
        waiting = 0;
    }
}

This is working on current version of NVidia's compiler, because I've forced the program to really execute the critical section first. However, the compiler could optimize some jumps inside so that it actually breaks.

yurtesen · ‎06-15-2011

I tried the same code from the article mentioned in the 1st post and it hanged as well. It works when ran on CPU though 🙂

tonyo_au · ‎07-04-2011

I have tried this spin lock code, it works if you only try to use the lock once per work group (eg local workgroup id == 0); otherwise it hangs the machine.

I don't believe it should hang the machine

notzed · ‎07-05-2011

I don't think spinlocks at the thread level are really very useful - since gpu threads aren't independent.

In many cases you can probably find other solutions which will work better and probably be easier to code anyway.

1. Map the work 1:1 to each thread, so they're all busy, but instead of writing the result to the target, write it to an auxilliary array together with the real index. Another kernel then tallies them up - even if it has to be serial it does so little work the inefficiency isn't too costly and you probably win by making the first kernel so simple and completely parallel (and if the device supports multiple queues and concurrent execution, it need not be inefficient either).

2. Generate a histogram of target indices. Loop on those, or otherwise serialise multiples if they occur.

2a. (for example) Use the histogram counter to decide which 'batch' is used to calculate the result and generate a work matrix based on that.

3. If generating 'n' results where 'n' is data dependent, use an atomic counter to generate a result index if you need to write a result, and then use that to index the result array. Use a separately tracked limit if you need to limit the array output size.

4. Use integer atomics for accumulation/whatever. Possibly storing fixed-point values.

5. It still seems like the xchg thing should let you do something like this, but instead of using it to determine a branch, just use it to decide whether to store the result or not but calculate asssuming you can. unconditionally unlock afterwards.

The problem with 5 is that you still have the task of working out whether all the threads in the workgroup actually completed their work, and if they haven't looping again for the (presumably) few that haven't (perhaps multiple times?). That can be a pain (e.g. if anywhere inside the loop there is a barrier), and it probably can't be done very efficiently, and worst-case is serial with lots of overheads.

LeeHowes · ‎07-05-2011

This is why I never use the word "thread" to describe a work item in a talk. People need to be very careful to remember that something like a spin lock will ever only work at the real thread level - ie the wavefront.

Remember, too, that it's trivial to order your work items within a wavefront. So if you do a spin lock whereby work item 0 in each wave accesses the lock, when that work item gets access to the critical section the entire wave does. At that point you can serialise within the wave:

if( get_local_id(0) == 0 ) {...}

if( get_local_id(0) == 1 ) {...}

if( get_local_id(0) == 2 ) {...}

etc (or with a loop or however else you do it). You know that two work items in the same wave can't be in the same code at once so you're safe.

Even this relies on knowing how the work items are going to map to hardware, though. You probably need to add fences to make sure the compiler doesn't re-order that code thinking it's independent, and you need to make sure your work item 0 from the wave always makes sense given varying wave sizes. So you probably want to launch a kernel with 32 work items per lock for NVIDIA and Cedar, 64 for higher end AMD, and then whatever is appropriate depending on how other compilers target the vector architectures. Obviously doing it by work group is then cleaner than breaking a given group into waves, and you can use barriers to make the above sequence cleaner, but of course then your locks can only be global and you can't do any LDS locking and communication.

tonyo_au · ‎07-05-2011

Thanks for that extra info Lee. I only did it as an exercise to understand locking a bit better. It is nice to know why you can only have one lock per wavefront.

If I have 3 things to update inside the locked code, is there any advantage in having 3 locks, one around each item to update - or am I complicating things for no real gain.

Tony

tonyo_au · ‎07-05-2011

A little off topic; but Lee you seem to understanding locking - if you have time could you look at the problem I have under "help with locks and global memory" and explain what I am doing wrong

LeeHowes · ‎07-05-2011

That depends on the granularity of access. If you're always updating all three in one go then probably you want to leave it coarse grained. If you might update one or other unpredictably then maybe it makes sense to make the locks more fine grained. The question is whether you're going to gain any concurrency (and hence parallelism over multiple cores) from finer grained locking. if not because they'd all hit the first first and then the second you'd just gain overhead from the extra locks.

Even if you do gain concurrency, the overhead might outweigh it.

Choosing the right locking granularity is never easy, unfortunately.

In answer to your question I'll take a look if I get a chance. I have 13 book chapters to finalise in the next 48 hours so it might be tight. 🙂 "Heterogeneous Computing with OpenCL", it comes highly recommended by me 😉

tonyo_au · ‎07-05-2011

Thanks Lee,

your book looks interesting - I look forward to getting it when it is released