I'm banging my head against the wall over a couple of ticket-lock algorithms and I wonder if any of the gurus here can offer advice.
Currently I have two bit-lock and four ticket-lock implementations, all in C under Linux on a Ryzen 5700U. Only one is working, and I'm not all that confident in it. Some of the implementations error out; for instance, I detect the condition where a shared acquisition finds the lock in the exclusive state on release. At least two implementations appear to function correctly, but I'm losing updates to the shared protected resource (SPR) in the test program. That is, if the test program increments a protected variable in the critical section, then 1M iterations across 8 threads should give a final value of 8M. Instead I get results like 7989210. Occasionally there is no loss.
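In essence each writer thread does the following (a minimal sketch against the t1lock API shown further down; the real harness lives in the secondary modules mentioned below):

#include <pthread.h>

#define NR_WTHREADS 8
#define NR_SAMPLES  1000000UL

static t1lock lock = T1LOCK_INITIALIZER;   /* lock under test */
static unsigned long value;                /* the protected variable (SPR) */

static void *wr_thread(void *unused)
{
    (void)unused;
    for (unsigned long i = 0; i < NR_SAMPLES; i++) {
        t1lock_acquire(&lock);
        value++;                           /* critical section: no update should ever be lost */
        t1lock_release(&lock);
    }
    return NULL;
}
/* main() launches NR_WTHREADS of these, joins them, and checks
   value == NR_WTHREADS * NR_SAMPLES. */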
My working theory is that complete updates to the cacheline are somehow being lost in the contended case.
The one that appears to work is a single-bit lock which essentially loads a 32-bit word, sets the high bit, and then does a cmpxchg to acquire, etc. It is implemented in C. The other single-bit lock is pretty much the same thing, but it has most of the critical cmpxchg update loop in assembler. It loses updates as described above.
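The working bit lock is roughly this shape (a from-memory sketch using GCC's __atomic builtins, not my exact code):

#include <stdint.h>
#include <stdbool.h>

#define BIT_LOCK (1u << 31)                /* high bit of the word is the lock bit */

static inline bool bitlock_try_acquire(uint32_t *word)
{
    uint32_t old = __atomic_load_n(word, __ATOMIC_RELAXED);
    if (old & BIT_LOCK)
        return false;                      /* already held */
    /* strong CAS: set the high bit if the word is otherwise unchanged */
    return __atomic_compare_exchange_n(word, &old, old | BIT_LOCK, false,
                                       __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}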
Two of the locks are different implementations of the same many-reader-one-writer ticket lock. In both cases the lock is corrupted within about 2000 iterations. The simpler implementation runs without corrupting the lock if I use the working one-bit mutex as part of the solution, but it still loses updates as previously described.
Included here is the principal code for another implementation; however, I cannot include the test program as it relies on several secondary code modules. I can make a tar file available on request, but it's 12K lines of code or so. Another wrinkle is that it is not FOSS; it's unreleased commercial OSS.
#define ASM_ATOMIC_CMPXCHGW(ptr, old, new, result)                             \
    asm __volatile__ (                                                         \
        "xor %[r], %[r] \n\t"              /* result = 0 */                    \
        "movw %[o], %%ax \n\t"             /* ax = expected value */           \
        "lock cmpxchgw %[n], (%[p]) \n\t"  /* if (*ptr == ax) *ptr = new */    \
        "jne 0f \n\t"                      /* ZF clear: compare failed */      \
        "xor $1, %[r] \n\t"                /* result = 1 on success */         \
        "0: \n\t"                                                              \
        : [r] "=&r" (result),                                                  \
          [o] "+r" (old),                                                      \
          [p] "+r" (ptr)                                                       \
        : [n] "r" (new)                                                        \
        : "cc", "memory", "%ax")
The previous macro is what "arch_atomic_cmpxchg16b" expands to. If I substitute a compiler intrinsic, there is no change in behavior.
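For concreteness, the intrinsic substitution was essentially this (a sketch from memory; __atomic_compare_exchange_n is the standard GCC builtin, the wrapper shape is illustrative):

static inline bool arch_atomic_cmpxchg16b(u16 *ptr, u16 old, u16 new)
{
    /* Strong CAS, sequentially consistent ordering; returns true if
       *ptr matched 'old' and was replaced by 'new'. */
    return __atomic_compare_exchange_n(ptr, &old, new, false,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}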
typedef struct {
    u16 ticket;   /* now-serving number; bumped by release */
    u16 queue;    /* last ticket handed out; bumped by acquire */
} t1lock;

#define T1LOCK_INITIALIZER (t1lock) { \
    .ticket = 1,                      \
    .queue = 0,                       \
}
__inline__ void t1lock_release(t1lock * oo)
{
    oo->ticket++;   /* plain (non-atomic) 16-bit increment; only the holder writes ticket */
    return;
}
attr_noinline void t1lock_relax(u32 nr_spin)
{
    if_l (nr_spin < T1LOCK_DEFAULT_PAUSE_THRESH) {
        arch_cpu_pause();   /* brief spin: PAUSE hint */
        return;
    }
    if_l (nr_spin < T1LOCK_DEFAULT_YIELD_THRESH) {
        sched_yield();      /* moderate contention: yield the CPU */
        return;
    }
    us_sleep(1);            /* heavy contention: sleep 1 us */
    return;
}
inline bool t1lock_cmpxchg(t1lock * oo, u16 old, u16 new)
{
    return arch_atomic_cmpxchg16b(&oo->queue, old, new);
}
u32 t1lock_acquire(t1lock * oo)
{
    t1lock old, new;
    u32 nr_spin = 0;
    u16 ticket;

spin:
    nr_spin++;
    if_un (nr_spin == T1LOCK_DEADLOCK_THRESH) {
        emit_t1lock_struct(stdout, oo);
        notice("t1lock %p: possible deadlock detected.", oo);
    }
    old = new = *((volatile t1lock *) oo);          /* snapshot both fields */
    if_un ((old.queue + 1) == (old.ticket - 1)) {   /* ticket space exhausted: back off */
        t1lock_relax(nr_spin);
        goto spin;
    }
    ticket = ++new.queue;                           /* claim the next ticket number */
    if_un (!t1lock_cmpxchg(oo, old.queue, new.queue)) {
        t1lock_relax(nr_spin);                      /* lost the CAS race: retry */
        goto spin;
    }
wait:
    if (ticket != ((volatile t1lock *) oo)->ticket) {   /* not our turn yet */
        t1lock_relax(nr_spin);
        nr_spin++;
        if_un (nr_spin == T1LOCK_DEADLOCK_THRESH) {
            printf("[%u] Ticket: %u\n", self->tid, ticket);
            emit_t1lock_struct(stdout, oo);
            notice("t1lock %p: possible deadlock detected.", oo);
        }
        goto wait;
    }
    return nr_spin;
}
Typical output for the test program (at 400 MHz, as my CPU fan is borked):
Launched 2 writer threads.
Elapsed time: 1.3768 s
Transaction latency: 688.411ns
2 :wr_thread avg_nr_spin 1.001726 min 0 avg 569 max 164659
3 :wr_thread avg_nr_spin 1.001798 min 0 avg 566 max 150043
[1]:5ff72ca78fdc7:main() obj.value = 1999417; (nr_samples * nr_wthreads) = 2000000
Aborted
From the above we can see that the lock is contended and that we lose 583 of the expected 2000000 updates, just under 600 transactions.
My understanding is that the above is basically a bog-standard ticket-lock algorithm and should work. It's as if the LOCK prefix were being ignored. However, if I remove the LOCK prefix from the cmpxchg instruction, the test program deadlocks.
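For reference, the bog-standard shape I have in mind is, in C11 atomics (purely illustrative, not my actual code):

#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    atomic_ushort next;      /* next ticket to hand out */
    atomic_ushort serving;   /* ticket currently being served */
} ref_tlock;

static void ref_tlock_acquire(ref_tlock *l)
{
    /* atomically take a ticket number */
    uint16_t me = atomic_fetch_add_explicit(&l->next, 1, memory_order_relaxed);
    /* spin until it's our turn; acquire pairs with the release below */
    while (atomic_load_explicit(&l->serving, memory_order_acquire) != me)
        ;
}

static void ref_tlock_release(ref_tlock *l)
{
    /* only the holder writes 'serving'; release ordering keeps
       critical-section stores from sinking past the unlock */
    uint16_t s = atomic_load_explicit(&l->serving, memory_order_relaxed);
    atomic_store_explicit(&l->serving, (uint16_t)(s + 1), memory_order_release);
}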
A different ticket-lock implementation collects statistics, such as the number of acquisitions and releases, the number of cmpxchg failures, etc. Rarely, I see the statistics corrupted, despite the fact that the stats structure is instantiated in TLS and is updated only after the operation (acquire/release) completes. In those cases it looks as if the CPU becomes genuinely confused for a moment, but not enough to outright crash. I should note that for this implementation the lock structure packs 5 fields into a 32-bit word.
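The per-thread stats block is roughly of this shape (field names are illustrative; the real structure is in the secondary modules):

#include <stdint.h>

/* One instance per thread via TLS, so no other thread should ever
   write to it; that is what makes the corruption so odd. */
static __thread struct {
    uint64_t nr_acquire;        /* successful acquisitions */
    uint64_t nr_release;        /* releases */
    uint64_t nr_cmpxchg_fail;   /* failed CAS attempts */
} t1lock_stats;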
But just looking at the t1lock code above, I fail to see a problem. I'm hoping someone can spot some subtle thing I'm doing wrong, or can offer advice on how I might narrow down the problem.
Your post seems quite technical in nature. I needed to google what a ticket lock was.
From GitHub, concerning ticket spinlocks and their implementation: https://mfukar.github.io/2017/09/08/ticketspinlock.html
I found this Linux mailing-list thread about a ticket-lock patch: https://lore.kernel.org/linux-riscv/CAAhSdy1JHLUFwu7RuCaQ+RUWRBks2KsDva7EpRt8--4ZfofSUQ@mail.gmail.c...
The above links are just to check whether this is related to the ticket-lock issue you are having with Linux.
GitHub would be a good place to get help, since it is made for developers and programmers.
Try posting your thread at the AMD Developer forums and see where the moderators assign it from here: https://community.amd.com/t5/newcomers-start-here/bd-p/newcomer-forum
Unfortunately, you probably need to know some assembler to understand my query. The link you provided relates to the Linux kernel, which is a program that sometimes uses ticket locks.
Actually, I now notice that I had assumed I'd been approved for the dev forums quickly and thought I was already posting in one. Hopefully one of the mods gets around to it soon.