EPYC Discussions

soq · 07-04-2023

I'm banging my head against the wall over a couple of ticket-lock algorithms and getting inconsistent results.

Currently I have two 'bit-lock' and four ticket-spinlock implementations, written in C under Linux and running on a 5700U. Only one algorithm is working, and I'm not all that confident in it. Some of the implementations error out; for instance, I detect the condition where a shared acquisition finds the lock in the exclusive state on release. At least two implementations appear to function correctly, but I'm losing updates to the SPR in the test program. That is, if the test program increments a protected variable in the critical section, then 1M iterations for 8 threads should show a final value of 8M. Instead I get results similar to those shown below the code listing.

My working theory is that under some conditions, two threads will 'simultaneously' succeed, resulting in a loss of one thread's changes. Or something substantially similar. In the abstract, the guarantees for locked instructions are being violated.
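To make the invariant concrete, here is a minimal sketch of the kind of counting test described above. It is not the actual test program: it uses a pthread mutex as a stand-in for the lock under test, and the names `demo_lock`, `demo_value`, `demo_writer` and `demo_run` are made up for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical harness: names are illustrative, not from the real test program. */
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t demo_value;          /* the protected variable */

static void *demo_writer(void *arg)
{
    uint64_t nr_iter = (uint64_t)(uintptr_t)arg;

    for (uint64_t i = 0; i < nr_iter; i++) {
        pthread_mutex_lock(&demo_lock);     /* stand-in for the lock under test */
        ++demo_value;                       /* critical section */
        pthread_mutex_unlock(&demo_lock);
    }
    return NULL;
}

/* Runs nr_threads writers and returns the final counter value.
   With a correct lock the result is nr_threads * nr_iter. */
uint64_t demo_run(unsigned nr_threads, uint64_t nr_iter)
{
    assert(nr_threads > 0);

    pthread_t *threads = malloc(nr_threads * sizeof(*threads));
    if (!threads)
        abort();

    demo_value = 0;
    for (unsigned i = 0; i < nr_threads; i++)
        if (pthread_create(&threads[i], NULL, demo_writer,
                           (void *)(uintptr_t)nr_iter) != 0)
            abort();
    for (unsigned i = 0; i < nr_threads; i++)
        pthread_join(threads[i], NULL);

    free(threads);
    return demo_value;
}
```

Calling `demo_run(8, 1000000)` should return 8000000 if the lock provides mutual exclusion; any smaller value means increments were lost.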

The one that appears to work is a single-bit lock which essentially loads a 32-bit word, sets the high bit, and then does a cmpxchg to acquire, etc. It is implemented in C. The other single-bit lock is pretty much the same thing, but it has most of the critical cmpxchg update loop in assembler. It loses updates as described above for the ticket locks.
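For readers who want something concrete, the single-bit scheme can be sketched roughly as below. This is an illustration using C11 `<stdatomic.h>` rather than the actual implementation, and `bitlock`, `BITLOCK_HELD` and the function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define BITLOCK_HELD 0x80000000u   /* high bit of the 32-bit word */

typedef struct { _Atomic uint32_t word; } bitlock;   /* hypothetical name */

/* One acquisition attempt: load the word, set the high bit, CAS it in. */
bool bitlock_try_acquire(bitlock *l)
{
    uint32_t old = atomic_load_explicit(&l->word, memory_order_relaxed);

    if (old & BITLOCK_HELD)
        return false;                       /* already held */

    return atomic_compare_exchange_strong_explicit(
        &l->word, &old, old | BITLOCK_HELD,
        memory_order_acquire, memory_order_relaxed);
}

void bitlock_acquire(bitlock *l)
{
    while (!bitlock_try_acquire(l))
        ;   /* a real lock would pause or yield here */
}

void bitlock_release(bitlock *l)
{
    /* Clear the high bit; release ordering publishes the critical section. */
    atomic_fetch_and_explicit(&l->word, ~BITLOCK_HELD, memory_order_release);
}
```

The low 31 bits are left untouched by both operations, so they remain available for other state.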

Two of the locks are different implementations of the same many-reader-one-writer ticket spinlock. In both cases the lock is corrupted within about 2000 iterations. The simpler implementation appears to work if I use the successful one-bit-mutex as part of the solution, but loses updates as previously described.

Included here is the principal code for another implementation; however, I cannot include the test program as it relies on several secondary code modules. I can make a tar file available on request, but it's 12K lines of code or so. Another wrinkle is that it is not FOSS, it's just unreleased commercial OSS.

#define ASM_ATOMIC_CMPXCHGW(ptr, old, new, result) \
asm __volatile__ ( \
"xor %[r], %[r]             \n\t" \
"movw %[o], %%ax            \n\t" \
"lock cmpxchgw %[n], (%[p]) \n\t" \
"jne 0f                     \n\t" \
"xor $1, %[r]               \n\t" \
"0:                         \n\t" \
: [r] "=&r" (result), \
  [o] "+r" (old), \
  [p] "+r" (ptr) \
: [n] "r" (new) \
: "cc", "memory", "%ax")

The previous macro is assigned to "arch_atomic_cmpxchg16b".  If I substitute a compiler intrinsic there is no change.
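For illustration, an intrinsic-based equivalent could look like the sketch below. It assumes the GCC/Clang `__atomic_compare_exchange_n` builtin, which may not be the intrinsic actually substituted; `cmpxchg16_builtin` is a made-up name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if *ptr held `old` and was atomically replaced by `new_val`,
   the same contract as the `result` output of the asm macro. */
bool cmpxchg16_builtin(uint16_t *ptr, uint16_t old, uint16_t new_val)
{
    return __atomic_compare_exchange_n(ptr, &old, new_val,
                                       false,              /* strong CAS */
                                       __ATOMIC_SEQ_CST,   /* ordering on success */
                                       __ATOMIC_SEQ_CST);  /* ordering on failure */
}
```

On x86-64 this is expected to compile to a `lock cmpxchgw`, much like the hand-written macro.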

typedef struct {
  u16 ticket;
  u16 queue;
} t1lock;

#define T1LOCK_INITIALIZER (t1lock) { \
  .ticket = 1, \
  .queue = 0, \
}

__inline__ void t1lock_release(t1lock * oo)
{
  oo->ticket++;

  return;
}


attr_noinline void t1lock_relax(u32 nr_spin)
{
if_l (nr_spin < T1LOCK_DEFAULT_PAUSE_THRESH) {
  arch_cpu_pause();
  return;
}

if_l (nr_spin < T1LOCK_DEFAULT_YIELD_THRESH) {
  sched_yield();
  return;
}

us_sleep(1);

return;
}



inline bool t1lock_cmpxchg(t1lock * oo, u16 old, u16 new)
{
  return arch_atomic_cmpxchg16b(&oo->queue, old, new);
}


u32 t1lock_acquire(t1lock * oo)
{
t1lock old, new;
u32 nr_spin = 0;
u16 ticket;

spin:
  nr_spin++;

  if_un (nr_spin == T1LOCK_DEADLOCK_THRESH) {
    emit_t1lock_struct(stdout, oo);
    notice("t1lock %p: possible deadlock detected.", oo);
  }

  old = new = *((volatile t1lock *) oo);

  if_un ((old.queue + 1) == (old.ticket - 1)) {
    t1lock_relax(nr_spin);
    goto spin;
  }

  ticket = ++new.queue;

  if_un (!t1lock_cmpxchg(oo, old.queue, new.queue)) {
    t1lock_relax(nr_spin);
    goto spin;
  }

wait:
  if (ticket != ((volatile t1lock *) oo)->ticket) {
    t1lock_relax(nr_spin);
    nr_spin++;

    if_un (nr_spin == T1LOCK_DEADLOCK_THRESH) {
      printf("[%u] Ticket: %u\n", self->tid, ticket);
      emit_t1lock_struct(stdout, oo);
      notice("t1lock %p: possible deadlock detected.", oo);
    }

    goto wait;
  }

  return nr_spin;
}

Typical output for the test program is: (At 400MHz as my CPU fan is borked)

Launched 2 writer threads.
Elapsed time: 1.3768 s
Transaction latency: 688.411ns
2 :wr_thread avg_nr_spin 1.001726 min 0 avg 569 max 164659
3 :wr_thread avg_nr_spin 1.001798 min 0 avg 566 max 150043
[1]:5ff72ca78fdc7:main() obj.value = 1999417; (nr_samples * nr_wthreads) = 2000000

Aborted

From the above we can see that the algorithm is contended and we lose just less than 600 transactions.

My understanding is that the above is basically a bog-standard ticket lock algorithm and should work. It's as if the LOCK prefix is being ignored in some circumstances. However, note that if I remove the lock prefix from the cmpxchg instruction the test program deadlocks.
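For comparison, a textbook ticket lock is sketched below using C11 atomics. It is not the code above: it assumes `<stdatomic.h>`, takes the ticket with a fetch-and-add instead of a cmpxchg loop, and spells out acquire/release ordering explicitly. The names `ticket_lock`, `next` and `serving` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    _Atomic uint16_t next;      /* next ticket to hand out */
    _Atomic uint16_t serving;   /* ticket currently allowed in */
} ticket_lock;                  /* hypothetical name */

/* Takes a ticket, waits for its turn, and returns the ticket number. */
uint16_t ticket_lock_acquire(ticket_lock *l)
{
    /* One atomic fetch-and-add hands out a unique ticket. */
    uint16_t me = atomic_fetch_add_explicit(&l->next, 1, memory_order_relaxed);

    /* Acquire ordering keeps the critical section after this load. */
    while (atomic_load_explicit(&l->serving, memory_order_acquire) != me)
        ;   /* a real lock would pause or yield here */

    return me;
}

void ticket_lock_release(ticket_lock *l)
{
    /* Only the holder writes `serving`, so a load plus a store is enough;
       release ordering keeps the critical section before this store. */
    uint16_t cur = atomic_load_explicit(&l->serving, memory_order_relaxed);
    atomic_store_explicit(&l->serving, (uint16_t)(cur + 1), memory_order_release);
}
```

Running the same counting test against a variant like this could help show whether the lost updates follow the hardware or the particular implementation.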

A different ticket lock implementation collects statistics, such as the number of acquisitions and releases, number of cmpxchg failures, etc. Rarely I see the statistics corrupted despite the fact that the stats structure is instantiated in TLS and is updated after the operation (acquire or release) completes. The result seen is consistent with the occasional total loss of memory stores between an acquisition operation and the subsequent release.

Just looking at the above code, I fail to see a problem. I'm hoping someone can spot some subtle thing I'm doing wrong or can offer advice on how I might narrow-down the problem more finely.

But if I have to go out on a limb, I think I'm going to need an RMA number from HP for my laptop, and not just for the fan.

(Edits for clarity.)

soq · 07-11-2023

Hello again,

I have found something interesting. I went and made a test program for POSIX mutexes lacking external dependencies and found that the Linux POSIX mutex facility functions properly on my system. This wasn't unexpected since my OS generally doesn't crash with any frequency under normal use. However I made one significant change to the subthread function. In the wr_thread() function the critical section included the following statement after lock acquisition: "temp = ++obj.value" (You can see the equivalent in the code included below.) The assignment is technically unnecessary and as far as I can tell merely results in a second write to the cacheline containing both 'temp' and 'obj':

$ objdump -t t2lock_test
[snip]
0000000000000000       F *UND*  0000000000000000              pthread_self@GLIBC_2.2.5
00000000000040c8 g     O .bss   0000000000000008              temp
00000000000040a0 g     O .data  0000000000000010              obj
0000000000000000       F *UND*  0000000000000000              pthread_join@GLIBC_2.2.5
[snip]

The sample array necessarily lives in other pages, so the critical section is making writes to two different cachelines whether or not "BROKEN" is defined at compile-time.

So I changed it to "++obj.value" in the t1lock_test program. Much to my surprise the code now seems to function properly. Making this change in another test program for a different lock does not, however, result in the same success. I've not yet figured out why this change causes the program to succeed (or at least not fail with any frequency), but it does provide an additional avenue for analysis. My working theory now includes the possibility of a compiler bug, but I realize this is almost as unlikely as a hardware bug. In doing all this I noticed that my CMPXCHG assembler macro had the pointer listed as an output operand, but changing it to an input operand does not seem to affect anything.

I have re-written the t1lock_test program such that it no longer has any external dependencies and thus I can post a complete program. I am very interested to know whether this new program reproduces the bug I am experiencing for others. It is not as readable as I would like, but I wanted to make it as short as possible.

/*
   FILE:    t2lock_test.c

   COMPILE:

      gcc -pthread -std=c99 -Wall -Wextra -Werror -m64 -O3 -march=znver2 -ggdb \
                -o t2lock_test t2lock_test.c  -lrt

      Add -DBROKEN to generate a failing test and remember to set the architecture for your machine.

*/
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>
#include <unistd.h>
#include <stdbool.h>

#if !defined T1LOCK_PAUSE_THRESH
#define T1LOCK_PAUSE_THRESH   10
#endif

typedef unsigned long long u64;
typedef unsigned int u32;
typedef unsigned short u16;

#define expect_false(e)    __builtin_expect((e), 0)
#define expect_true(e)     __builtin_expect((e), 1)
#define if_un(e)           if (expect_false(e))
#define if_l(e)            if (expect_true(e))

#define die(fmt, arg...)   { fprintf(stdout, "%p:%s:  " fmt, \
                                     (void *) pthread_self(), __func__, ## arg); abort(); }

#define ASM_CPU_PAUSE        __asm__ volatile ("pause     \n\t" ::: )

typedef struct {
   u64   count;
   u64   cpu;
} tsc_timestamp;

#define ASM_CPU_RDTSCP(value, cpu)   __asm__ volatile (                    \
                                        "rdtscp                    \n\t"   \
                                        "shl    $0x20, %%rdx       \n\t"   \
                                        "or     %%rax, %%rdx       \n\t"   \
                                        "movq   %%rdx, %[v]        \n\t"   \
                                        "movq   %%rcx, %[c]        \n\t"   \
                                        : [v] "=g" (value),                \
                                          [c] "=g" (cpu)                   \
                                        :                                  \
                                        : "%rax", "%rcx",                  \
                                          "%rdx", "memory", "cc")

#define BEGIN_TSC_TIMING                  \
   {                                      \
      tsc_timestamp ts1;                  \
      tsc_timestamp ts2;                  \
      ASM_CPU_RDTSCP(ts1.count, ts1.cpu); \

#define END_TSC_TIMING(result)            \
      ASM_CPU_RDTSCP(ts2.count, ts2.cpu); \
      if(ts1.cpu != ts2.cpu)              \
         result = 1;                      \
      else                                \
         result = ts2.count - ts1.count;  \
   }

#define ASM_ATOMIC_CMPXCHGW(ptr, old, new, result) __asm__ volatile (      \
                           "xor       %[r], %[r]               \n\t"       \
                           "movw      %[o], %%ax               \n\t"       \
                      "lock cmpxchgw  %[n], (%[p])             \n\t"       \
                           "jne  0f                            \n\t"       \
                           "xor  $1,  %[r]                     \n\t"       \
                           "0:                                 \n\t"       \
                           : [r] "=&r" (result),                           \
                             [o] "+r" (old)                                \
                           : [p] "r" (ptr),                                \
                             [n] "r" (new)                                 \
                           : "cc", "memory", "%ax")

typedef struct {
   u16 ticket;
   u16 queue;
} t1lock;

#define T1LOCK_INITIALIZER (t1lock) {     \
   .ticket = 1,                           \
   .queue = 0                             \
}

__inline__ void   t1lock_release(t1lock * oo)
{
   oo->ticket++;

   return;
}

__attribute__((noinline)) void   t1lock_relax(u32  nr_spin)
{
   if_l (nr_spin < T1LOCK_PAUSE_THRESH) {
      ASM_CPU_PAUSE;
      return;
   }

   if_l (nr_spin < 65536) {
      sched_yield();
      return;
   }

   if_un (nr_spin == 65536)
      fprintf(stdout, "Thread may be deadlocked.\n");

   sleep(1);

   return;
}

inline bool t1lock_cmpxchg(t1lock * oo, u16 old, u16 new)
{
   u32   result;

   ASM_ATOMIC_CMPXCHGW(&oo->queue, old, new, result);

   return result;
}

u32   t1lock_acquire(t1lock * oo)
{
   t1lock old, new;
   u32   nr_spin = 0;
   u16   ticket;

spin:
   nr_spin++;

   old = new = *((volatile t1lock *) oo);

   if_un ((old.queue + 1) == (old.ticket - 65400)) {
      t1lock_relax(nr_spin);
      goto spin;
   }

   ticket = ++new.queue;

   if_un (!t1lock_cmpxchg(oo, old.queue, new.queue)) {
      t1lock_relax(nr_spin);
      goto spin;
   }

wait:
   if (ticket != ((volatile t1lock *) oo)->ticket) {
      t1lock_relax(nr_spin);
      nr_spin++;
      goto wait;
   }

   return nr_spin;
}

u64   timestamp(void) /* microseconds */
{
   struct timespec ts;
   clock_gettime(CLOCK_MONOTONIC, &ts);
   return (ts.tv_sec * 1000000000 + ts.tv_nsec) / 1000;
}

typedef struct {
   t1lock   lock;
   u64      value;
} spr_t;

spr_t    obj = { .lock = T1LOCK_INITIALIZER, .value = 0 };

volatile u64   temp;  /* Unprotected scribble var */

volatile int   start_flag = 0;

typedef struct { 
   u32   cycles;
   u32   nr_spin;
} sample_t;

void * wr_thread(void * data)
{
   u64      T;
   sample_t *samples;
   u32      nr_rounds, i, nr_spin;

   nr_rounds = (u64) data;

   samples = malloc(nr_rounds * sizeof(sample_t));
   if (!samples)
      die("Unable to allocate sample buffer - %m");

   while (!start_flag)
      ASM_CPU_PAUSE;

   for (i = 0; i < nr_rounds; i++) {
      BEGIN_TSC_TIMING;
      nr_spin = t1lock_acquire(&obj.lock);
#if defined BROKEN
      temp = ++obj.value;
#else
      ++obj.value;
#endif
      t1lock_release(&obj.lock);
      END_TSC_TIMING(T);

      if (T <= 28)   /* rdtscp is usually about 28 cycles */
         T = 1;
      else
         T -= 28;

      samples[i].cycles = T;
      samples[i].nr_spin = nr_spin;
   }

   return (void *) samples;
}

void  process_thread_result(sample_t * samples, u32 nr_samples)
{
   u64      max = 0;
   u64      avg = 0;
   double   nr_spin = 0.0;
   u32      max_nr_spin = 0;
   u32      i;

   if (!nr_samples || !samples)
      die("Usage.");

   for (i = 0; i < nr_samples; i++) {
      if (max < samples[i].cycles)
         max = samples[i].cycles;

      if (max_nr_spin < samples[i].nr_spin)
         max_nr_spin = samples[i].nr_spin;

      avg += samples[i].cycles;
      nr_spin += samples[i].nr_spin;
   }

   avg /= nr_samples;
   nr_spin /= nr_samples;

   fprintf(stdout, "%p:  avg: %-4qu   max: %-4qu   nr_spin_avg: %2.4f   max_nr_spin: %u\n", samples, avg, max, nr_!

   free(samples);

   return;
}

int main(int argc, char ** argv)
{
   u64         start = 0, stop = 0;
   u32         nr_threads = 0;
   u32         nr_samples = 0;
   u32         i;
   pthread_t * threads;
   sample_t ** results;


   if (argc != 3) {
usage:
      printf("Usage: %s nr_threads nr_iter\n", argv[0]);
      printf("          nr_threads <= 1000\n");
      exit(1);
   }

   nr_threads = strtoul(argv[1], NULL, 0);

   if ((nr_threads > 1000) || !nr_threads)
      goto usage;

   nr_samples = strtoul(argv[2], NULL, 0);

   if (!nr_samples)
      goto usage;
   
   if (!(threads = malloc(sizeof(pthread_t) * nr_threads)))
      die("Unable to malloc array - %m");

   if (!(results = malloc(sizeof(sample_t *) * nr_threads)))
      die("Unable to malloc results array - %m");

   for (i = 0; i < nr_threads; i++) {
      if (pthread_create(&threads[i], NULL, wr_thread, ((void *) (u64) nr_samples)) != 0)
         die("Unable to create thread = %m");
   }

   fprintf(stdout, "Launched %u threads.\n", nr_threads);

   start = timestamp();
   
   start_flag = 1;

   for (i = 0; i < nr_threads; i++)
      pthread_join(threads[i], (void *) &results[i]);

   stop = timestamp();

   for (i = 0; i < nr_threads; i++)
      process_thread_result(results[i], nr_samples);

   printf("t2lock_test.c:\nElapsed time: %2.4f s\n", ((double) (stop - start)) / 1000000.0);
   printf("Transaction latency: %1.3fns\n", (((double) (stop - start)) * 1000.0) / ((double) (nr_samples * nr_thre!

   if (obj.value != (nr_samples * nr_threads))
      die("obj.value = %qu; (nr_samples * nr_threads) = %u\n", obj.value, (nr_samples * nr_threads));

   free(threads);
   free(results);

   exit(0);
}

Hopefully the line-wrap does not cause problems.

I am still about as mystified as I was last week so any input will be greatly appreciated.

soq · 07-11-2023

In the update I posted, which has not yet been approved, I misspoke: I now realize that the 'temp' and 'obj' variables occupy adjacent cachelines, so the logical conclusion is that the proximal cause of the errors I am experiencing is likely related to that fact.
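The line arithmetic can be checked with a small helper. This sketch assumes 64-byte cachelines; `cacheline_of` and `CACHELINE_SIZE` are made-up names.

```c
#include <assert.h>
#include <stdint.h>

#define CACHELINE_SIZE 64u   /* assumption: 64-byte cachelines */

/* Index of the cacheline that contains the byte at addr. */
uint64_t cacheline_of(uint64_t addr)
{
    return addr / CACHELINE_SIZE;
}
```

With the addresses from the objdump output, 'obj' (0x40a0, 16 bytes) lands entirely in line 0x102 and 'temp' (0x40c8) in line 0x103, i.e. adjacent lines. A page-aligned load offset does not change that relationship.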

I would have edited the post, but I can no longer find the link to it.

soq · 07-25-2023

Ok... First off, I realize that the code I posted has two incorrect lines that were clipped by my editor window. There are two f/printf lines ending with '!', so I might as well include the code again.

I apologize for the error.

That said, this morning I read about CVE-2023-20593, which does not specifically address my issue, but piqued my curiosity since it affects my cores. Tavis Ormandy at Google describes the bug: https://lock.cmpxchg8b.com/zenbleed.html

In essence, he found a bug in the Zen 2 architecture where speculative execution combined with a branch misprediction lets data from other processes leak into the upper halves of SIMD registers. There are probably other implications.

In my case, something is apparently happening to the cmpxchg operations that seemingly results in two threads conflicting in rare instances. I have one MCS lock algorithm that fails with p ~< 1:5000000.

The linked article describes creating an 'oracle' with code instrumented with memory fence operations to validate the correct operation of the algorithm by removing a lot of superscalar effects. So I inserted a fence operation at the start of the cmpxchg macro, and that "fixed" the problem. It doesn't seem to matter which fence operation is used. However this success is not repeated for algorithms that are more complex, such as a read-write ticket lock that fails due to lock structure corruption -- recall that the code included here has a slightly less intrusive error that does not corrupt the lock structure as such, only protected data. I currently have about fourteen test programs for variations on three algorithms and I will find out which ones remain broken shortly.

It is unclear to me whether the problem I am experiencing is directly related to the details of the CVE analysis, or whether a different stage of the pipeline is at fault. Nevertheless it is encouraging in a sense to find out that memory barriers have a mitigating effect.

The linked article describes a fix involving setting an MSR on all processors to mitigate the bug. I will attempt that once I figure out how to coerce udev to create the /dev/cpu/* entries for rdmsr and wrmsr on my machine. In the CVE, AMD projects a microcode update for the problem in December.

While the data I have still seems to point to a hardware bug I have no other machines currently to test this code on different hardware. I suppose technically my specific machine could be defective in some subtle way, but I have no way of establishing this on my own. Perhaps now that I have example code that will compile properly some kind soul could test it on their machine to verify whether or not this is a problem that affects other [AMD] machines. (The possibility still exists that my code is subtly wrong.) I simply do not have anything like FAANG resources ATM so I'm stuck with only the HP laptop.

/*
   FILE:    t2lock_test.c

   COMPILE:

      gcc -pthread -std=c99 -Wall -Wextra -Werror -m64 -O3 -march=znver2 -ggdb \
                -o t2lock_test t2lock_test.c  -lrt

      Add -DBROKEN to generate a failing test and remember to set the architecture for your machine.

*/
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>
#include <unistd.h>
#include <stdbool.h>

#if !defined T1LOCK_PAUSE_THRESH
#define T1LOCK_PAUSE_THRESH   10
#endif

typedef unsigned long long u64;
typedef unsigned int u32;
typedef unsigned short u16;

#define expect_false(e)    __builtin_expect((e), 0)
#define expect_true(e)     __builtin_expect((e), 1)
#define if_un(e)           if (expect_false(e))
#define if_l(e)            if (expect_true(e))

#define die(fmt, arg...)   { fprintf(stdout, "%p:%s:  " fmt, \
               (void *) pthread_self(), __func__, ## arg); abort(); }

#define ASM_CPU_PAUSE        __asm__ volatile ("pause     \n\t" ::: )

typedef struct {
   u64   count;
   u64   cpu;
} tsc_timestamp;

#define ASM_CPU_RDTSCP(value, cpu)   __asm__ volatile (                    \
                                        "rdtscp                    \n\t"   \
                                        "shl    $0x20, %%rdx       \n\t"   \
                                        "or     %%rax, %%rdx       \n\t"   \
                                        "movq   %%rdx, %[v]        \n\t"   \
                                        "movq   %%rcx, %[c]        \n\t"   \
                                        : [v] "=g" (value),                \
                                          [c] "=g" (cpu)                   \
                                        :                                  \
                                        : "%rax", "%rcx",                  \
                                          "%rdx", "memory", "cc")

#define BEGIN_TSC_TIMING                  \
   {                                      \
      tsc_timestamp ts1;                  \
      tsc_timestamp ts2;                  \
      ASM_CPU_RDTSCP(ts1.count, ts1.cpu); \

#define END_TSC_TIMING(result)            \
      ASM_CPU_RDTSCP(ts2.count, ts2.cpu); \
      if(ts1.cpu != ts2.cpu)              \
         result = 1;                      \
      else                                \
         result = ts2.count - ts1.count;  \
   }

#define ASM_ATOMIC_CMPXCHGW(ptr, old, new, result) __asm__ volatile (      \
                           "lfence                             \n\t"       \
                           "xor       %[r], %[r]               \n\t"       \
                           "movw      %[o], %%ax               \n\t"       \
                      "lock cmpxchgw  %[n], (%[p])             \n\t"       \
                           "jne  0f                            \n\t"       \
                           "xor  $1,  %[r]                     \n\t"       \
                           "0:                                 \n\t"       \
                           : [r] "=&r" (result),                           \
                             [o] "+r" (old)                                \
                           : [p] "r" (ptr),                                \
                             [n] "r" (new)                                 \
                           : "cc", "memory", "%ax")

typedef struct {
   u16 ticket;
   u16 queue;
} t1lock;

#define T1LOCK_INITIALIZER (t1lock) {     \
   .ticket = 1,                           \
   .queue = 0                             \
}

__inline__ void   t1lock_release(t1lock * oo)
{
   oo->ticket++;

   return;
}

__attribute__((noinline)) void   t1lock_relax(u32  nr_spin)
{
   if_l (nr_spin < T1LOCK_PAUSE_THRESH) {
      ASM_CPU_PAUSE;
      return;
   }

   if_l (nr_spin < 65536) {
      sched_yield();
      return;
   }

   if_un (nr_spin == 65536)
      fprintf(stdout, "Thread may be deadlocked.\n");

   sleep(1);

   return;
}

inline bool t1lock_cmpxchg(t1lock * oo, u16 old, u16 new)
{
   u32   result;

   ASM_ATOMIC_CMPXCHGW(&oo->queue, old, new, result);

   return result;
}

u32   t1lock_acquire(t1lock * oo)
{
   t1lock old, new;
   u32   nr_spin = 0;
   u16   ticket;

spin:
   nr_spin++;

   old = new = *((volatile t1lock *) oo);

   if_un ((old.queue + 1) == (old.ticket - 65400)) {
      t1lock_relax(nr_spin);
      goto spin;
   }

   ticket = ++new.queue;

   if_un (!t1lock_cmpxchg(oo, old.queue, new.queue)) {
      t1lock_relax(nr_spin);
      goto spin;
   }

wait:
   if (ticket != ((volatile t1lock *) oo)->ticket) {
      t1lock_relax(nr_spin);
      nr_spin++;
      goto wait;
   }

   return nr_spin;
}

u64   timestamp(void) /* microseconds */
{
   struct timespec ts;
   clock_gettime(CLOCK_MONOTONIC, &ts);
   return (ts.tv_sec * 1000000000 + ts.tv_nsec) / 1000;
}

typedef struct {
   t1lock   lock;
   u64      value;
} spr_t;

spr_t    obj = { .lock = T1LOCK_INITIALIZER, .value = 0 };

volatile u64   temp;  /* Unprotected scribble var */

volatile int   start_flag = 0;

typedef struct {
   u32   cycles;
   u32   nr_spin;
} sample_t;

void * wr_thread(void * data)
{
   u64      T;
   sample_t *samples;
   u32      nr_rounds, i, nr_spin;

   nr_rounds = (u64) data;

   samples = malloc(nr_rounds * sizeof(sample_t));
   if (!samples)
      die("Unable to allocate sample buffer - %m");

   while (!start_flag)
      ASM_CPU_PAUSE;

   for (i = 0; i < nr_rounds; i++) {
      BEGIN_TSC_TIMING;
      nr_spin = t1lock_acquire(&obj.lock);
#if defined BROKEN
      temp = ++obj.value;
#else
      ++obj.value;
#endif
      t1lock_release(&obj.lock);
      END_TSC_TIMING(T);

      if (T <= 28)   /* rdtscp is usually about 28 cycles */
         T = 1;
      else
         T -= 28;

      samples[i].cycles = T;
      samples[i].nr_spin = nr_spin;
   }

   return (void *) samples;
}

void  process_thread_result(sample_t * samples, u32 nr_samples)
{
   u64      max = 0;
   u64      avg = 0;
   double   nr_spin = 0.0;
   u32      max_nr_spin = 0;
   u32      i;

   if (!nr_samples || !samples)
      die("Usage.");

   for (i = 0; i < nr_samples; i++) {
      if (max < samples[i].cycles)
         max = samples[i].cycles;

      if (max_nr_spin < samples[i].nr_spin)
         max_nr_spin = samples[i].nr_spin;

      avg += samples[i].cycles;
      nr_spin += samples[i].nr_spin;
   }

   avg /= nr_samples;
   nr_spin /= nr_samples;

   fprintf(stdout, "%p:  avg: %-4qu   max: %-4qu   nr_spin_avg: %2.4f   max_nr_spin: %u\n",
           samples, avg, max, nr_spin, max_nr_spin);

   free(samples);

   return;
}

int main(int argc, char ** argv)
{
   u64         start = 0, stop = 0;
   u32         nr_threads = 0;
   u32         nr_samples = 0;
   u32         i;
   pthread_t * threads;
   sample_t ** results;


   if (argc != 3) {
usage:
      printf("Usage: %s nr_threads nr_iter\n", argv[0]);
      printf("          nr_threads <= 1000\n");
      exit(1);
   }

   nr_threads = strtoul(argv[1], NULL, 0);

   if ((nr_threads > 1000) || !nr_threads)
      goto usage;

   nr_samples = strtoul(argv[2], NULL, 0);

   if (!nr_samples)
      goto usage;
   
   if (!(threads = malloc(sizeof(pthread_t) * nr_threads)))
      die("Unable to malloc array - %m");

   if (!(results = malloc(sizeof(sample_t *) * nr_threads)))
      die("Unable to malloc results array - %m");

   for (i = 0; i < nr_threads; i++) {
      if (pthread_create(&threads[i], NULL, wr_thread, ((void *) (u64) nr_samples)) != 0)
         die("Unable to create thread = %m");
   }

   fprintf(stdout, "Launched %u threads.\n", nr_threads);

   start = timestamp();
   
   start_flag = 1;

   for (i = 0; i < nr_threads; i++)
      pthread_join(threads[i], (void *) &results[i]);

   stop = timestamp();

   for (i = 0; i < nr_threads; i++)
      process_thread_result(results[i], nr_samples);

   printf("t2lock_test.c:\nElapsed time: %2.4f s\n", ((double) (stop - start)) / 1000000.0);
   printf("Transaction latency: %1.3fns\n", (((double) (stop - start)) * 1000.0)
          / ((double) (nr_samples * nr_threads)));

   if (obj.value != (nr_samples * nr_threads))
      die("obj.value = %qu; (nr_samples * nr_threads) = %u\n", obj.value, (nr_samples * nr_threads));

   free(threads);
   free(results);

   exit(0);
}

Memory corruption and atomic operations