A Pair of (somebody else’s) Concurrency Bugs

For this blog – a pair of (somebody else’s) concurrency bugs.  Well, more specifically spin-loop bugs…

Java Grande Forum – LUFact Tournament Barrier

The Java Grande Forum benchmark suite includes a parallel LU decomposition which uses a Tournament Barrier synchronization mechanism.  However, the JGF implementation is buggy in at least 2 ways.  In practice it works when there are no more worker threads than CPUs.  If there are more workers than CPUs then it only works if the OS is polite (i.e., fair scheduling) and the JVM is dumb (no hoisting of invariant array loads out of a spin-loop).  Azul’s JVM and Linux’s default CFS fail both tests (or succeed, depending on your point of view: smart JVM optimizations and aggressive throughput scheduling), and the combination can make LUFact hang indefinitely.  In truth I never saw it hang forever, but runtimes would be 2x to 10x worse than the best case.  Enough Gloom and Doom – so what’s the bug?

The core of LUFact’s Barrier mechanism is an evil little loop that waits until all the worker threads are done… by spin-waiting.  Spin-waiting works when you have more real CPUs than spinning threads, because the threads needing to do Real Work ALSO get a CPU.  Spinning is typically very bad if you don’t have enough CPUs: threads spinning on a condition are making only apparent work, and the real CPU time they burn doing that apparent work keeps the Real Work from getting done.  Example time!  Suppose I run these 2 threads on a 1-CPU machine:

  Thread 1:  while (!flag ) /*nothing*/ ; return; // Spin, spin, spin until flag gets set...
  Thread 2:  for( i=0; i<LONG_TIME; i++ ) ...do Real Work...; flag = true; return; // work done!!

Now suppose the OS scheduler decides to give each thread 50% of the one CPU.  Then each time Thread 1 gets on the CPU and runs… it only spins.  On Thread 2’s turn it works its little butt off… but with only half of the CPU time.  Runtime for the total program is 2x the hypothetical best scheduling.  Instead suppose the OS decides to let a thread run until that thread blocks (e.g. on I/O) or exits the program, and starts with Thread 1… then Thread 2 might NEVER RUN and the program can hang.  However on a 2-CPU machine both threads run full-tilt until Thread 2 gets done and flips the flag… so the program exits as soon as Thread 2 does all the Real Work.  Let me repeat that: ONE CPU machine ==> program hangs;  TWO CPU machine ==> program runs fine.
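Here’s a runnable version of that toy example, just as a sketch (the class and names are mine, not from any benchmark).  Note that I’ve made ‘flag’ volatile here, so the only issue in play is scheduling – the memory-model half of the story comes later in this post.

  class SpinDemo {
    static volatile boolean flag = false;

    public static void main( String[] args ) throws InterruptedException {
      Thread t1 = new Thread(() -> { while( !flag ) ; });     // Thread 1: spin until flag gets set
      Thread t2 = new Thread(() -> {
        long sum = 0;
        for( long i = 0; i < 2000000000L; i++ ) sum += i;     // Thread 2: the Real Work
        System.out.println("Real Work done, sum="+sum);
        flag = true;                                           // work done - signal the spinner
      });
      t1.start();  t2.start();
      t1.join();   t2.join();
      // With 2+ CPUs this finishes as fast as Thread 2 can work; with 1 CPU the
      // spinner steals CPU time (or, with a run-to-block scheduler, hangs the program).
    }
  }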

Let’s make the example a little more realistic.  Suppose we want to divvy up a large pile of work amongst N threads and then signal when ALL the work is done.  Divvying up the work is a simple divide-by-N, but after each thread has done its 1/Nth share it needs to signal & wait for all the other threads to be done.  SOMEBODY needs to notice when all threads are done and signal… and that’s the job of a Barrier synchronization.  So here’s a little program that has a barrier sync in the middle:

  volatile boolean barrier[];
  void do_work( int tid/*thread id*/, int num_threads ) {
    int share = N/num_threads;
    for( int i=tid*share; i<(tid+1)*share; i++ )
      A[i] = fcn(A[i]);    // work on my share
    // Work Done!  Now comes the Barrier Synchronization
    if( tid != 0 ) {       // Thread zero is SPECIAL...
      barrier[tid] = true; // Reached the Barrier!  This thread (which is NOT thread zero) is done!
      while( !barrier[0] ) /* nothing */; // Spin until Thread zero reports done

    } else {               // Thread zero has more work to do than other threads...

      int done_threads=1;  // Thread zero spins until all other threads are done
      while( done_threads < num_threads ) { // Thread 0 spins until all threads are done
        done_threads = 1;
        for( int i=1; i<num_threads; i++ )
          if( barrier[i] == true )
            done_threads++;
      }
      // At this point Thread 0 has noticed that all threads are reporting as
      // 'reached the barrier' and will now exit.
      barrier[0] = true;
    }
  }

This code is a simple version of what LUFact’s Barrier is trying to do, and it has TWO failure modes.  We’ve been talking about the first failure mode: threads spin when they are out of Real Work, and that spinning burns CPU time, taking it away from other threads that have Real Work.  Indeed, when I run this with 32 threads on a 16-way Intel Nehalem box I routinely see runtimes from 2x to 10x worse than the expected best case (achieved when there are exactly as many threads as CPUs).  The coders of the LUFact barrier were not so naive as this algorithm suggests; they added a Thread.yield() call in the middle of the spin-loops.  But…

Thread.Yield is Broken

But, alas, Thread.yield() is broken.  In theory it is supposed to ‘yield’ the currently running thread to some other runnable thread on the same CPU, letting somebody else do work instead of this thread spinning.  THIS DOES NOT WORK, for several reasons.

  1. There are zillions of spinning threads.  If the OS actually schedules another thread, there is only a small chance you’ll get one with Real Work to do instead of Yet Another Spinning Thread… which will simply call Yield again.
  2. With all those threads calling Yield, the OS thread scheduler gets slammed with constant context-switch requests.  You spend all of your time in the kernel context-switching.
  3. The OS likes to switch between a small number of threads per CPU (better cache locality, better throughput)… so if I have enough threads spinning then each CPU might end up ping-ponging between TWO spinning threads each calling Yield… and never allow the one last guy with Real Work onto any CPU.
  4. More realistically, if the box is doing large batch jobs and is otherwise busy, then the last LUFact thread trying to get Real Work done is throwing its request for CPU time in with all the other LUFact threads… and all the other busy processes on this busy box.  If I’m running LUFact with 32 threads, then I have 31 spinning threads and 1 working thread… so the odds of the OS scheduler picking my one dude needing to work fall off to 1/32.  (Notice that the OS thinks I have 32 threads usefully doing work, because the OS has no clue that calling Yield means “this thread is waiting for somebody else to change something”.)

All of these reasons have made Yield a sensitive, hot-button kind of call to JVM engineers… so there are lots of JVM switches you can throw to change how the JVM treats Yield.  It can be ignored (no context switch, -XX:+DontYieldALot), it can call the OS’s yield equivalent (sometimes a no-op on some OS’s, sometimes it forces a context switch… but with no guarantee that eventually all threads will get a CPU), or it can do a micro-second sleep (sometimes only after the N’th yield call, or sometimes with the sleep-time ramping up across successive Yield calls).  In short, JVM engineers have fumbled around with the Yield call trying to figure out how to make it behave such that the common-case usage pattern of too-many-threads calling Yield in a spin-loop does something useful.  Right now Azul’s JVM defaults to Yield being a micro-second sleep, but I’m leaning towards having it behave like Ethernet’s exponential-backoff protocol: successive yield calls should sleep() longer and longer.  Since Yield has such a wide variety of behaviors, and flags to control those behaviors, and none seem to work very well in practice:

Please, Oh God, Please do not make your concurrency algorithm depend on Thread.yield!!!
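For what it’s worth, here’s a minimal sketch of the exponential-backoff idea in user code (my own illustration – not LUFact’s code and not what Azul’s Yield does internally).  Spin briefly, then park for exponentially longer intervals instead of hammering Yield; the AtomicBoolean is an assumption, standing in for any flag with proper volatile semantics.

  import java.util.concurrent.atomic.AtomicBoolean;
  import java.util.concurrent.locks.LockSupport;

  class BackoffSpin {
    // Wait for 'done' to flip to true.  The AtomicBoolean load has volatile
    // semantics, so the JIT cannot hoist it out of the loop.
    static void awaitDone( AtomicBoolean done ) {
      int  spins = 0;
      long nanos = 1000;                      // start backing off at 1 microsecond
      while( !done.get() ) {
        if( ++spins < 100 ) continue;         // spin a little first; cheap if 'done' flips soon
        LockSupport.parkNanos(nanos);         // otherwise sleep...
        nanos = Math.min(nanos*2, 1000000);   // ...doubling each time, capped at 1 millisecond
      }
    }
  }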

LUFact Barrier’s Other Bug:

They forgot to do any volatile operations in the spin-loops (or maybe the benchmark code predates the Java Memory Model).  All those threads spinning on Thread zero completing?  Really they are spinning on ‘barrier[0]’ changing state… but that is a simple array load.  It’s obviously an invariant load in those spin loops… so the JIT does the obvious thing and hoists the load out of the loop – making it an INFINITE spin loop calling Yield.  BUT WAIT!!! I see the source code and there’s a volatile keyword on the barrier array!   Alas… that makes the load of the barrier array reference itself volatile… but not the load of any array element.  You cannot make arrays of volatile primitive elements in Java.  You can do volatile-like array loads using the Unsafe classes… but there is no way to express the intent in plain bytecodes or the class-file format (never mind from Java), so no JVM-based language can do this directly.
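For the record, one way to get volatile semantics per element without dropping to Unsafe yourself is java.util.concurrent.atomic’s AtomicIntegerArray (which does the Unsafe volatile array accesses under the hood).  A minimal sketch of a barrier flag array done that way – my illustration, not a patch to the JGF code:

  import java.util.concurrent.atomic.AtomicIntegerArray;

  class BarrierFlags {
    // 'volatile boolean barrier[]' makes the *array reference* volatile, not the
    // elements.  AtomicIntegerArray gives every element a volatile load/store,
    // so the JIT cannot hoist the load out of a spin-loop.
    private final AtomicIntegerArray barrier;
    BarrierFlags( int num_threads ) { barrier = new AtomicIntegerArray(num_threads); }
    void    reached( int tid ) { barrier.set(tid, 1); }          // volatile store
    boolean isDone ( int tid ) { return barrier.get(tid) == 1; } // volatile load, never hoisted
  }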

I smartened up the Azul JIT a little, and it hoisted the invariant load… and suddenly LUFact’s barrier spin loops spin forever.  Bleah… I’m stuck between dumbing down my JVM and failing to run a major public benchmark.

And now for something completely different:

Disruptor

Disruptor is this nifty idea: very low latency task handoff between threads in a FIFO queue (or sorta like StreamIt’s idea of a web of pipelines).  The core algorithm involves a ring buffer holding tasks and a version number to control when things are in (or out of) the buffer.  Threads spin to insert or remove items.  In theory it’s a fast, elegant way to get multiple CPUs to parcel out high-volume small units of work (imagine reading network packets and making stock-market trading decisions in nanoseconds).
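To make the shape of the thing concrete, here’s a toy single-producer/single-consumer sketch of the idea – emphatically NOT the Disruptor API, just my illustration of a ring buffer plus sequence counters with spinning on both ends:

  import java.util.concurrent.atomic.AtomicLong;

  class ToyRing<T> {
    private final Object[] ring;                        // power-of-two sized ring of tasks
    private final int mask;
    private final AtomicLong head = new AtomicLong(0);  // next slot to write
    private final AtomicLong tail = new AtomicLong(0);  // next slot to read

    ToyRing( int sizePow2 ) { ring = new Object[sizePow2]; mask = sizePow2-1; }

    void put( T task ) {                                // single producer
      long h = head.get();
      while( h - tail.get() == ring.length ) ;          // spin: ring is full
      ring[(int)(h & mask)] = task;
      head.set(h+1);                                    // volatile store publishes the task
    }

    @SuppressWarnings("unchecked")
    T take() {                                          // single consumer
      long t = tail.get();
      while( head.get() == t ) ;                        // spin: ring is empty
      T task = (T)ring[(int)(t & mask)];
      tail.set(t+1);                                    // volatile store frees the slot
      return task;
    }
  }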

What Goes Right

Disruptor is probably 10x faster than the equivalent code written against the JDK 5 BlockingQueue implementation.  That’s a lot faster.  It’s a little trickier to set up the boilerplate, but there’s clearly a substantial payoff.  And here’s an issue that didn’t happen for me:

(From the authors of Disruptor: ) “The Diamond example can, but not always, show up a serious issue with the Oracle JVM and how it causes false sharing to happen with its card marking algorithm for GC.”

(My reply) “Yeah, we filter cardmarks between generations.  Costs (very slightly) more in the no-contention case, but saves a ton when there are frequent cross-generation writes to the same cache-lines, i.e., when you are frequently writing new objects into an old Java array.  Happens to Doug Lea also, so he talked us into filtering card marks years ago.”
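For context, this is roughly the write pattern in question (my illustration, not the Disruptor source): a long-lived ring array that gets tenured into the old generation, constantly being filled with freshly allocated young-generation objects.  Every store card-marks the same few cache lines, and unfiltered card marks from multiple writer threads then false-share those lines.

  class Entry { final long seq; Entry( long seq ) { this.seq = seq; } }

  class CardMarkPattern {
    static final Object[] RING = new Object[1024];   // lives long enough to get tenured
    static void produce( long seq ) {
      RING[(int)(seq & 1023)] = new Entry(seq);      // young object stored into an old array
    }
  }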

What Goes Wrong

  1. The Google Code testing harness mixes both Disruptor and BlockingQueue results together in a single timing number.  If you just run ‘ant throughput:test’ you get timings out… but they are the sum of the Disruptor times and the alternative BlockingQueue times.  Since the BQ time is often 10x slower than the Disruptor time, the reported results are 90% due to the BQ time.  In short, the “out of the box” experience with Disruptor performance is misleading.  Note that once the tests are pulled out of the ant script it is possible to run the Disruptor and BQ tests independently, and in any case the results are presented separately.  It’s only when run from the ant script in the default way that you get blended results.  Fix your ‘ant’ script, guys, so we can see the apples side-by-side instead of stacked together.
  2. The throughput tests are all done as classic bad MicroBenchmarks; I rant against these all the time.  See Caliper for a better answer.  In this case I saw Bad Code from the required OSR compiles, Bad Code from using ‘long’ types for the for-loop iterators, and the classic 1st-run-fast-2nd-run-slow syndrome.  Profiling ended up profiling the 90% case… which was the BlockingQueue implementation.  I hacked Azul’s JVM and managed to improve our BQ performance (yeah!), but I don’t think the goal of the Disruptor microbenchmarks was to help make BQ faster!  In any case I eventually figured out that I was running both benchmarks together, then figured out how to run the pieces independently of each other, and finally saw Disruptor running 10x faster than the BlockingQueue implementation.  I fixed many of the Bad Code issues in Azul’s JVM to be nice, but that same Bad Code will probably never occur in any real use-case of Disruptor.  I wish they had included realistic use-cases!
  3. Disruptor runs hard up against the limits of modern hardware… and its performance becomes extremely sensitive to the CPU placement of the working threads.

This last point will be harder for the Disruptor guys to address.  Basically, if the work producer and the work consumer land on the same server *socket* and share the same L3… then performance is Good.  If they land on different sockets and have to communicate by sending packets through the motherboard (Intel QPI links)… then performance is 1/3 of Good.  Let me repeat this: based on the vagaries of OS scheduling, performance varies by a factor of 3… and in quantum jumps at that.  (To put it in context: 3x slower than Good is still more than 3x faster than any alternative solution.)

I hacked their original benchmark to report performance (throughput) results once per second.  I ran Disruptor for a loooong time on a 16-way Nehalem box which was otherwise idle.  Fairly rapidly the JVM hits steady state – no more JIT’ing, GC is completely stable, etc.  By watching perfbar I can see 2 CPUs working furiously away… and I can tell which 2 CPUs they are.  Performance stabilizes at N ops/sec.  Then I perturb the system, and the OS schedules the working threads on 2 different CPUs… and the system gets stable again… but at a radically different performance level.  This happened repeatedly for me, across different JVMs and different Intel hardware platforms (different generations of CPUs, different numbers of sockets).  In short, the best performance requires the OS to schedule the worker threads on the same socket but not the same CPU.  This is a hard problem to solve in general, EVEN IF you own the OS “stack” and can modify the scheduler.  Basically the OS does not realize the 2 threads are “talking” to each other via the cache hierarchy, and so it has no idea how to best schedule the threads on CPUs.  You get pot-luck on performance, with a variation of 3x.  The situation only gets worse when the pipeline has more CPUs or a more complex communication pattern.  Blah.

While we wait for OS’s to Do The Right Thing here, Java needs some kind of Processor & Socket Affinity calls.

Almost forgot – the connection to spin-loops?  I found lots when profiling BlockingQueue (including calls to Thread.yield), along with many hot calls to Park and Unpark (now a C2 JIT compiler intrinsic).  I also found one when profiling Disruptor… but only after I had first made some kind of JIT error.  The Disruptor microbenchmarks clean up at the end of the run by waiting in a spin-loop until the queues drain out… and if you make a JIT error the queues might never report as drained, so some CPU spins forever in a hot loop and becomes very visible to a profiler.  Spin-loops are OK for Disruptor, which pretty much by definition is busted if it runs without each thread having a dedicated CPU.

All my bugs are fixed now… and thus endeth my week of looking at spin-loops.

Cliff