Talking with Google…

[Part 2 of my NonBlockingHashMap presentation, covering ‘put’, will appear later]

 

I presented my NonBlockingHashMap at Google yesterday.  Overall, the talk went over very well with lots of intelligent questions.  This bodes well for presenting this stuff at JavaOne.  To be blunt, I was very concerned that I’d be talking over most folks’ heads at JavaOne, but maybe not.

 

Slides: http://www.azulsystems.com/events/javaone_2007/index.htm 
Video: http://video.google.com/videoplay?docid=2139967204534450862

 

As part of the Q&A somebody asked me about a narrow race that happens when installing a new table.  The HashTable auto-resizes, and periodically a new table has to be installed.  During the install process many threads can all independently decide that a table-resize is required, and then all go and create a new empty table and attempt to install it.  Only one succeeds in the install (via a CAS) and the rest just drop their extra tables on the floor and let GC reclaim them.  I had supposed that the extra tables created here never amounted to very much – the race seems very narrow and is clearly very rare for modest-sized tables (based on a bunch of profiling).  And I said as much during Q&A.
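Here is a minimal sketch of that install step (the AtomicReference field and names below are illustrative assumptions, not the actual NonBlockingHashMap code):

    import java.util.concurrent.atomic.AtomicReference;

    // Illustrative sketch only: many threads race to build and install a new
    // table; one CAS wins, and the losers' freshly-zero'd tables become garbage.
    class ResizeSketch {
      final AtomicReference<Object[]> _newTable = new AtomicReference<Object[]>();
      Object[] resize( int newSize ) {
        Object[] newtab = new Object[newSize*2];    // 2 words per {key,value} pair
        if( _newTable.compareAndSet(null, newtab) )
          return newtab;                            // we won the install race
        return _newTable.get();                     // we lost; 'newtab' is left for GC
      }
    }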

 

But for high thread & CPU counts (500 to 750+) and very large tables (4 million entries, times 2 words (Key,Value), times 8 bytes (64-bit VM), so a 64Mbyte array) I had observed some odd timing problems which I suspected were GC related.  So I profiled a little more… and Lo!  The Google engineer was right (sorry I didn’t get your name!)… a 64Mbyte array takes some time to create – because it has to be zero’d.  During that time more threads figure out that the table needs resizing, so they also make 64M arrays.  The sum of them starts triggering various slow-path GC issues (emergency heap-growth, large-object allocation space, GC decisions, etc.), which gives more time for still more threads to discover that a table resize is needed.  Pretty soon I’m making a few hundred 64M arrays, all of which are dead (except the one winner), and the GC suffers a major hiccup.

 

My fix goes against the spirit of Non-Blocking algorithms: I stall all but the first few threads making huge arrays.  The stalled threads are basically waiting to see if a new table shows up.  The algorithm is still non-blocking in the end: if the proposed new table doesn’t show up the stalled threads eventually get around to making one themselves.  But by stalling a little I nearly always avoid the problem; threads that have “promised” to make a new array are in fact busy making one – I just need to give them a little  more time.
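A rough sketch of the stall heuristic (the counter, threshold, and yield loop below are illustrative assumptions, not my exact fix):

    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.concurrent.atomic.AtomicReference;

    // Illustrative sketch only: the first couple of resizers build the huge
    // array right away; later arrivals briefly wait for the promised table,
    // then fall back to building one themselves, so the algorithm remains
    // non-blocking in the end.
    class StallOnResizeSketch {
      final AtomicReference<Object[]> _newTable = new AtomicReference<Object[]>();
      final AtomicInteger _resizers = new AtomicInteger();
      Object[] resize( int newSize ) {
        int r = _resizers.incrementAndGet();
        if( r > 2 && newSize >= (1<<20) ) {              // many resizers and a huge table?
          for( int i=0; i<1000 && _newTable.get()==null; i++ )
            Thread.yield();                              // stall a little, never forever
        }
        Object[] newtab = _newTable.get();
        if( newtab != null ) return newtab;              // the promised table showed up
        newtab = new Object[newSize*2];                  // else make one ourselves
        return _newTable.compareAndSet(null,newtab) ? newtab : _newTable.get();
      }
    }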

 

This issue points out one of the interesting discrepancies between Java and C.  In C, I could malloc a few hundred 64M arrays relatively quickly: they would all end up mmap’ing virtual memory but NOT physical memory.  Then when I freed the extras, I’d get all that virtual memory back – but I’d never end up touching any physical memory.  Once I had a winner determined, I could initialize that 64M array in parallel.  In Java, I only get the pre-initialized array, and it’s normally zero’d by a single thread.  It takes some substantial time to zero a 64M array.

 

I could also go with ArrayLets in Java – basically an array of arrays.  For a 4M-element array I’d actually make a single 2048-element outer array (of inner arrays) and zero only that.  The inner arrays would get lazily created on demand.  This would let me spread out the cost of zero’ing both over time and across threads.  Biggest downside: every HashTable access would require an extra indirection (and range check) to get to the real elements.
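A minimal sketch of the ArrayLet idea (chunk size and names are illustrative; a real concurrent version would also need to CAS each inner array into place):

    // Illustrative ArrayLet sketch: a 2-level array whose inner chunks are
    // created (and zero'd) lazily, spreading the zero'ing cost over time and
    // across the threads that first touch each chunk.
    class ArrayLet {
      static final int CHUNK = 2048;                  // elements per inner array
      final Object[][] _chunks;                       // small outer array, zero'd up front
      ArrayLet( int len ) { _chunks = new Object[(len+CHUNK-1)/CHUNK][]; }
      Object get( int idx ) {
        Object[] c = _chunks[idx/CHUNK];              // extra indirection + range check
        return c == null ? null : c[idx%CHUNK];
      }
      void set( int idx, Object v ) {
        Object[] c = _chunks[idx/CHUNK];
        if( c == null )                               // lazily create this chunk
          c = _chunks[idx/CHUNK] = new Object[CHUNK]; // (real code would CAS it in)
        c[idx%CHUNK] = v;
      }
    }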

 

At the moment, I’m going to live with my sleazy stall-on-resize-of-huge-arrays.  It appears to work really well, and it’s really simple and fast.

 

Moral: Peer review is Good, and I’ll be a little less smug about saying ‘that race is really rare’ in the future!

 

Cliff

A Non-Blocking HashTable

I’ve been wrestling with concurrent algorithms again.  This time, it’s a Non-Blocking (concurrent, lock-free) HashTable.  I’ve had it basically figured out for a few months now and I’m slowly widening the circle of people I expose it to.  This HashTable seems too good to be true, which is why I’m trying hard to vet its correctness before claiming victory.  Its single-threaded performance appears to equal java.util.HashTable, and at high CPU count (750 cpus on a high-end Azul machine) it’s about twice as fast as Doug Lea’s marvelous java.util.concurrent.ConcurrentHashMap – with 4096-way striping.  This is for very high read-to-write ratios (99% reads, 1% table mods).  For higher write ratios the Non-Blocking table scales much better than ConcurrentHashMap.  Of course, if you lower the 4096-way striping to the default 16-way, then the Non-Blocking table becomes easily 100x faster.  Performance summary:

  1. As fast as java.util.HashTable for single threads
  2. As fast as ConcurrentHashMap for < 32 cpus and >95% read rates
  3. Faster for higher write rates than ConcurrentHashMap.  (2x to 8x faster, depending)
  4. Faster for more CPUs.  Much faster if not striping ConcurrentHashMap enough (easily 100x faster)
  5. Scales linearly for all read/write rates at least up to 750 cpus.  Tested on 4-way X86, Niagara, 32-way UltraSparc2, 384-way Azul Vega1 and a 768-way Azul Vega2.

Ok, how do I do it?

 

By writing a State-Machine based algorithm, instead of the usual concurrent programming style.

 

From this idea I get a simple state machine, a short Java implementation of it, and a very fast very scalable algorithm. 

 

First the easy stuff: I assume you have written hash tables in the past (nearly every non-Java programmer does at one point or another; the Java guys get it easy).  A hash table is a collection of {key,value} pairs with fast random access (first access is via the hash function).  I’ve chosen a power-of-2 closed table – collisions cause re-probing into the table (contrast this to an open table where collisions turn into linked-list traversals).  I chose power-of-2 instead of a prime number because ‘AND’ is faster than ‘MOD’ by an amount larger than an L1-miss-L2-hit.  I re-probe by adding 1 and AND’ing with the table size minus 1 (better cache behavior for repeated re-probes).  Other design decisions make sense here, depending on your exact usage pattern.
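In code, the probe sequence looks roughly like this (a minimal sketch, assuming a power-of-2 ‘size’):

    // Sketch of the indexing described above: with a power-of-2 table size,
    // 'hash MOD size' becomes a cheap 'hash AND (size-1)', and a re-probe is
    // just 'add 1 and mask again', which keeps repeated probes cache-friendly.
    static int firstProbe( Object key, int size ) { return key.hashCode() & (size-1); }
    static int reprobe   ( int idx,    int size ) { return (idx+1) & (size-1); }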

 

Now the fun stuff: How do I handle the {key,value} pairs?  This is where the state-machine kicks in.  I define states for each interesting key & value.  For keys the states are {null,K} – where K is any key.  A key-slot makes a 1-time transition from null to some K.  For values the states are {null,V/T} – where V is any value and T is a Tombstone, a special token that is not any valid value and represents a deleted key.  A value-slot makes a 1-time transition from null to some V, and can waffle about being different V’s or T (deleted) according to whatever was the last table modification.

 

This gives me 2 key-states and 3 value-states, so the {key,value} pair has 6 states.

  • {null,null} – Empty
  • {K,null} – Partially inserted key/value
  • {K,V} – Fully functional {key,value} pair
  • {K,T} – Previously inserted, now deleted Key
  • {null,V} – partially inserted K/V pair being read out-of-order
  • {null,T} – partially inserted K/T pair being read out-of-order

 

Now here’s the cool thing: it doesn’t matter what order I read or write these bits within each pair, I always get some sane state.  I can act on that state.  This means I do not need to reason about ordering when either inserting or looking up data in the table!  This also means I do not need any memory fencing or any locking!  I do need CAS when changing the table (but not double-CAS, since I can read out-of-order I can write out-of-order as well – hence I do not need to atomically update both words).  Here, then, is the ‘get(Object key)’ code:
 

  Object get( Object K ) {
    idx = hash = K.hash();
    while( true ) {
      idx &= (size-1);
      key = table[2*idx+0]; // key/val kept in even/odd pairs
      val = table[2*idx+1]; // in the large array
      if( key == null ) return null;  // a miss
      if( K == key || key.equals(K) ) // key hit?
        return val == T ? null : val; // a hit (unless deleted)
      idx++;                          // reprobe
    }
  }

I’ve left out some details (e.g. using the hash to avoid doing some key-compares that are doomed to fail), but this is the gist of it.  ‘get’ is extremely lightweight – just the basic hash, lookup and return value for a hit.  More next time on how I do ‘put()’ as that is where the state machine gets a real workout.  But this post is already too long.
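(Aside: one common way to do that hash-check trick, sketched here as a guess rather than a description of my exact code, is to keep the full hashes in a side array and compare them before paying for equals().)

    // Illustrative sketch of the hash-check short-circuit: only call a
    // possibly-expensive equals() when the stored full hash already matches.
    int[] hashes;   // hashes[idx] holds the full hash of the key in slot idx
    boolean keyeq( Object K, Object key, int fullhash, int idx ) {
      return K == key ||                                // cheap pointer check first
        (hashes[idx] == fullhash && key.equals(K));     // equals() only if hashes match
    }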

 

Cliff

 

PS – Slides will be online soon, and the source code as well (need to get the license figured out)

The most common data-race…

Here’s the most common data-race I see:

    if( *p != null ) {
      // other thread does '*p = null'
      (*p)->xxx;
    }

 

This is just my personal observation, both on code I’ve written and code other people have written but I’ve had to debug.  This is the most common race, by far.

 

Annoyingly, the race can be removed by an optimizing compiler – sometimes.  That is, the standard common-subexpression-elimination optimization will remove the redundant loads of ‘*p‘ and thus remove the chance of seeing a null after the 2nd load.  I say “sometimes” because optimizers are never required to optimize, it’s just nice when they do.  The problem is that with modern JVMs you have no control over whether or not this optimization is done, and it’s likely done just when your code gets hot…. i.e., just after a high load hits it.  Up to then, the code is likely running in an interpreted mode or in a ‘fast dumb JIT’ mode, with minimal optimizations.

 

The other problem with this code is that you are normally solving a Large Complex Problem already – and concurrency is your tool to get the needed performance.  In order to solve the Large Complex Problem (LCP) at all you’ve added a few layers of abstraction to hide its complexity – which inadvertently hides the complexity of the concurrent algorithm!  Suppose ‘*p‘ hides some cached code and you have a few accessors to make what that access means (in the context of the LCP) more obvious:

    if( method.has_code() )
      // other thread does 'method.flush()'
      method.get_code().execute();

 

Aha!  Now you (don’t) see the problem… good software engineering of the LCP has obscured the racey concurrent code.  Even worse: ‘method.flush()’ is a rare event, so this race is even rarer… meaning you’ll only crash on a heavy trading day with the VP of engineering breathing down your neck, and never in the development cycle!  Here’s an even more obscure version of the same wrong code:

 

    if( !method.has_code() ) 
      method.set_code(method.compile());
    method.get_code().execute();

 

We just set the code variable, so it’s still set when we read it again to execute it – right?  Wrong!  Or maybe ‘Almost but not quite!’.

 

What’s the fix?  Acknowledging that the concurrent algorithm is just as complex as the LCP (perhaps less code than the LCP but generally more subtle code).  Expose the shared variables and concurrent accesses explicitly.  Make them part of the good software engineering solution that’s already going on for the LCP.

    Code *c = method.get_code(); // Read racey shared variable ONCE
    if( !Method.has_code(c) )    // Pass in Code instead of re-reading
      c = method.set_code(method.compile());
    c.execute();                 // Execute what was read ONCE

 

Cliff

Finding Data Races

Ben Zorn and I had a long talk while working off dinner at CGO yesterday.  One of the things we noticed was how Ben’s recent work on surviving data races and Azul’s work with detecting data races in the Java Collection classes had a lot in common: both came about because data races are killing the average programmer, both run with little overhead and detect only a limited set of data-races, and are easy enough for somebody to hand-code into their code today.

 

I assume the reader knows what a data-race is (2 threads accessing the same memory location with at least one of them writing), and has wrestled with debugging them in the past.  The problem is that nobody really knows how to debug data-races.  There are plenty of academic papers on how to detect them, but the proposed detectors are all unrealistically slow, or don’t work on large real-world programs, or throw a zillion false positives, or are otherwise broken.  (Ok, the REAL problem is that we don’t know the Right Way to do concurrent programming, but that’s a talk for another blog.)

 

So Ben & Azul were both looking for some easy way to detect common data-races – something people could put into practice with little or no effort, and that wouldn’t kill performance and wouldn’t raise a zillion false alarms. 

 

First Azul’s solution: we added a little code to the Java Collections.  You have to install some code on the -Xbootclasspath or run Azul’s JVM with -XX:+UseLockedCollections.  Then the newer Java 4 unlocked collections (e.g., HashMap) run locked – no data-races but no concurrency (or speed) either.  If this fixes your bug, likely you are calling an unlocked collection in a racey fashion.  But where, in your million-line program, are two threads calling into the same HashMap with one of them writing?  Now run with the flag -XX:+UseLockedCollections2.  This time, HashMap tracks when the collection is being both used and modified (by bumping an Atomic) – and throws a Java exception in both threads.  You get stack trace-backs for both threads involved in the data-race.  It’s a really low-cost technique, and although it only catches races in specially annotated code (e.g. HashMap) there are a few high-payoff areas in the Java libraries that seem to be involved in more than their fair share of data-races.
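Something in the spirit of that second flag can be sketched like this (illustrative code only, not Azul’s actual implementation; this simplified version throws only in the thread that notices the overlap):

    import java.util.ConcurrentModificationException;
    import java.util.HashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    // Illustrative sketch: bump a counter around mutations, and complain in
    // any thread that catches the map being read and modified at once.
    class RaceCheckedMap<K,V> extends HashMap<K,V> {
      private final AtomicInteger _writers = new AtomicInteger();
      public V put( K key, V val ) {
        if( _writers.incrementAndGet() != 1 )    // another writer is mid-mutation
          throw new ConcurrentModificationException("racing writers");
        try { return super.put(key,val); }
        finally { _writers.decrementAndGet(); }
      }
      public V get( Object key ) {
        if( _writers.get() != 0 )                // a writer is active while we read
          throw new ConcurrentModificationException("read during write");
        return super.get(key);
      }
    }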

 

Now for Ben’s solution, “ToleRace”, which I am going to paraphrase badly.  The idea here is that this piece of code in front of you, which you have pored over for the last day, is known correct, but some other piece of unknown code is messing up your good code by breaking the rules and accessing the data in a racey way.  Probably somebody just forgot to synchronize.

 

Turns out you can tolerate (hence “tole-race”) most races if you copy-in/copy-out around the locked region.  I.e., at the start of the synchronized region you copy-in all interesting shared variables into local copies (all those things the locked region is going to read anyways).  Then you run the critical section on the local copies.  At the end, right before you unlock, you test to see if any of the shared variables (not your local copies) changed while you held the lock.  If so, then you have detected a data-race – but only in this thread!  The guilty party has escaped (compare this to Azul’s solution where you get stack traces for both threads)!  But all is not lost… by looking at which values your locked code read and which values it wrote, you can often choose to either copy-out your local variable to the original shared variable, or leave the original untouched.  By doing so you can effectively choose to act “as if” your code ran either before or after the other unknown guilty party – instead of simultaneously.  I.e., you have removed the effect of the race, or tolerated it.  Not all races can be tolerated of course, and tracking all the variables becomes really tedious for more than a handful.  Nonetheless it is possible to tolerate some kinds of data-races.
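A crude code sketch of that copy-in/copy-out pattern (my paraphrase again; the field names and the tolerate-by-discarding choice are assumptions, not Ben’s implementation):

    // Illustrative ToleRace-style sketch: copy the shared variable in, run the
    // locked region on the local copy, and check at unlock time whether some
    // unsynchronized writer raced us while we held the lock.
    class ToleRaceSketch {
      int _shared;                        // racily written by some other, buggy code
      private final Object _lock = new Object();
      void increment() {
        synchronized( _lock ) {
          int snapshot = _shared;         // copy-in the shared variable
          int local = snapshot + 1;       // critical section runs on the copy
          if( _shared != snapshot ) {
            // Race detected: tolerate it by acting as if we ran before the
            // racer, i.e. discard our local update and leave _shared alone.
            System.err.println("data race detected and tolerated");
          } else {
            _shared = local;              // copy-out: publish our result
          }
        }
      }
    }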

 

I’m sure Ben is using some kind of tool to auto-insert the right checks (making it rather hard for you to fix up your own code right now), but perhaps you can do something similar to ToleRace or Azul’s -XX:+UseLockedCollections2 if you need to hand-rescue a few synchronized blocks.

 

More on concurrent programming in another blog; it’s just too rich a topic to put down.

 

Cliff