NonBlocking HashTable Source Code

I am please to announce, after loooong delay, the source code to my NonBlocking Hash Table is available as open-source on SourceForge: 

    

http://sourceforge.net/projects/high-scale-lib

 

I’ll be adding to this library over time when I’m not so busy!  Right now I’m porting Java6 to Azul, reviewing OOPSLA papers (19 papers, about 15 pages each, mostly thick academic stuff), and making JavaOne slides.  In fact, I’ll be taking about the NonBlocking Hash Table at JavaOne (slides) this year, along with a couple of other talks.

 

Here’s an interesting (to me) discussion about tiered compilation in HotSpot.  Interesting to me because at Azul we’ve basically forked from the main HotSpot source base on this issue.  Our version of tiered compilation is alive and well, with a tidy performance win pretty much across the board.  Of course, we don’t inline true virtual calls (i.e., megamorphic calls or calls which actually target more than one method at runtime – as opposed to those that can be statically predicted), because our hardware lets us do such calls fairly cheaply.  Inlining the megamorphic call “nails down” the decision to do the Java-level call via the most expensive mechanism (albeit with compiler aided scheduling, which will help Sparc & X86 but not Azul), and nails it down at “server” compile time. 

Since Azul’s tiered compilation is not nailing down the decision to do such calls “the hard way”, if it turns out the call site is really monomorphic we get to do the call via an inline-cache, i.e., really cheap.

 

Cliff

 

PS: I struck out on Wikipedia today, failing to find entries for: megamorphic calls, inline caches, tiered JIT compilation (and several variations on that theme), as well as entries for IBM’s J9 JVM (which I know has tiered compilation).  How many readers of this blog know what an inline-cache is?  (hint: it’s a way to make >95% of virtual Java calls go nearly as fast as plain static calls).

How to stop a compiler

Suppose I want to write a microbench to test something.  Here’s a for-instance:

  // Time sqrt
  long start = System.currentTimeMillis();
  for( int i=0; i<1000000; i++ )
    Math.sqrt(i);
  long stop  = System.currentTimeMillis();
  System.out.println("sqrts/sec="
                     +1000000.0/((double)((stop-start)/1000)));

I run this and get:

  sqrts/sec=Infinity

Urr… what?  Ah!  My fast X86 can do so many sqrts in a second, that “stop-start” is less than 1000 milliseconds, then “(stop-start)/1000” rounds down to zero as integer math.  Try again:

  // Time sqrt, Round 2
  long start = System.currentTimeMillis();
  for( int i=0; i<1000000; i++ )
    Math.sqrt(i);
  long stop  = System.currentTimeMillis();
  System.out.println("sqrts/sec="
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;+1000000.0/<span style="color: rgb(204, 0, 0);">(((double)(stop-start))/1000.0)</span>);

This time I get:

&nbsp; sqrts/sec=2.5E8

Ahhh… a smug smile, that X86 is really fast.  It’s a 2.5Ghz Xeon, so that’s… lets see… (2.5e9 clks/sec) / (2.5e8 sqrts/sec) = 10 clks/sqrt.  Humm… that IS really fast.  Let’s raise the bar a little: let’s try 10 million sqrts instead of 1 million:

&nbsp; // Time sqrt, Round 3
&nbsp; long start = System.currentTimeMillis();
&nbsp; for( int i=0; i&lt;<span style="color: rgb(204, 0, 0);">10000000</span>; i++ )
&nbsp; &nbsp; Math.sqrt(i);
&nbsp; long stop&nbsp; = System.currentTimeMillis();
&nbsp; System.out.println("sqrts/sec="
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;+<span style="color: rgb(204, 0, 0);">10000000.0</span>/(((double)(stop-start))/1000.0));

This time I get:

&nbsp; sqrts/sec=4.5454545454545456E8

Yeek!  TWO alarm bells go off in my head! 

 

First Alarm: 10x more work but only about 2x more time; now I’m down to 5 clks/sqrt!

 

Second Alarm: that repeating fraction result: it tells me I’ve likely got a REALLY small number of milliseconds that I’m dividing.  In fact… it’s 22 milliseconds.  Very suspicious!  Yup, the compiler is tossing out my inner loop as being dead.  To confirm, I’ll switch to 1billion sqrts.  This time I get:

&nbsp; sqrts/sec=3.846153846153846E10

That’s roughly 15 sqrts PER CLOCK CYCLE – not 15 clks/sqrt.  Yup, that’s one speed-demon Xeon.  Or a brainiac compiler.  To defeat the compiler I need to use all the results of the inner loop in the final output.  Here’s one way (and notice I’m back to 1million iterations):

&nbsp; // Time sqrt, Round 4
&nbsp; long start = System.currentTimeMillis();
&nbsp; <span style="color: rgb(204, 0, 0);">double sum = 0.0;</span>&nbsp; 
&nbsp; for( int i=0; i&lt;1000000; i++ )
&nbsp; &nbsp; <span style="color: rgb(204, 0, 0);">sum += </span>Math.sqrt(i);&nbsp; <span style="color: rgb(204, 0, 0);">// use the results in some cheap way</span>
&nbsp; long stop&nbsp; = System.currentTimeMillis();
<span style="color: rgb(204, 0, 0);">&nbsp; if( sum == 1234567.0 )&nbsp; // final usage of ALL results<br>&nbsp; &nbsp; System.out.print("woohoo!"); </span>
&nbsp; System.out.println("sqrts/sec="
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;+1000000.0/(((double)(stop-start))/1000.0));

Now I’m measuring 1 sqrt and 1 double-precision addition per iteration… but I know the add is fairly fast, or least it’s fast relative to square root.  This time I get:

  sqrts/sec=7692307.692307692

Ahhh… about 7.7million sqrts/sec, or 325 clks/sqrt.  I know double-precision add is much less than this, so the extra add does not futz with my result numbers.  And no “woohoo!” in the output.  In fact, it’s vanishingly rare that I’ll get an exact match and get my output polluted.  And I finally got some sane numbers out. 

 

Ok, really I want to time various HashTable.get/put implementations but the basic issues are similar.  I’ve got multi-threaded scaling issues to work out as well.  Meanwhile, here’s my 2002 JavaOne slides on the topic:

http://www.azulsystems.com/events/javaone_2002/microbenchmarks.pdf

 

Cliff

 

P.S. As a result of said testing, Bill Pugh’s extra fences & ordering guarantees do not appear to cost very much, so I’m including them.  The SourceForge project is up but empty; I hope to do an initial check-in later this week.

quick update

I’ve been busy hacking JavaOne slides (lots of good feedback!), and the source (Bill Pugh wants more fences – well, really more ordering guarantees) and getting a SourceForge account setup and getting Azul to agree to a license and my home machine fried it’s harddrive (dead, dead, dead and dead and lots of wonderful chunky sounds when it powers up) and fire-fighting bugs and the usual daily overhead issues, etc, but the worst of those hurdles are behind me.  I hope to have the source up soon!

 

Slides:   http://www.azulsystems.com/events/javaone_2007/index.htm

A Non-Blocking HashTable, Part 2

In a previous blog I talked about my Non-Blocking Hash Table, and how to implement ‘get()’.  This blog will focus on ‘put()’ and all variations.  The hard part will be figuring out how to include state machine diagrams in a blog!  Quick recap: the HashTable is a closed power-of-2 sized table.  The table holds Key/Value pairs in even/odd entries in a large Java Object[] array.  Key slots are limited to States  {null,K} and make a 1-time transition from null to K (when a new Key is inserted).  Value slots are limited to States {null,V/T} where V is any Value, and T is a Tombstone.  Value slots can waffle about according to the last put(); keys are deleted by replacing the value with a Tombstone; keys can be re-inserted by replacing the Tombstone with a real Value again. 

 

Two Key states times three Value states gives me 6 total states:

  1. {null,null} – Empty
  2. {K,null} – Partially inserted key/value
  3. {K,V} – Fully functional {key,value} pair
  4. {K,T} – Previously inserted, now deleted Key
  5. {null,V} – partially inserted K/V pair being read out-of-order
  6. {null,T} – partially inserted K/T pair being read out-of-order

States 5 & 6 are functionally identical, and are only visible to threads doing a get() call where they prefetch the Value slot despite seeing a null Key.  The null key counts as a miss for the get(); the stuff in the Value slot belongs to some not-quite-fully-inserted Key/Value pair – but the get()-thread does not know for which Key!

 

Note that Key slots, once set, can never be UNset – this is required for correctness, lest Thread A tests for K1 in a slot, then Thread B deletes it & inserts K2/V2, and finally Thread A gets around to reading the value slot and reads V2 – and not the now replaced V1.

put() can be broken down into 2 main steps: key-slot claim and value update. 

 

Key-slot Claim:  First up: standard hash, mask-to-table-size, then look at the slot.  If the slot has a Key already and it’s the desired Key – we’re done.  If it’s the wrong key – then just like get() we reprobe.  If we find a null Key slot, then we attempt to CAS the slot from null to Key.  If we succeed, then we’re done.  If not, we reprobe.  If we reprobe too often, we’ll trigger a resize and that’s a subject for a later blog.

&nbsp; Object put( Object K, Object V ) {
&nbsp; &nbsp; idx = hash = K.hash();
&nbsp; &nbsp; while( true ) {
&nbsp; &nbsp;&nbsp; &nbsp;idx &amp;= (size-1);&nbsp; &nbsp;&nbsp; &nbsp;// mask to table size
&nbsp; &nbsp;&nbsp; &nbsp;key = table[2*idx+0]; // key/val kept in even/odd pairs
&nbsp; &nbsp;&nbsp; &nbsp;if( K == key || key.equals(K) ) // key hit?
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; break;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; // Found desired Key slot
&nbsp; &nbsp;&nbsp; &nbsp;if( key == null &amp;&amp;&nbsp; &nbsp; // Found end of key chain 
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; CAS_key(table,idx,null,K) ) // try to claim slot
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; break;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; // CAS worked!&nbsp; We own slot
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; // If the CAS_key fails, then the slot got taken 
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; // by somebody else...
&nbsp; &nbsp;&nbsp; &nbsp;idx++;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; // reprobe
&nbsp; &nbsp;&nbsp; &nbsp;if( idx &gt; limit ) 
&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; return resize_size_next_blog();
&nbsp; &nbsp; }
&nbsp; }

What, exactly, is CAS_key() doing?  It’s implemented via sun.misc.Unsafe.WeakCompareAndSwap().  I’m using the Weak version because I do not need a fence, and at least on Azul the fence costs (roughly the cost of a trip to memory).  On most other platforms the CAS includes an unavoidable fence, hence the CAS includes some unavoidable cost.  The weak version of CAS allows for spurious failure; so my CAS_key code will loop if it thinks the failure is spurious.  i.e., it loops until it succeeds or the value in memory no longer matches the expected value.

 

Value-slot Update: The output of the prior step is that the Key-slot is correct, but the Value-slot is really unknown.  You might think that after a fresh key-claim you could be assured that the value slot is null.  But another thread can leap in and change the null at any time.  Hence value-update has to inspect & go with what it finds there.  As before, I’m using an unfenced CAS that does not spuriously fail (internal looping on spurious failure).

&nbsp; Object old = table[2*idx+1];&nbsp; // key/val kept in even/odd pairs
&nbsp; if( CAS_val(table,idx,old,V) )// Attempt CAS on val slot
&nbsp; &nbsp; return old;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; // Success!&nbsp; Return old value

If the CAS fails, then somebody else must have published another value out from under us.  We can either give up OR retry: if we give up we can claim it is “as if” our CAS succeeded but another thread immediately stomped our update.  The problem with this approach is that we can’t make the OTHER update return OUR value as it’s “old value”.    i.e., if we had truly succeeded and been immediately stomped, the stomper would be returning our value as it’s “old value”.  So instead we’ll retry:

<span style="color: rgb(153, 0, 0);">&nbsp; while( true ) {</span>
&nbsp; &nbsp; Object old = table[2*idx+1];&nbsp; // key/val kept in even/odd pairs
&nbsp; &nbsp; if( CAS_val(table,idx,old,V) )// Attempt CAS on val slot
&nbsp; &nbsp;&nbsp; &nbsp;return old;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; // Success!&nbsp; Return old value
 <span style="color: rgb(204, 0, 0);"> }</span>

What about all those put() variations?  putIfAbsent?  remove?  replace?  Turns out, it’s all the same implementation in the end – we just need to gate the CAS attempt a bit more.  We’ll pass in an extra value which has to match the old value, or the extra value is some sentinel meaning “dont care”.

&nbsp; while( true ) {
&nbsp; &nbsp; Object old = table[2*idx+1];&nbsp; // key/val kept in even/odd pairs
<span style="color: rgb(204, 0, 0);">&nbsp; &nbsp; if( extra != null &amp;&amp; extra != old ) // Extra condition?&nbsp; not met?<br>&nbsp; &nbsp;&nbsp; &nbsp;return null;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp; // Then the put attempt failed!<br></span>&nbsp; &nbsp; if( CAS_val(table,idx,old,V) )// Attempt CAS on val slot
&nbsp; &nbsp;&nbsp; &nbsp;return old;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; &nbsp;&nbsp; // Success!&nbsp; Return old value
&nbsp; }

 

  • remove: New value is really a Tombstone.
  • putIfAbsent: Extra value is a Tombstone.
  • replace: Extra value is the expected oldValue

 

and so forth.  There are only a few more tricks to go in here: things like ‘remove()’ should not insert a missing key, only to turn around and insert a tombstone.  ‘replace()’ gives up on seeing a tombstone unlike ‘putifAbsent()’ which only succeeds on seeing a tombstone (or null).  The returned old value has to have tombstones mapped back to null (the normal “no old value” return result).

Stayed for next posting, where I try to explain resize() in a blog… and probably punt to a real white-paper style presentation.

 

Cliff

state machines & non-blocking algs

Chris Purcell writes:

Re. your talk: State machines permeate my algorithms, too. I suspect it’s because they are easy to do atomically, make the “progress” through an algorithm obvious (simplifying concurrent assistance), and make enumerating all cases (the bane of lock-free algorithms) both explicit and fundamental.

This is absolutely the Right Thing To Do here.  The hardware guys have been using State Machines for years to do concurrent algorithms – in hardware.  That is, they have tool support such that they can write large State Machines, have the description maintained by many people (CVS for State Machines?), automatically tested, executable code generated (Verilog), etc.  I think the concurrent algorithms crowd definitely needs to head down this road as well.

 

Cliff