And now some Hardware Transactional Memory comments…

(sorry for the long gap between postings; work’s gotten interesting and I got busy)

 

I recently attended the Bay Area Workshop on Transactional Memory at Stanford, generously hosted by HP.  Slides are here; my slides are helpfully titled “2009_TMW.pdf”.  At the workshop I gave a brief overview of Azul Systems’ Hardware Transactional Memory support, plus our experiences in using it.  This was followed shortly by a blog posting from David Dice discussing Sun’s HTM support in the Rock processor.

 

For Azul Systems certainly (and to a lesser extent Rock), the name of the game is throughput: we appear to be generously over-provisioned with bandwidth; we can sustain 30G/sec allocation on 600G heaps with max pause times on the order of tens of milliseconds (and unlike IBM’s Metronome hard-real-time collector, our MMU past 20msec is 0.99; see the MMU definition at bottom).  Each of our 864 CPUs can sustain 2 cache-missing memory ops (plus a bunch of prefetches); a busy box will see 2300+ outstanding memory references at any time.  We have a light microkernel-style OS; we can easily handle 100K runnable threads (not just blocked ones).  Our JVM & GC scale easily to the whole box.  In short: the bottleneck is NOT the platform.  We need our users to be able to write scalable concurrent code.

 

The goal of our HTM has always been to accelerate “dusty deck” Java programs via Lock Elision.  (We’ve never been tempted to enter the language wars and implement some form of full-blown transactional programming via an “atomic” keyword.)  We see a lot of defensive uses of the “synchronized” keyword: legacy libraries using “sync” on all methods, or code maintainers sprinkling “sync” about liberally in order to squash some bug.  In practice, 99% of all locks are never contended – and these we run very fast (our CAS & FENCE instructions can hit in cache; an uncontended lock requires only a few clock cycles).  But the locks that are contended prevent parallel execution (Amdahl’s Law and all that), and we observed that much of the time the contention is on the lock itself and not on the data it protects.  So we added HTM support to our first processors, and this support has been shipping and turned on at all customer sites for over 4 years now.  By now we have a lot of field experience with using HTM.
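
(For the flavor of it, here’s a made-up legacy-style class in the java.util.Hashtable mold, with every method synchronized whether or not any call is ever contended – exactly the sort of lock that elision targets:)

// Hypothetical “defensive sync” legacy class: every method takes the
// monitor, though in practice 99% of calls are never contended.
final class LegacyCounterTable {
    private final java.util.HashMap<String, Integer> map =
        new java.util.HashMap<String, Integer>();

    public synchronized void put(String k, int v) { map.put(k, v); }
    public synchronized Integer get(String k)     { return map.get(k); }
    public synchronized int size()                { return map.size(); }
}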

 

But first, a brief overview of our HTM support: we use a few extra bits on each L1 cache line to track “speculatively read” and “speculatively written” cache lines.  Each CPU can be put in a “speculate” mode; loads and stores then set these bits accordingly.  If a speculative line is evicted from the L1 for any reason then the transaction “aborts” (there is also an explicit “abort” instruction).  On an abort, speculatively-written lines are marked invalid (meaning they no longer cache any data), speculatively-read lines have their spec-bits cleared, and the CPU branches to a fixed trap vector for software fixup.

 

If the CPU makes it to a “commit” instruction then the spec bits are cleared – including the speculative-write bits – which atomically makes all those writes visible to other CPUs. In other words, the XTN commits.

 

Nothing aborts a transaction except the “abort” op or losing a speculative line out of the L1.  We do not abort on function calls/returns, nor TLB misses, nor exceptions, nor rain nor snow nor sleet nor dark of night…  Our XTNs are therefore limited by the size and associativity of the 16K 4-way L1 cache (and the shared inclusive L2).  We routinely see successful XTNs of many thousands of instructions with many hundreds of cache-hitting memory ops.  (Contrast this with what I can figure out about Rock: it appears to abort on function call/return, on running out of store-buffer depth (which limits an XTN to 16 stores?), on TLB misses, on L1 associativity evictions, and on common nested locking patterns (because of the abort on function calls).)

 

Software heuristics determine when to try the HTM (uncontended locks are much cheaper using a cache-hitting CAS).  The software measures the success/fail ratio of the HTM on contended locks and determines when to flip to a standard blocking implementation and when to use the HTM.
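
To make the shape of that concrete, here’s a minimal sketch of such a monitor-enter path.  The Speculation “intrinsics”, the retry limit, and the profitability test are all hypothetical stand-ins (stubbed out so the sketch compiles); none of this is our actual VM code:

// Hypothetical HTM-based lock elision with a blocking fallback.
final class ElidableLock {
    private static final int RETRY_LIMIT = 4;   // assumed tuning knob
    private int wins, fails;                    // rough HTM success/fail counts

    void withLock(Runnable criticalSection) {
        if (htmLooksProfitable()) {
            for (int i = 0; i < RETRY_LIMIT; i++) {
                if (Speculation.begin()) {      // CPU enters "speculate" mode
                    criticalSection.run();      // loads/stores tracked via L1 spec bits
                    Speculation.commit();       // spec writes become visible atomically
                    wins++;
                    return;
                }
                fails++;                        // abort path lands us here; maybe retry
            }
        }
        synchronized (this) {                   // unprofitable or kept aborting: really lock
            criticalSection.run();
        }
    }

    private boolean htmLooksProfitable() {
        return fails < 4 * (wins + 1);          // made-up success/fail ratio test
    }

    // Stub standing in for the hardware speculate/commit instructions;
    // it always "fails", so on plain hardware we just take the lock.
    private static final class Speculation {
        static boolean begin()  { return false; }
        static void    commit() { }
    }
}

On real HTM hardware, begin() would return true on entry to speculation and control would reappear at the failure path (via the trap vector) on any conflict or capacity eviction.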

 

While some apps get a nice 2x speedup (e.g. Trade6), most apps speed up only slightly (we had lots of teething problems with the software heuristic, which used to cause 10-20% slowdowns due to endless fail/retry loops, although these all appear to be fixed now).  We nearly always see a handful of locks using the HTM support, but these only rarely appear to be the “right” locks for getting more CPUs really busy.  HTM failure nearly always appears to be due to conflict and not capacity.

 

Issues

 

In short, users don’t write “TM-friendly” code.  Neither do library writers.  For example, the “modCount” field in Hashtable is bumped on every update to the Hashtable.  For a large Hashtable we can expect most modifications to land in unrelated portions of the table; i.e., without the modCount field update we would expect the HTM to allow parallel ‘put’ operations.  With the modCount field update, ‘put’ operations always have a true data conflict – and this aborts the HTM.  After some number of failed ‘put’ operations we have to start really locking the Hashtable (to allow the puts to make forward progress), and this now single-threads the ‘get’ operations as well.  This general pattern of updates to peripheral shared values (usually performance counters of one sort or another) is very common, and it kills the HTM – because it presents a true data conflict.
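
Here’s the shape of the problem as a stripped-down, hypothetical table (not the real Hashtable code): two puts to different buckets touch disjoint memory, except that both must write the same counter field:

// Two speculative put()s to different buckets would be invisible to each
// other, except that both bump 'modCount': a guaranteed true data conflict.
final class TinyTable {
    private final Object[] buckets = new Object[1024];
    private int modCount;                         // shared across ALL updates

    public synchronized void put(int hash, Object v) {
        buckets[hash & (buckets.length - 1)] = v; // per-bucket write: usually disjoint
        modCount++;                               // same line for every put -> HTM abort
    }
}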

 

Many times a small rewrite to remove the conflict makes the HTM useful.  But this blows the “dusty deck” story – people just want their old code to run faster.  The hard part here is getting customers to accept that a code rewrite is needed.  Once they are over that mental hump, once a code rewrite is “on the table”, the customers go whole-hog.  Why make the code XTN-friendly when they can make it lock-friendly as well, and have it run fine on all gear (and not just HTM-enabled gear)?  Also, locks have well-understood performance characteristics, unlike TMs, which generally rely on a complex and not-well-understood runtime portion (and indeed all the STMs out there have wildly varying “sweet spots”, such that code which performs well on one STM might be unusably slow on another).

 

Really, what the customers want to know is: “which locks do I need to ‘crack’ to get performance?”  Once they have that answer they are ready and willing to write fine-grained locking code.  And nearly always the fine-grained locking is a very simple step up in complexity over what they had before.  It’s not that they need to write some uber-hard-to-maintain code to get performance; it’s that they have no clue which locks need to be “cracked” to get a speedup, and once that’s pointed out the fixes are generally straightforward (e.g., replacing a synchronized HashMap with ConcurrentHashMap, striping a lock, reducing hold times (generally via caching), switching to AtomicXXX::increment, etc.).
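
For instance, the ConcurrentHashMap-plus-AtomicXXX rewrite is usually about this big (a hypothetical before/after; the class and field names are made up):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// "Cracking" one lock: this was a synchronized HashMap plus a long hit
// counter behind a single monitor; now it's a ConcurrentHashMap (striped
// internally) and a lock-free AtomicLong.
final class SessionCache {
    private final ConcurrentHashMap<String, Object> sessions =
        new ConcurrentHashMap<String, Object>();
    private final AtomicLong hits = new AtomicLong();

    public Object lookup(String id) {
        hits.incrementAndGet();       // replaces a synchronized 'hits++'
        return sessions.get(id);      // no global lock held
    }

    public void install(String id, Object session) {
        sessions.put(id, session);
    }
}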

 

 

Summary

  • Very modest gains for HTM; usually <10% (although 2x for some large apps)
  • Rewrites of “dumb” direct contention (e.g. perf counters) would help a lot
  • Azul did this for some common JDK pieces Out There (e.g. Hashtable), but there’s just too much of it
  • Customers not willing in general to touch code for HTM
    • If code is being hacked for performance, it will be with fine-grained locking instead of simply making it “HTM friendly”. 

 

Thanks,
Cliff

 

[MMU is a common measure of GC performance and stands for “minimum mutator utilization”.  It shows the worst-case amount of time a worker thread gets for a timeslice of a given length versus time spent in GC.  Thus an MMU graph plots utilization (a number from 0 to 1, where 1 is better) versus timeslice size.  A program using Metronome might see a max GC pause of 1 microsecond; its MMU at 1 microsecond is then zero, because the mutator can do no work for that microsecond.  However, the MMU at 1 second might be 70%, showing that in the long run Metronome’s GC exacts a 30% performance “tax” on mutator operations.  Contrast this to Azul’s Pauseless GC: at 10msec our MMU is zero, but at 1sec our MMU is 0.99; in the long run there is no GC “tax”, but in the short run our max pauses are much larger than Metronome’s.  Metronome is suitable for e.g. hard-real-time applications like audio playback, whereas Azul Systems’ GC is suitable for e.g. soft-real-time transactional apps with high throughput.]
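
In symbols (my notation, not anyone’s standard), the MMU at window size τ is the worst-case fraction of any window of that length that the mutator gets to run:

\mathrm{MMU}(\tau) \;=\; \min_{t} \frac{\text{mutator (non-GC) time in } [\,t,\, t+\tau\,]}{\tau}

So a single GC pause of length p forces MMU(τ) = 0 for every window τ ≤ p, which is exactly the Metronome-vs-Pauseless contrast above.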