Clojure: STMs vs Locks

I’ve been participating in this fascinating discussion with Rich Hickey, the author of Clojure, about Software Transactional Memory.  I decided to clean up and echo the conversation here, but the original can be found here.


The Problem Statement: it’s not “atomic-vs-lock” but instead it’s “where do we put atomic/lock?”

Randall R Schulz

  I found the following comment on an Azul Systems blog (http://blogs.azulsystems.com/cliff/2008/05/javaone.html):


Cliff Click wrote:   

    Performance in an STM is not transparent: it requires a runtime to do abort/retry, and as the runtimes get fancy & complex the behavior under load of STM’s gets…. weird and unpredictable.  Not something you can put into a production environment.   

  Any comments or counter-examples to this?

Raoul Duke

Rich Hickey, richhic…@gmail.com wrote:

  Are explicit locking designs free from such anomalies? I wouldn’t think so.

paraphrase: the behaviour under load gets weird and tricky.


well, in some ways maybe that doesn’t happen with locks: with locks you can at least generally quickly see that a given lock has 90% of all threads waiting on it, that kind of thing, whereas (depending on which STM system we’re looking at) you don’t know what variable in the venn diagram of overlapping transactions caused the retry?


though i believe Rich previously said that Clojure would actually have a way of finding that out!
http://groups.google.com/group/clojure/browse_thread/thread/2f73b80fd


Cliff Click

You got it – under load, performance is unpredictable.


With locks, you can see which locks are loaded, generally figure out why, and plan a strategy (stripe the lock, shorten the hold time, switch to the j.u.concurrent data structures, etc.).


Not so (at least not yet) with STM.  I’ve talked to a few people who’ve tried STM out at a larger scale than the usual academic papers – and the results are downright disappointing.  Unless you become an expert in the STM runtime you’re using (where “expert” means “not just grok it, but can build & tweak it”), anytime performance is not good enough – you get stuck.  This is an open research problem at best, with no clear indication that the problem can be solved at all.


Rich Hickey

One could as generically argue that systems with manual locking are usually broken, and therefore their behavior itself, never mind their performance, is unpredictable.  That’s been my experience with manual locking in the hands of mere mortal developers.  It can be as difficult to understand the behavior of any single manually-concurrent program as to understand the performance characteristics of an STM.  In the latter case, at least the STM is handling correctness for you, and all users of the same STM can share knowledge (and bug fixes and performance improvements), vs each application being its own Gordian knot.  And in any case in which the STM is failing you performance-wise, you can opt out and use manual locks outside of transactions.  To the extent you can use it, STM can drastically reduce the complexity of a system.


I’m wary of unifying the discussion of concurrency with one about performance, as the problems of concurrency start with correctness, where the manual locking story, due to its complexity, is quite bad.  Scalability is not a universal problem – some systems need to handle thousands of simultaneous connections – most don’t, some need to leverage hundreds of CPUs in a single application – most don’t.  But every application that wants to leverage multiple cores needs to be correct.


It would be no problem to instrument the Clojure STM references with fault counters that would allow one to know the exact per-reference contention level. Once known, the answers in both cases are similar – share less mutable state, don’t have both long and short-lived transactions contend for the same data, check your success conditions ASAP etc.
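
Even without built-in counters, the fact that a transaction body re-runs on every retry gives you a crude contention probe today.  A minimal sketch, assuming nothing beyond stock refs and atoms (the ref names and the counted-transfer helper are mine, purely illustrative):

    ;; Side effects on an atom are NOT rolled back on retry, so counting
    ;; executions of the transaction body counts attempts.
    (def from (ref 100))
    (def to   (ref 0))

    (defn counted-transfer [m]
      (let [attempts (atom 0)]
        (dosync
          (swap! attempts inc)
          (alter from - m)
          (alter to   + m))
        (dec @attempts)))   ; number of retries for this call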


STM designs differ from one another quite a bit, so any general statements are difficult to assess.  I think the level of granularity of the STM comes into play.  Most of the STM papers I’ve read concentrate on creating lots of transactional cells with which they envision constructing concurrent data structures like the ones in java.util.concurrent.  I don’t think that’s a good idea at all.


Clojure encourages the use of immutable persistent data structures, which need no locking or coordination whatsoever in order to perform correctly in a concurrent situation, and STM references to support the appearance of identity-preserving change.  It also encourages the use of java.util.concurrent and the like for caches and work queues – the best tools for the job.
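
For concreteness, a minimal sketch of both halves of that design (all names mine):

    ;; Persistent data structures: assoc returns a new map and leaves the
    ;; original untouched, so concurrent readers need no locks at all.
    (def m  {:a 1})
    (def m2 (assoc m :b 2))   ; m is still {:a 1}

    ;; A Ref supplies identity over a succession of those immutable values.
    (def account (ref {:balance 100}))
    (dosync (alter account update-in [:balance] + 50))
    @account   ;=> {:balance 150}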


As far as I know, Clojure is the first STM that uses snapshot MVCC, avoiding both read logging and transaction restarts due to read invalidations.  Clojure’s STM also gives you the ability to ‘ensure’ a reference you later intend to read or write, thus providing some manual control over resource acquisition order (rather than just relying on the side effect of access).  It also supports explicit commute, further reducing retries for commutative writes.  I’ll not claim its performance characteristics are at all proven, but its contention due to, and for, reading is lower than in programs that use manual locking, and lower than in STMs in which write transactions can cause read transactions to retry.
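
A sketch of those two features against a hypothetical two-account setup (the account names and the invariant are mine):

    (def checking (ref 100))
    (def savings  (ref 100))

    ;; 'ensure' pins savings at this transaction's snapshot even though we
    ;; never write it, so a concurrent writer can't break the cross-account
    ;; invariant out from under us (the classic write-skew case).
    (defn withdraw [amount]
      (dosync
        (when (>= (+ (ensure savings) @checking) amount)
          (alter checking - amount))))

    ;; 'commute' declares an update order-independent, so concurrent
    ;; increments commit without forcing each other to retry.
    (def audit-count (ref 0))
    (dosync (commute audit-count inc))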


Once you get to record-level STM granularity, like Clojure’s, it’s also a bit easier to draw parallels to the performance characteristics of the database systems it mimics, which have been doing similar things for decades.


I don’t consider STM performance any more or less of a research problem than determining the correctness of programs that do manual locking – there’s still work to do.


And of course, neither STM, nor any other mechanism, is a panacea.

Brett Morgan

Given that I am a mere developer – actually, with MT I’d consider myself a rank newbie – the most important thing to me is visibility.  I want to be able to see what problems I am creating with my design, so that I can change my design and see how the problems change.  In effect, I am looking for an environment that teaches me what I am doing wrong, if only by showing me which references are hotly contended.


In keeping with that, it would probably be helpful for both refs and transactions to be nameable, such that debugging output from the STM runtime can actually be easily inspected.  The reason I say this is that the data I need to work over is non-uniform, so I need to be able to label the hot reference to know which particular chunks of my data set are causing issues.
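
Refs can already carry metadata, which gets partway there – a minimal sketch (the :name convention is mine; nothing in the STM runtime reports on it today):

    (def checking (ref 100 :meta {:name :acct/checking}))
    (def savings  (ref 100 :meta {:name :acct/savings}))

    ;; Debugging output can now label the hot reference.
    (:name (meta checking))   ;=> :acct/checking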


thoughts?

Cliff Click

The problem with manual locking – and it applies equally well to transactions – is that there’s no name-space control over what needs to be locked/transacted.  The failure people have with locks (and limited experience with transactions) is that they can’t properly name their shared variables.  They think they can, until we start looking hard, and then we keep finding more… and more that need to be locked/transacted atomically.  In short, people don’t know where to put the lock/unlock or where to bound the transactional region.


Having said that, locks have some advantages over transactions: transparent performance being one of them.  Programs are busted if the locks are wrong (or transaction regions don’t cover all the variables they need to; Clojure isn’t immune here) AND they are busted if performance sucks.  For 2-4 way machines the issue is probably moot; for 8-32 way machines (standard servers these days) – it matters.  In a year every el-cheapo linux blade server out there will be looking at 16-64 cpus.  And of course for Azul, 100+ active cpus is the norm.  I claim that in a year, most people will either care (get bit by lack of scaling), or see that they’ll need to care within another year.


Requiring shared variables to have a special syntax, ala Ref, is a big step forward here for both locks & transactions.  You’re reducing the complexity of the problem via name-space control (only Ref‘s are shared) and promoting immutable shared state.  The STM issue is orthogonal.


As evidence of that, suppose you replace your STM implementation with a trivial single global lock.  Are the programs more or less complex than their locking counterparts?  The program is as correct and as complex as it was before you swapped the STM implementation, and indeed is exactly equal to the locking solution using a single global lock.  Suppose I can get a correct program using STM – then I also have a correct program using a single global lock.  STM only varies from locking as I try to make the implementation more performant: for the STM, I trust the language implementor to “stripe” the locks under the hood.  For locks, I start using more than one lock.  In both cases I can also try to shrink the lock-hold-time (or reduce the size of the STM region), or switch algorithms.


Summary: STM doesn’t solve the Big Problem (lack of name-space control).  Ref‘s might solve the Big Problem – and solving the Big Problem will be a big help to both locks & STM.  Locks have transparent performance and are thus easy to optimize.  Optimizing without good name-space control is a buggy process for both locks & STM.  I don’t know how to optimize an STM implementation without becoming an expert in the particulars of the STM runtime (and there are so many to choose from!)

Rich Hickey

I agree that identifying the things that can change is one key issue, and Clojure’s Refs, Agents, and Vars all do that.  But I think a large number of your arguments reflect a lack of understanding of how (at least Clojure’s) STM works.


Refs are not merely markers, they do enforcement.  There is simply no mutating a ref outside of a transaction – it’s not an allowed mistake.  Any attempt to change a ref outside a transaction throws an exception.  So, there is no such thing as “transaction regions don’t cover all the variables they need to” in Clojure.
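
A quick illustration of that enforcement (a sketch; current Clojure throws IllegalStateException here):

    (def balance (ref 0))

    (alter balance + 10)
    ;; => IllegalStateException: No transaction running

    (dosync (alter balance + 10))   ; fine inside a transaction
    @balance   ;=> 10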


Your thought exercise with a single lock is a good one.  Let’s say Clojure’s implementation of STM used a single global lock (in fact it has no global locks at all), and contrast it with a manual locking program that used a single lock for everything.


Which is more correct?  The Clojure one, because it is not subject to undetected ‘forgetting to obtain the lock’.  Your assertion that “Suppose I can get a correct program using STM – then I also have a correct program using a single global lock” implies nothing about programs written with a single manual lock where no STM system is helping ensure correctness.  All Clojure STM programs (that don’t throw Ref mutation out-of-transaction exceptions) are correct with a global lock, but not all programs using a global lock (that don’t throw exceptions) are correct STM (or otherwise) programs.


Which one has better performance?  Constant factors aside, potentially the Clojure one, because readers do not wait for writers in MVCC.
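
A small sketch of what that buys you (names mine): every read in a transaction comes from one consistent snapshot, so a reader never blocks the writer and never sees a half-committed update.

    (def x (ref 0))
    (def y (ref 0))

    ;; The writer keeps x and y equal, atomically.
    (future (dotimes [_ 1000] (dosync (alter x inc) (alter y inc))))

    ;; The reader sees both refs as of the same snapshot, so the
    ;; invariant x == y holds even while commits are in flight.
    (dosync (assert (= @x @y)))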


As we add locks, performance of both approaches improves, but a new source of errors comes into play for the manual approach – lock acquisition order and the resultant deadlock risk.  At this point STM becomes dramatically less complex than manual locking.
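
The classic shape of that new error class, sketched with plain JVM monitors via Clojure’s locking macro (illustrative only):

    (def a (Object.))
    (def b (Object.))

    ;; Two threads take the same two locks in opposite order; with unlucky
    ;; timing each holds one lock and waits forever for the other.
    (future (locking a (Thread/sleep 100) (locking b :done)))
    (future (locking b (Thread/sleep 100) (locking a :done)))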


I’ll grant that it is likely that, in the hands of experts, it is possible to write a correct program with a custom crafted locking strategy that will outperform any general-purpose STM strategy.  But I still contend that, for any moderately complex program, that strategy will approach the complexity of an STM, and that most people will get it wrong in subtle, intractable ways.


Analogies to databases are telling.  DBMS’s took over locking long ago and most people were happy to have them do so.  Ditto memory management and GC.  In each case, people claimed superior control and performance in the manual solution, but the cost and complexity proved too high for most applications.


So, mutable entity identification is a problem, but it is a subproblem of correctness enforcement.  It is likely that, given well-identified sharable entities, some form of static correctness analysis might be brought to bear on the manual locking case, and I’m sure people are working on that.  STMs can provide correctness enforcement today, and therefore the two are not equivalent.  Any opacity in STM operation or performance is an area for improvement, not necessarily a reason for avoidance.

Cliff Click

On May 24, 10:51 am, Rich Hickey wrote:

I agree that identifying the things that can change is one key issue, and Clojure’s Refs, Agents, and Vars all do that.  But I think a large number of your arguments reflect a lack of understanding of how (at least Clojure’s) STM works. Refs are not merely markers, they do enforcement.  There is simply no mutating a ref outside of a transaction – it’s not an allowed mistake.  Any attempt to change a ref outside a transaction throws an exception.  So, there is no such thing as “transaction regions don’t cover all the variables they need to” in Clojure.

Alas – there still is.   🙁
Example down below, and this is my primary complaint against both locks & STM.  However, your other arguments are more quickly answered – mostly because I agree with you.

Your thought exercise with a single lock is a good one.  Let’s say Clojure’s implementation of STM used a single global lock (in fact it has no global locks at all), and contrast it with a manual locking program that used a single lock for everything.

Which is more correct? The Clojure one, because it is not subject to undetected ‘forgetting to obtain the lock’. Your assertion that “Suppose I can get a correct program using STM – then I also have a correct program using a single global lock” implies nothing about programs written with a single manual lock where no STM system is helping ensure correctness.  All Clojure STM programs (that don’t throw Ref mutation out-of-transaction exceptions) are correct with a global lock, but not all programs using a global lock (that don’t throw exceptions) are correct STM (or otherwise) programs.

Good one.  I agree.  Clojure wins this round.

Which one has better performance?  Constant factors aside, potentially the Clojure one, because readers do not wait for writers in MVCC.

Not really relevant; nobody writes the ‘single global lock’ version, including Clojure.  It was just a thought exercise.

As we add locks, performance of both approaches improves, but a new source of errors comes into play for the manual approach – lock acquisition order and the resultant deadlock risk.  At this point STM becomes dramatically less complex than manual locking.

Not interesting in practice.  Because in practice, deadlocks are easy to find & fix.  You get a crash dump, crawl the stacks, discover the lock cycle & reorganize.  Sure the situation could be better, but deadlocks are a ‘crash once’ kind of bug.  Dev’s don’t like ’em, but they don’t lose sleep over them either.

I’ll grant that it is likely that, in the hands of experts, it is possible to write a correct program with a custom crafted locking strategy that will outperform any general-purpose STM strategy.  But I still contend that, for any moderately complex program, that strategy will approach the complexity of an STM, and that most people will get it wrong in subtle, intractable ways.

Analogies to databases are telling.  DBMS’s took over locking long ago and most people were happy to have them do so.  Ditto memory management and GC.  In each case, people claimed superior control and performance in the manual solution, but the cost and complexity proved too high for most applications.

Actually, I was thinking of the compiler-vs-ASM analogy that took place in the 60’s & 70’s.  But point well taken: I agree that we need some kind of support here.  Maybe in the very long run STM runtimes get better & STM does win.  Meanwhile life sucks.  If STM can solve the problem I’ll point out below, then I’ll agree with you, and maybe even try my hand at an STM implementation.

So, mutable entity identification is a problem, but it is a subproblem of correctness enforcement. It is likely that, given well-identified sharable entities, some form of static correctness analysis might be brought to bear on the manual locking case, and I’m sure people are working on that. STMs can provide correctness enforcement today, and therefore the two are not equivalent. Any opacity in STM operation or performance is an area for improvement, not necessarily a reason for avoidance.

Ok, the long-sought-after counterexample: STM’s do not compose w/ correctness.  Bad Java syntax follows, because I don’t ‘speak’ Clojure.  I’m using the classic example.

    class Account {
      long _money;
      // 'atomic' here stands in for an STM transaction – pseudocode, not real Java
      void add( long m ) { atomic { _money = _money + m; } }
    }

    void transfer( Account a1, Account a2, long m ) {
      a1.add( m);
      a2.add(-m);
    }

While add alone is STM’d & safe, the two calls to add that need to be wrapped in an atomic are not – so there is a race where the accounts do not balance.  Deciding at what level to add the next atomic (should it be around transfer or at a higher level or not at all?) demands the programmer have domain knowledge not already encapsulated in the Account class.  THIS is the source of vast numbers of bugs.

Less trivial examples quickly spiral out of control.  Sure I know Ref‘s A, B & C are all Ref‘s and thus only updated in dosync blocks – but can I split any collection of Ref updates into more than one atomic region?  The ad-absurdum answer is to wrap a dosync around all Ref‘s – which means the whole program, and that means a single-threaded program.  Imagine the program unrolled in front of you, as an endless sequence of actions (all control flow flattened out)… interspersed with other actions are updates to Ref‘s.  Now: dissect this list into groups with atomic or lock as you see fit, preserving the program meaning as another unrolled thread of execution is interspersed.  How do you decide where it’s OK to split the stream and where it’s not OK? Why, in the Grand Scheme of things is:

  "... { read _money ... write _money } ... "

a correct splitting of the action stream, while

  "... { read _money ... write _money }  ... { read _money ... write _money } ..." 

is broken, but

  "... {{ read _money ... write _money } ... { read _money ... write _money }} ..." 

is correct again?


A Brief Digression into the Usefulness of Deadlock Avoidance

Phil Jordan

I’m new to STM, I’ve stumbled into it after doing some explicit/low-level lock-free programming and systems that are synchronised classically with mutex/semaphore-based locking.  I especially don’t know what goes on under the hood in Clojure’s STM implementation, or how powerful the JVM is in terms of memory guarantees.

I’m chipping in out of interest and to improve my understanding, apologies if I sound like an idiot:

Cliff Click wrote:

  On May 24, 10:51 am, Rich Hickey wrote:

    As we add locks, performance of both approaches improves, but a new source of errors comes into play for the manual approach – lock  acquisition order and the resultant deadlock risk. At this point STM becomes dramatically less complex than manual locking.   

  Not interesting *in practice*.  Because in practice, deadlocks are easy to find & fix.

From personal experience, this is far from the truth in complex systems.  Deadlocks that happen only on systems “in the wild”, appearing in the form of heisenbugs, etc. – not fun at all.  There’s too much in the way of implicit contracts going on, which blows up in your face if you’re trying to extend undocumented software that was written by someone who left the company before you arrived.  (and we all know the “well then document it” approach just doesn’t happen in practice)

Maybe it’s just the implementations I’ve used (pthreads, Win32, OpenMP) and others give you higher-level diagnostics, but at the level I was working it could get very painful.

You get a crash dump, crawl the stacks, discover the lock cycle & reorganize. Sure the situation could be better, but deadlocks are a ‘crash once’ kind of bug.

You need a reasonable amount of infrastructure in place for that, plus you’re relying on absence of evidence rather than proof that it can’t deadlock.

Dev’s don’t like ’em, but they don’t lose sleep over them either.

The people who lose sleep over software quality are probably the kind who try to avoid complex locking schemes like the plague in the first place. 🙂

  Bad Java syntax follows, because I don’t ‘speak’ Clojure.  I’m using the classic example.

    class Account {
      long _money;
      // 'atomic' here stands in for an STM transaction – pseudocode, not real Java
      void add( long m ) { atomic { _money = _money + m; } }
    }

    void transfer( Account a1, Account a2, long m ) {
      a1.add( m);
      a2.add(-m);
    }

While add alone is STM’d & safe, the two calls to add that need to be wrapped in an atomic are not – so there is a race where the accounts do not balance.  Deciding at what level to add the next atomic (should it be around transfer or at a higher level or not at all?) demands the programmer have domain knowledge not already encapsulated in the Account class.  THIS is the source of vast numbers of bugs.

My understanding is that this is exactly the kind of situation where STM excels: you wrap the two add calls in a transaction rather than making them individually atomic.  The way Clojure handles this (I’ve been spending 99.9% of my time in Clojure on non-threaded things, so I could easily have missed something) is that your _money would be a ref, and any attempt at modifying it will fail unless you’re in a transaction.  Wrapping the transaction around just the add would be the anti-pattern here; you want to make the whole transfer a transaction.

Less trivial examples quickly spiral out of control.  …

Okay, you kind of lost me with what you’re trying to say here.

I get the impression you’re mixing up atomic operations on the low level with a high-level STM implementation like Clojure’s.  The latter presumably won’t be efficient unless the implementation uses the former, but my understanding is that the programmer using an STM implementation is largely supposed to be isolated from such details, as long as he/she is able to determine what operations should be wrapped in a single transaction.

Cliff Click

I’m new to STM, I’ve stumbled into it after doing some explicit/low-level lock-free programming and systems that are synchronised classically with mutex/semaphore-based locking.  I especially don’t know what goes on under the hood in Clojure’s STM implementation, or how powerful the JVM is in terms of memory guarantees.

I’m chipping in out of interest and to improve my understanding,

Let’s see if I can do you some justice…

  From personal experience, this is far from the truth in complex systems.  Deadlocks that happen only on systems “in the wild”, appearing in the form of heisenbugs, etc. – not fun at all.  There’s too much in the way of implicit contracts going on, which blows up in your face if you’re trying to extend undocumented software that was written by someone who left the company before you arrived.

Yup.  So the deadlock happened.  Ouch.

  (and we all know the “well then document it” approach just doesn’t happen in practice)

Yup, but you could make a difference.  HotSpot probably has the highest level of ‘asserts’ per line of code (and a pretty darned high level of comments per line of code) of anything Out There.  Docs are cheaper than debugging.  But it’s an aside…. back to deadlocks-in-practice:

  Maybe it’s just the implementations I’ve used (pthreads, Win32, OpenMP) and others give you higher-level diagnostics, but at the level I was working it could get very painful.   

    You get a crash dump, crawl the stacks, discover the lock cycle & reorganize.   

  You need a reasonable amount of infrastructure in place for that,

Crash dump?  Core file?  ‘ctrl-backslash’ to a JVM to get a thread-dump?
This stuff is commonly available.

If you don’t have a proper tool chain, then Job One for you should be to go get one – and I’m very serious here.  Drop everything else until you get a functional ‘path’ (protocol, implementation, business process, etc) that allows you to debug a crash from a customer in the field.  You might need a willing beta-partner for this, hand him a broken program & let it crash, figure out he needs disk-space to capture the core & logs, needs training to hit ‘ctrl-backslash’ when suspecting a deadlock, etc – plus FTP the files back to Home Base, plus you need to be able to capture the files per-customer-per-bug (ala Bugzilla), then decrypt the file (gdb for core, eyeballs or a little perl for stack-trace logs), etc, etc, etc, ….  No Rocket Science, but annoying bits of Engineering along with some Customer Management.

  plus you’re relying on absence of evidence rather than proof that it can’t deadlock.

Err…. No.

I’m not shooting for perfection here; I’m shooting for “good enough to ship”.  Which means that if – In Practice – each possible deadlock happens once, then gets fixed – that’s almost always Good Enough.  Maybe not to run a nuclear reactor, but definitely good enough to run large businesses.  In practice, most possible deadlocks never happen – but a few that do require several go-’rounds before getting fixed properly.  And then the deadlock is fixed, and remains fixed.


    Dev’s don’t like ’em, but they don’t lose sleep over them either.   

  The people who lose sleep over software quality are probably the kind who try to avoid complex locking schemes like the plague in the first place. 🙂

Yup… and those folks are generally stuck anyways.  But if they are thinking about the problem ahead of time they are going to be way way ahead in the long run.

  My understanding is that this is exactly the kind of situation where STM excels: you wrap the two add calls in a transaction rather than making them individually atomic.  The way Clojure handles this (I’ve been spending 99.9% of my time in Clojure on non-threaded things, so I could easily have missed something) is that your _money would be a Ref, and any attempt at modifying it will fail unless you’re in a transaction.  Wrapping the transaction around just the add would be the anti-pattern here; you want to make the whole transfer a transaction.

…   

Okay, you kind of lost me with what you’re trying to say here.


I Restate My Argument

Cliff Click

Trivial examples are…. trivial.  They can be fixed in a thousand obvious ways.  We need to extrapolate to the non-trivial, because that’s the only place where the STM-vs-Locks argument becomes interesting.  So let’s pretend that instead of a single Ref _money and two classes, I’ve got 500 Ref‘s and a million lines of code – ala HotSpot (Sun’s JVM).  Easily >500 shared concurrent variables, about 750KLOC.  About ~100 unique locks (some are classes of striped locks) guarding them in very complex locking patterns.  Now replace all those locks with an STM & atomic.  Is my program any more correct?  Not really… I might have avoided some potential deadlocks (HotSpot uses lock-ranking asserts to avoid deadlock; deadlock rarely happens at the engineer’s desk and maybe once/year in the field across all HS users).  The set of deadlocks-in-the-field avoided was minuscule.  I’ll concede that HotSpot is a rare, well-engineered piece of concurrent code, and that deadlocks-in-the-field appear to happen more often to other large programs.  But still, fixing after the fact is reasonable when the deadlock rate is so low and each fix ‘sticks’.

Instead of deadlock, HS crashes far far more often because the locks don’t cover the right set of changes.  Switching out the locks for an STM didn’t change what I was guarding; it only removed any fine-grained-lock errors (admittedly useful… but only incrementally more so than avoiding deadlocks).  I’m still stuck with a program that’s too Big to see where the proper atomic/STM/locking boundaries need to be.  In a trivial example I can say “go up one call level and atomic there”, but in the Real Program – I can’t do that.  Go up how many layers and add atomic?  1 layer?  10 layers?  100 layers?  Yes, I see unique call-stacks with >100 layers.  I can’t put atomic around main because that makes my program single-threaded.

Here’s where I want some kind of compiler support.  Ref helps – because it at least demands I have an atomic around the Ref.  But Ref is insufficient, because a syntactically correct program simply wraps an atomic around each Ref – exactly what my trivial example did.  I’d like to be able to specify classes & groupings of Refs that have to be wrapped in atomic together – groupings the compiler will complain about.  Yes, I’m still responsible for the semantics – I have to tell the compiler which groupings of Refs are interesting – but I’d like some kind of automatic support, so that as my program goes from 10 lines to 10MLOC I can be told “you touched both Ref Person._checking._money and Ref Person._savings._money without wrapping both Ref accesses in a single atomic; that’s a data-race error”.
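
Nothing like this exists in Clojure today, but here is a hypothetical sketch of the kind of support being asked for – every name below (with-ref-group, grouped-alter, the :person-accounts group) is invented for illustration, and this is only a runtime check, not the compile-time analysis Cliff wants:

    (def ^:dynamic *open-groups* #{})

    (defmacro with-ref-group
      "Open an atomic region for a named group of Refs."
      [group & body]
      `(binding [*open-groups* (conj *open-groups* ~group)]
         (dosync ~@body)))

    (defn grouped-alter [group r f & args]
      (assert (contains? *open-groups* group)
              (str "Ref in group " group " touched outside its atomic region"))
      (apply alter r f args))

    (def checking (ref 100))
    (def savings  (ref 100))

    ;; Touching either account outside a with-ref-group for
    ;; :person-accounts trips the assert.
    (with-ref-group :person-accounts
      (grouped-alter :person-accounts checking - 10)
      (grouped-alter :person-accounts savings  + 10))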

Rich Hickey

On May 25, 11:04 am, cliffc wrote:

Not interesting *in practice*.  Because in practice, deadlocks are easy to find & fix.

I think you’ll find plenty of dissent on that point.

Deciding at what level to add the next atomic (should it be around transfer or at a higher level or not at all?) demands the programmer have domain knowledge not already encapsulated in the Account class.  THIS is the source of vast numbers of bugs.

That’s a problem with the Account class and OOP – classes are simply the wrong granularity for so many program semantics.  Yet, programmers demonstrate they can handle this, and set appropriate transaction boundaries, because they do so in their database work.  There is a pretty close analogy there, since many DBMSs will treat every statement in a batch as an independent transaction unless collected in an explicit transaction.  In spite of the same risks, work precisely like this is getting done correctly, and relatively easily, I think in no small part due to the simplicity of wrapping a bunch of arbitrary statements in BEGIN TRANSACTION.

Less trivial examples quickly spiral out of control….

What you are asking here is a philosophical question, which computers may never answer, since the answer depends on the semantics of the program, and such semantics will probably never be conveyed to the system more easily than:

  (defn transfer [a1 a2 m]
    (dosync
      (add a1 m)
      (sub a2 m)))

or the Java-esque:

  void transfer( Account a1, Account a2, long m ) {
    atomic {
      a1.add( m);
      a2.add(-m);
    }
  }

Absent some similar declaration of relatedness, the program will have to understand itself (or one program another), since while it may look like debit/credit to us, it just looks like foo/bar to it.

I hope you don’t wait for an answer to this (hard AI, potentially impossible) problem before dipping your toes in STM – it’s likely you could make a significant contribution.

Rich Hickey

Towards that end, and more generally, it would be nice if there were metrics for success other than anecdotes from the field.  There should be some meaningful problem statement STM and other solutions could take aim at, something more real-world than the typical STM paper’s correctness test or Erlang‘s ‘pass messages around in a circle’ and ‘accept connections until you die’ benchmarks, and more attainable for a new technology than the million-line systems Dr. Click and Brian Goetz mentioned in their final Java One talk, or Erlang‘s nine-9s phone switches.  Then everyone could run proposed solutions on their own 2/16/multi-hundred-processor systems and say – nope, not quite, this works, this doesn’t.

Unfortunately, describing such a problem in an architecture-neutral way is difficult.  I think there can be a certain presumption that any concurrency solution should allow people to architect systems much as they do now, as a graph of directly-connected mutable stateful objects.  Dropping STM (and many of the other concurrency mechanisms) into such an architecture is likely to forever disappoint.  IMO, if you’re not serious about immutability, you’re likely to struggle with concurrency.

On my side, points were taken re: transparency and control for STMs.  The Clojure STM’s current architecture is certainly capable of both better reporting and control.  So I’ll be working on adding the necessary diagnostic counters, and exposing the many aspects of the Clojure STM that could become ‘knobs’ for performance tuning – timeout windows, retry counts, history cache sizes etc.  So, I have some work to do before I’d like that kind of scrutiny 🙂

Cliff Click

On my side, points were taken re: transparency and control for STMs.  The Clojure STM’s current architecture is certainly capable of both better reporting and control.  So I’ll be working on adding the necessary diagnostic counters, and exposing the many aspects of the Clojure STM that could become ‘knobs’ for performance tuning – timeout windows, retry counts, history cache sizes etc.  So, I have some work to do before I’d like that kind of scrutiny 🙂

I’m going to be a sourpuss again here.   🙁
This is exactly the trap MPI fell into; and you have to do it anyways.   Double-unsmiley.   🙁   🙁 

Here’s the deal:


  • I write a Large Complex Program, one that Really Needs STM to get it right.   
  • But performance sucks.   
  • So I do a deep dive into the STM runtime, and discover it has warts.   
  • So I hack my code to work around the warts in the STM.   
  • Crap like: at an average of 5 cpus in this atomic block the STM ‘works’, but at an average of 7 cpus in the same atomic block I get a continuous fail/retry rate that’s so bad I might as well not bother.  So I guard the STM area with a “5-at-a-time” lock and a queue of threads waiting to enter (see the sketch just after this list).  Bleah (been there; done that – for a DB, not an STM, but a same-in-principle situation).  A thousand thousand variations of the same crap happen, each requiring a different hack to my code to make it performant.
  • Meanwhile the STM author (You: Rich) hacks out some warts & hands me a beta version.   
  • I hack my code to match the STM’s new behavior, and discover some new warts.
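
A sketch of that “5-at-a-time” guard (all names mine; a counting semaphore caps entry into the contended region):

    (import 'java.util.concurrent.Semaphore)

    (def gate (Semaphore. 5 true))   ; at most 5 threads in the hot region

    (def from (ref 100))
    (def to   (ref 0))

    (defn guarded-transfer [m]
      (.acquire gate)
      (try
        (dosync
          (alter from - m)
          (alter to   + m))
        (finally (.release gate))))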

Back & Forth we go – and suddenly: my app’s “performance correctness” is intimately tied to the exact STM implementation.  Any change to the STM’s behavior kills my performance – and you, Rich, have learned a lot about the building of a robust good STM.  You (now) know the mistakes you made and know it’s time to restart the STM implementation from scratch.

My options are limited:


  1. Scream at you to Not Change A Thing.  Thus C/C++/Clojure Standards Committees are born.  1/2 smiley 😛   
  2. Cling tenaciously to the abandoned version, and recognize my code is now abandon’d ware.  No new features or support for me.  But maybe the App is running fine and I can forget about it.   
  3. Rewrite my app from scratch Yet Again, to match the new STM’s warts/features.

MPI is in this position.  Every large parallel scientific App Out There is intimately tied to the marriage of parallelizing_compiler + MPI_runtime + computer_architecture.  Changing any one of those pieces requires the app to be rewritten from scratch in order to achieve a fraction of the required performance.

The parallelizing compilers have stabilized, performance-wise.  There’s a coding style which will auto-vectorize/auto-parallelize/auto-stripe/etc., there are some codes “on the edge” (some compiler+hardware combos yes, some no), and there’s a well-known “don’t go here, the compiler can’t hack it”.  The compiler support for STM is very much lacking and/or in flux.  You’re heading into this terrain now, as you add Clojure compiler support to your STM.  I, as an STM user, can’t know what’s in store for me as you go through the learning process.  I know that I don’t know which STM coding styles will be performant and which will not.

The MPI_runtime has now stabilized (I believe, not sure) after 20+ years.  Again there’s the “this is good, this is marginal, this is bad” folk wisdom that drives coding.  Again, for STM’s, this area is very much in flux.  Go read some of the bazillion academic papers Out There.  Everybody’s got their own take, every STM is good in this domain & bad in some other domain – and all the domains are different; god forbid I write a large program dependent on STM ‘A’ and later try to switch over to STM ‘B’.

The computer_arch has been stable for message-passing for some time.  For STM, I believe it’s trying to ramp up.  I.e., the computer_arch for STM is “all the world’s an X86” right now, and all hardware vendors are furiously studying what it would take to add some kind of STM/HTM/hybrid support.

Thus, for STM to make it to the ‘Promised Land’ – the STM industry needs to figure out:


  • what belongs in the compiler   
  • what belongs in the runtime   
  • where there are Dragons and where there are Green Pastures   
  • teach the STM users which paths lead to Dragons or Green Pastures.

If we don’t BOTH go through the exercise, then STM will never hit the promised land.

MPI never really “made it”; the ‘Green Pasture’ area was too small, and it was always surrounded by steep performance cliffs leading to Dragons when you slipped off the edge.  Upgrade your hardware: rewrite your app.  Upgrade your compiler: 2 months of performance debugging. Upgrade your MPI: 2 months of performance debugging to discover you need to rewrite your app.

GC “made it”; the Green Pasture gradually got bigger & bigger; Dragons remain lurking – but you only meet them after you’ve filled up an ever-growing Green Pasture and need to poke at the edges of stability.


We Agree to Press On, Bringing Light into the Darkness

Raoul Duke

it would be pretty nifty if you two had the chance to get together to try out & blog about Clojure STM on an Azul box 🙂

Rich Hickey

it would be pretty nifty if you two had the chance to get together to try out & blog about Clojure STM on an Azul box 🙂

That could be fun.

Cliff Click


    it would be pretty nifty if you two had the chance to get together to try out & blog about Clojure STM on an Azul box 🙂   

  That could be fun.

Send me an email; Azul hands out ‘academic’ accounts all the time.  You get a remote login; you upload your code & run it.

And, if you don’t mind, I’m going to edit this entire thread for clarity and echo it on my blog. This is good stuff (this conversation), and I definitely wish you the best of luck.

Cliff

Post JavaOne Update

I’d like to say I attended dozens of JavaOne presentations, but I barely managed to attend two, plus do my own three talks, plus interview 3 job candidates (we’re going to make all 3 offers), plus attend a Scala working dinner with Martin Odersky, catch up with some old friends, surf the pavilion, meet with 3 or 4 potential customers, visit the infamous Puzzle Pirates’ HQ and have dinner in the oldest continuously operating restaurant in SF.


I attended Bill Pugh‘s “Defective Java Code: Turning WTF Code into a Learning Experience”.  As usual for Bill, the talk was very entertaining with plenty of OMyGod code examples from big well-known projects, all found with static analysis and pattern matching using his fabulous FindBugs.  I found a good writeup in this blog, but I can’t find a copy of the slides online.


I attended Brian Goetz’s “Let’s Resync: What’s New for Concurrency”.  Again, I found a good writeup here but no slides.  The Big News: Doug Lea’s Fork-Join is coming along well and will be in Java7.  It promises to be a very easy-to-use, well-scaling divide-and-conquer style parallel framework.  I’d love to see how it works on one of our big 768-way boxes.

I gave 3 talks:

  1. Debugging Data Races.  I talk too fast, or at least I did during the memory-ordering portion of the talk.  Most folks liked the 2nd half (some fairly concrete examples and suggestions), but the 1st half needs some timing/pacing.
  2. Towards a Coding Style for Scalable Nonblocking Data Structures.  Sample blog entry here.  Biggest complaint from the talk: I went *way* too fast.  Sorry – it’s a talk I’ve given a few too many times and I’m getting lazy about presentation.  The code from SF high-scale-lib got a bump: 50 downloads this last week.  I’m always psyc’d when I hear feedback from this talk; it looks to me like a 1-hit-wonder (well, 2.5 hits) but I’m always left with the notion that with better tools (ala what the hardware folks have built) this approach should have some legs.
  3. JVM Challenges and Directions in the Multi-Core Era.  I split the last talk with Brian Goetz.  Two-speaker talks are always hard to pull off, but I think we managed it well.  Sample blog writeup here.  Funniest question (to me): “How’s .Net doing here?”  My answer: “I am not my brother’s keeper!”  Seriously, I’ve had dinner talks with some of the .Net runtime group, and they are shockingly out of touch here.  I foresee a large multi-core train thundering down Microsoft’s self-induced tunnel vision.

Azul News: We’re hiring again!  We managed to survive the downturn in the banking industry.  We spent the last year widening our customer base, and it’s finally paying off.  Plus the banks are buying again!  I interviewed 3 candidates who were in town for JavaOne.  I think we’re making offers to all three.  Mentally exhausting – more so for them than me, I daresay.


I had dinner with the bay area Scala working group (sorry, couldn’t find any official link), and Martin Odersky at Buca di Beppo’s.  It was a fun geeky talk/dinner.  I’m pleased to report that (unlike, say, JRuby) Scala “maps” well onto a JVM.  Code performance is probably very transparent (very sad, but no personal experience here) and very good.  Martin claims he made no real compromises in fitting the language to the JVM (but perhaps some minor warts with ‘null’ or a lack of unsigned types).  A harder issue is getting the Actor model to map well.  I personally am a fan of “no shared variables” and I hope he gets a good fit figured out.  Like most functional languages, Scala would like to see tail-recursion promised from the JVM vendors/spec (and rumors from John Rose suggest this might happen in HotSpot).  With tail-recursion, the cost-model of tail-calls is that they cost no more than a ‘goto’ (which after JIT’ing can be quite low indeed).  The moral equivalent for data structures is linear logic.  The key here is allowing the JVM and/or JIT to figure out when a message cannot be referenced by the sender again, and hence the “message passing” isn’t really a copy-cost – it’s no more than a pointer change.  I.e., similar to tail-recursion, the cost-model becomes “pointer copy” – such that you can look at the code and know “this operation is (almost) free”.  This is one of the major strong points of the C language – knowing a code’s “cost” at a glance.
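
Since the JVM lacks general tail calls today, Clojure fakes the goto cost model for self-recursion with recur – a small sketch of the cost model being described (function name mine):

    ;; 'recur' re-binds the loop locals and jumps: constant stack, goto cost.
    (defn sum-to [n]
      (loop [i 0, acc 0]
        (if (> i n)
          acc
          (recur (inc i) (+ acc i)))))

    (sum-to 1000000)   ; no StackOverflowError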


I also got the guided tour of Puzzle Pirates’ headquarters.  All I can say is: Wow.  I’d love to work in that space.  From there I took a hike to have dinner at the Tadich Grill.  This is the most non-San Francisco restaurant I’ve ever eaten at in SF.  Both the staff and the clientele were about as white as you can get.  The food and service (and prices) were all great – but I definitely felt the lack of SF “color”!  No blue-dyed spiky hair, no leather&stud bracelets, no piercings (other than female ears), and exactly 2 customers not as pasty-white as myself (and I arrived with one of them).  It was so different that it felt like a refreshing change from the usual SF polyglot… but I was glad to walk out the door and see somebody who didn’t look like a mirror image of myself.  Like programming languages, sameness is bland in the short term and a form of death in the long term.  Vive la différence!


Cliff