Cliff’s New England Fall Travel Classic

 

Or how I went to New England to view the Fall Colors in Perfect Weather,
and All You Got was this Lousy Blog

 

Azul sent me to Manhattan (my least favorite place on the planet) to speak at the NYC Java Users Group.  I discover that the earliest flight out of San Jose (plus 3 time zones) gets into NYC at 4pm – and that you can’t get from a NYC airport to downtown Manhattan in 2hrs during rush hour – so I can’t make a 6pm talk if I leave San Jose on the same day.  Sigh – it’s the Red-Eye for me.

 

Wednesday: I end up on Tuesday night’s red-eye and arrive at JFK at a delightful 6:30am local time.  No sleep for me.  I’ve done the NYC subway system a number of times so it’s not much of a problem for me to buy a MetroCard and make my way downtown.  I arrive at the hotel at maybe 8am.  Lo!  The hotel is full and they have no room for me – and indeed have no room until 3pm.  By then I am one shabby dude – unshaven for 36 hours, same clothes, no shower, no sleep – not quite up (down?) to the standards of the street people hanging around the hotel but definitely looking grubby.  And I’m exhausted.

 

3pm arrives and I finally get my room and my shower, change clothes, lie down… bad mistake.  I barely scrape up enough brain cells to force myself out of bed, and stumble, shaking from sleep deprivation, to the clock: only 90 minutes gone in a blink.  I splash some water on my face and go to meet the JUG.

 

The talks go very well; lots of good questions from an animated audience.  I presented Challenges and Directions in Java Virtual Machines and Towards a Scalable Non-Blocking Coding Style, and I got quite animated myself – all in all very good.  Dinner was at some close-by, expensive steak joint, but excellent food.  NYC definitely does steaks well.  Finally the hotel bed looms… and dawn breaks.

 

Thursday: Today is 2 customer visits, one planned and one directly due to the JUG talk the evening before.  Alas the new customer has microsecond response-time requirements; we can only provide millisecond response times (albeit with massive throughput).  The planned customer visit goes much better; it’s a tools & libraries group looking for advice on scalable Collections.  After that I take a train to visit my Uncle’s farm in Connecticut.  Penn Station is full of these little train ticket kiosks; I happily approach one.  It takes my credit card, works me down through the menus to New London, CT – then hangs.  I try another; this one takes 30sec before loudly posting “Out of Service”.  I’m 0 for 2 so I sigh and get in the infinite line to see a human.  It takes a full hour in line to buy a ticket.  Then I go wait for my (now a full hour later) train.  Just when it gets near the head of the list of arriving trains, the train board reports my train as “delayed: 10 minutes”.  This goes on for 20 minutes before it changes to “stalled”, then finally “now on track #9”.  But all goes well after that; my Uncle picks me up in New London and we drive off to the farm.

 

Friday, Sat, Sun: The next few days are a real delight.  The weather is perfect – highs maybe 65, lows in the 50’s, breezy, no clouds, and the trees in full color.  My cousins are all married and with kids the same ages as my kids, so it’s a great time to visit.  I go to a soccer game, the end of a 5K run, and a local chocolate & wine tasting; buy from a craft fair and shop at the local antiques shops; help (interfere?) with some hay harvesting; take a stab at fixing 2 Windoze machines (bizarre XP SP3 complaints about not being able to install Beethoven.wm3 after about an hour of disk grinding); and just plain visit.

 

Monday: I pack up my suitcase, both physical and mental, and prepare to jump back into the fray.  The flight to Nashville is easy – except that my last child is finally going off to public school (we’ve home-schooled all our kids for years with great success), and my wife is in a frenzy getting him ready – and I’m a thousand miles away.  So I get to hear about it between plane flights, in airports, in the cab, etc.  The Nashville hotel is nice, and only $150/night for a room twice the size of the NYC $500/night room, in a much nicer neighborhood.

 

Tuesday, Wednesday, Thursday: OOPSLA.  Paid $710 at the door – I forgot to pre-reg.  I had a great lunch conversation with Doug Lea, Bill Pugh, & Ben Zorn.  I take lots of notes.  Lots of great hallway and dinner conversation covering reader/writer locks, the state of the JCP & Exec Committee, whether Sun should drop the JCP process altogether, whether the market will implode & we should all run up our credit cards to the max, the end of civilization, etc.  Ben tries to convince me that Microsoft can be trusted as the sole holder of all my critical personal information (sorry Ben) and that totally transparent sync’ing of all my portable & desktop devices is coming Real Soon Now.  Perhaps – but I’m never going to trust my personal info to some single central server – even if it does get old-school Ma-Bell-land-line-style 5-9’s reliability for both the server and the sync’ing.  Too many companies have said “it’s safe with us” and been wrong.  Too much motivation for crooks if all the eggs are so close together.  Other conversations that stuck in my mind:

  • Azul could Open Source or JCP our hacks for +UseLockedCollections (see about 1/2-way down the blog) and +UseCheckedCollections.  This could be a Good Thing for Azul as it will reduce the number of apps which fail “out of the box” on Azul as well as raise awareness of the issue.  Note that all major App-Servers (and many a large 3-letter-acronym company’s major product) fail on Azul without +UseLockedCollections (and run great with it), and the issue has nothing to do with Azul except that we have more cores.  I.e., I predict these crashes will soon be coming to an X86 multi-core Near You!  I can hand out stack traces for any interested parties; email me for details (and yes, bug reports have been filed, but between you and me I suspect most of them get filed into the round bin).
  • Bill Pugh offers that at Google they have a 3rd option: log&continue.  The logging is only invoked where now I would throw a ConcurrentModificationException – but there’s no standard JDK place for logging such things.  We certainly could add an Azul logging option where the output is available on a RTPM data-race screen.
  • Doug Lea desperately needs 3 things to make his Fork-Join parallelism have a sane API.  Note that Fork-Join has a really nice approach to handling large complex parallel problems – there’s no real limit on scaling, plus it has a fantastic story on automatic load balancing, which is typically really crucial for us (a minimal sketch of the recursive-decomposition style follows after this list).
    1. Tail-call optimization.  Part and parcel of the solution when doing recursive decomposition.
    2. Some kind of auto-boxing-removal story.  See also John Rose’s DaVinci Machine.
    3. Some kind of forced inlining story, so that a complex F-J iterator gets cloned at the site where the loop-body is declared.  This will allow the (highly megamorphic) inner loop function to be specialized per call site, inlined into the iterator loop and decently optimized.
  • Several people asked for a Weak-Key’d AND Weak-Value’d NonBlocking HashMap.  
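As promised above, here’s a minimal sketch of the recursive-decomposition style Fork-Join encourages – written against the java.util.concurrent ForkJoin API that the framework eventually became (the jsr166y preview at the time differed in details), and all the names here are mine, not Doug’s.  The boxed Long result and the megamorphic compute() call are exactly the auto-boxing and forced-inlining pain points above:

  import java.util.concurrent.ForkJoinPool;
  import java.util.concurrent.RecursiveTask;

  // Sum an array by recursive decomposition: split until chunks are small,
  // fork one half, recurse into the other, then join.
  class SumTask extends RecursiveTask<Long> {
    private static final int CUTOFF = 10_000;   // chunk size below which we go sequential
    private final long[] a;
    private final int lo, hi;

    SumTask(long[] a, int lo, int hi) { this.a = a; this.lo = lo; this.hi = hi; }

    @Override protected Long compute() {
      if (hi - lo <= CUTOFF) {                  // small enough: just loop
        long sum = 0;
        for (int i = lo; i < hi; i++) sum += a[i];
        return sum;                             // boxed -- the auto-boxing cost noted above
      }
      int mid = (lo + hi) >>> 1;
      SumTask left = new SumTask(a, lo, mid);
      left.fork();                              // run left half asynchronously (work-stealing balances load)
      long right = new SumTask(a, mid, hi).compute();  // recurse directly into the right half
      return right + left.join();
    }

    public static void main(String[] args) {
      long[] data = new long[1_000_000];
      for (int i = 0; i < data.length; i++) data[i] = i;
      long sum = new ForkJoinPool().invoke(new SumTask(data, 0, data.length));
      System.out.println(sum);
    }
  }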

 

Friday: I make last-minute plans back to NYC to visit the tools & libraries customer group again.  It’s a 6:45am flight out of Nashville, which means be at the airport by 5am, so leave the hotel at 4:30am and get up at 3:30am.  Sigh, another day with no sleep.  The flight to NYC is rocky as all heck; the clouds have finally arrived to ruin my days.  We bump for 2hrs out of a 2:45hr flight.  Then it’s AirTrain to Jamaica Station, Long Island Railroad to Penn Station, Subway #3 to Wall Street – I’m getting quite good at navigating NYC.  I’m almost an hour early to my customer visit – which is really an open-ended tutorial on Azul’s profiling tool: RTPM.

 

These guys are doing some fun stuff.  They have their own versions of the standard JDK collections where the expected collection size is 1 million to 10+ million elements.  They have very efficient striped iterators (100x speedups on a 100-way Azul box) and are looking at Doug Lea’s Fork-Join for describing their inner-loop closures.  RTPM goes a long way toward helping them figure out what’s going on.
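For the curious, here’s roughly what I mean by a striped iterator – a hypothetical sketch (names and all), nothing to do with the customer’s actual code, and written in modern Java for brevity: each worker thread walks its own stride of the backing array, so there’s no shared cursor to fight over:

  import java.util.ArrayList;
  import java.util.List;

  // Hypothetical striped iteration over a large array-backed collection:
  // thread t of n visits elements t, t+n, t+2n, ... so there is no shared
  // iterator state and no contended cursor.
  final class StripedSum {
    static long sum(long[] elems, int nThreads) throws InterruptedException {
      long[] partial = new long[nThreads];
      List<Thread> threads = new ArrayList<>();
      for (int t = 0; t < nThreads; t++) {
        final int stripe = t;
        Thread thr = new Thread(() -> {
          long s = 0;
          for (int i = stripe; i < elems.length; i += nThreads)  // this thread's stripe
            s += elems[i];
          partial[stripe] = s;        // each thread writes only its own slot
        });
        threads.add(thr);
        thr.start();
      }
      for (Thread thr : threads) thr.join();
      long total = 0;
      for (long p : partial) total += p;
      return total;
    }
  }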

 

There’s also a nifty in-memory DB thingy where we discover that turning on “-tiered” mode (contrast to HotSpot’s -client and -server modes) is worth a stunning 5x speedup (Azul usually sees something like a 15% speedup for -tiered).  We try to make the in-memory DB run that fast under a ReadWriteLock.  The customer has a very light-weight roll-your-own version (new in Java 6: lots of new features in the JDK ReentrantReadWriteLock class, which slow it down by 50%).  The customer’s lock doesn’t stripe the reader counts and instantly suffers massively on the highly contended single reader-count word (as does the JDK version).  I step their engineer through the details of reader-lock striping; it’s definitely ticklish stuff to get right.
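Very roughly, the striping trick looks like this – a hypothetical sketch, not the customer’s code and not Azul’s, with the ticklish parts (writer/writer exclusion, fairness, padding each slot onto its own cache line) waved away in comments:

  import java.util.concurrent.atomic.AtomicLongArray;

  // Minimal sketch of striped reader counts for a read/write lock.
  // Readers bump a per-thread slot; a writer must see ALL slots at zero.
  // Real versions pad each slot to its own cache line (AtomicLongArray here
  // keeps the sketch short), handle multiple writers, and avoid spinning.
  final class StripedReadCount {
    private static final int SLOTS = 64;                 // >= #cpus, power of 2
    private final AtomicLongArray readers = new AtomicLongArray(SLOTS);
    private volatile boolean writerPending = false;      // single writer assumed

    private int slot() {                                 // spread threads across slots
      return (int) (Thread.currentThread().getId() & (SLOTS - 1));
    }

    void readLock() {
      for (;;) {
        while (writerPending) Thread.yield();            // don't starve the writer
        readers.incrementAndGet(slot());                 // contention only within one slot
        if (!writerPending) return;                      // re-check after publishing ourselves
        readers.decrementAndGet(slot());                 // writer showed up; back off and retry
      }
    }

    void readUnlock() { readers.decrementAndGet(slot()); }

    void writeLock() {
      writerPending = true;                              // block new readers
      for (int i = 0; i < SLOTS; i++)
        while (readers.get(i) != 0) Thread.yield();      // wait for in-flight readers to drain
    }

    void writeUnlock() { writerPending = false; }
  }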

 

Finally 5pm arrives and everybody is going home – although for me “home” is really a subway ride to Penn Station, then a train back to New London, CT for another weekend on the farm.  I am sooooo ready to be back in San Jose.

 

Saturday, Sunday: Another great weekend on the farm; I finally get to see some rain (it’s not rainy season in San Jose yet so I haven’t seen rain since March).  The wind blows down a tree over the power lines and we lose power from 9pm Saturday to 5am Sunday (when my room lights pop on and wake me up, grumble).  With the power out and the internet gone it really feels like a farm.  On Sunday I take the train in to Boston, then have dinner with David Bacon at Legal Seafoods & take the T to my hotel.  Gosh, I almost feel like a savvy New England traveler.

 

Monday is the 2009 VEE PC meeting.  Despite missing 1/3 of the PC members (two have new babies, one got food poisoning, several had other issues that made them dial-in only for a few minutes at a time), the meeting went very well.  We got done early and I got to take the new Boston T Silver Line to the airport – it’s labeled like a subway but it’s really an electric bus (the long bendy kind) in its own bus-sized tunnel.  Logan was easy, the flight was smooth, and I made it home (finally) at 1am local time.  It’s good to be home.

 


 

OOPSLA

Some (very raw) OOPSLA notes.  As usual with these notes, it’s my stream-of-consciousness thinking with a heavy Cliff bias.  Caveat Emptor.

========
I sat through some talks about new language designs/features on top of JVMs – they looked klunky.  I liked this one: Mixing Source and Bytecode.  Sorta like GNU ASM with C++, except for Java.  The bytecodes are fully verifiable.  This allows you to write, e.g., a naked monitorenter bytecode – something you can’t do with pure Java.  Useful, e.g., for hand-over-hand locking or read/write locks.  You can nest Java & bytecodes, so you can use Java to build e.g. string constants to be used in ‘ldc’ bytecodes.
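For reference, the hand-over-hand (lock-coupling) pattern they’re enabling: you can’t write the interleaved acquire/release with plain synchronized blocks (or, in source, with monitorenter), but with explicit java.util.concurrent locks the shape is something like this – a hypothetical list-walk sketch, my names:

  import java.util.concurrent.locks.ReentrantLock;

  // Hand-over-hand (lock coupling) down a linked list: grab the next node's
  // lock before releasing the current one, so no other thread can slip in
  // between.  synchronized blocks can't express this interleaving; explicit
  // locks (or naked monitorenter/monitorexit bytecodes) can.
  final class Node {
    final ReentrantLock lock = new ReentrantLock();
    int value;
    Node next;
  }

  final class HandOverHand {
    static boolean contains(Node head, int v) {
      Node cur = head;
      cur.lock.lock();
      try {
        while (cur != null) {
          if (cur.value == v) return true;
          Node nxt = cur.next;
          if (nxt != null) nxt.lock.lock();   // acquire the next lock...
          cur.lock.unlock();                  // ...before releasing the current one
          cur = nxt;
        }
        return false;                         // fell off the end; nothing held
      } finally {
        // cur is still locked only if we returned from inside the loop
        if (cur != null && cur.lock.isHeldByCurrentThread()) cur.lock.unlock();
      }
    }
  }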

========
Tolerating Memory Leaks – Mike Bond (graduating this fall, Go Mike!)

Great story about a memory leak in a 10000-line C# Darpa Grand Challenge vehicle – it would run out of memory in 45 minutes – Quick Fix: stop the truck & reboot the machine after 40 minutes.  Alas, in the real challenge (and not the test bed) garbage accumulated faster & the machine crashed after 28 minutes (while the truck was still rolling, and so IT ran off course & crashed).

I’ve reported on this before (the tool is called ‘Melt’) and liked it then as a Neat Idea.  He basically pickles stale “leaking” objects to disk, and can tolerate leaks until he runs out of disk space.  He’s using software read barriers that are costing them a 6% app slowdown using Jikes 2.9.2.  The barrier is a bit-test, then if the bit is set they have to CAS it off:
  b = a.f;      // load object field
  if( b & 1 ) { // low-bit is set?
    b &= ~1;    // clr low bit
    if( b & 2 ) …slow path…
    CAS a.f,b   // clr in memory as well; no need to retry failure
  }
So this little-bitty read-barrier costs 6% at the app-level???  I suspect a failure to optimize in the JIT.

Cool – most leaks are in HashMaps – and the stale objects are never accessed again EXCEPT to re-hash(!?!?!) the table when it grows.  Since the objects are actually accessed (by the re-hash logic) they appear to be live – but the only use is to re-hash them.  Perhaps we’d like a HashMap where re-hash does not involve calling hashCode but just uses a cached prior copy of the hashCode, so we can avoid calling hashCode on a stale key (you still have to copy the K/V to the new table, but that can be done without touching the Key again).
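Something along these lines – a hypothetical entry layout, not a claim about any particular JDK: stash the hash at insert time and use only the stash when the table resizes, so a stale key’s hashCode() is never re-run:

  // Hypothetical hash-caching entry: hashCode() runs once, at insertion.
  // A table resize redistributes entries using cachedHash alone, so stale
  // keys are copied but never touched (no hashCode() call, no field reads).
  final class CachedHashEntry<K,V> {
    final K key;
    V value;
    final int cachedHash;        // computed once, at put() time
    CachedHashEntry<K,V> next;   // chaining

    CachedHashEntry(K key, V value, CachedHashEntry<K,V> next) {
      this.key = key;
      this.value = value;
      this.cachedHash = key.hashCode();   // the only hashCode() call this entry ever makes
      this.next = next;
    }

    // On resize, the new bucket index comes from the cached hash:
    int newIndex(int newCapacity) { return cachedHash & (newCapacity - 1); }
  }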

========
Jolt – Reducing Object Churn in Large Programs – Matthew Arnold (IBM labs)
(why not a simple Escape Analysis or a Gen-GC???)
(but also: this is why SBA ‘works’ – there IS lots of local churn)

Implemented in IBM’s J9 & run on large programs.  Better than Escape Analysis (EA) alone?  Getting up to a 15% speedup.  Some EA’s do better: add context & go up call-stack.

Really it’s doing an EA on the root of a call-tree that contains churn.  Same as C2 (HotSpot’s -server) would do after inlining.  They are doing some fairly tricky analysis to decide where to run their more aggressive EA, using dynamic info to decide that they have a ‘large enough’ subprogram to get the right amount of context.  Then they force inlining to get the right ‘context’ to get a good EA; inline children with the largest ratio of bytes-allocated to code-size (trying to avoid code-bloat blowout).

Looks a lot weaker than Azul’s Escape Detection solution – we’d first let SBA identify places where stack-local objects are being churned, and then we’d inline & do a trivial EA.  But IBM has a working solution (in the lab) for deciding what to inline to enable EA (and J9 has an existing compile-local EA which is working fine).

EA by itself in J9 – from 1% to 9% of objects eliminated.  Just not enough context to do good EA.  With JOLT seeing 5%-30% objects eliminated (30% is for SpecJBB2005).  (Note: Azul’s SBA sees 60-70% of all objects get stack-allocated).

========
QVM – Dynamic Analysis for Debugging – Matthew Arnold (IBM labs)

Two camps:

  1. testing – high overhead tolerable in QA; deep properties relating to program correctness
  2. production – must be low overhead (or not used) so limited info comes back

How far can we go with low overhead while still checking deep program properties?  Ask the user: “how much overhead is ok for this info?”  Give as much info as possible within this budget – so we may miss errors, etc.  But approximately good info.

So have an “overhead manager” to meet the user’s budget.  Then have 3 analyses that can be managed: Typestate properties, heap probes & assertions, and finally running some Java assertions.  The overhead manager watches the sampling costs & adjusts the sampling rate to keep costs within budget.  Must have fine-grained timers, must have total CPU time across all things; must have direct access to RDTSC or else cumulative errors kill you.  Total CPU was hard for them to get from Linux (I believe Azul has this via azps/aztop).

Different program points are sampled more or less heavily.  Really, sample at 100% at each program point until the OHM starts to see bad overheads for some program point – then start lowering the sampling rate.  So init code gets 100% coverage, but hot loops maybe 1% or 0.1% sampling.  Thus you get very good coverage of rare events and “good enough” coverage for highly repetitive events.
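A crude sketch of that shape – all names hypothetical, nothing to do with QVM’s actual internals: each instrumentation site starts at 100% sampling and the manager ratchets a site’s rate down whenever that site’s measured cost blows its slice of the user budget:

  import java.util.concurrent.ThreadLocalRandom;
  import java.util.concurrent.atomic.AtomicInteger;

  // Hypothetical per-site sampling throttle: start at 100%, halve the rate
  // whenever this site's accumulated cost exceeds its share of the budget.
  // Cold code (init paths) therefore keeps full coverage; hot loops decay
  // to a tiny sampling rate.
  final class SampledSite {
    private volatile double rate = 1.0;          // fraction of events we check
    private final AtomicInteger spentNanos = new AtomicInteger();
    private final int budgetNanos;               // this site's slice of the user budget

    SampledSite(int budgetNanos) { this.budgetNanos = budgetNanos; }

    boolean shouldSample() {
      return ThreadLocalRandom.current().nextDouble() < rate;
    }

    void recordCost(int nanos) {
      if (spentNanos.addAndGet(nanos) > budgetNanos && rate > 0.001) {
        rate = rate / 2;                         // over budget: back off this site
        spentNanos.set(0);                       // start a fresh accounting window
      }
    }
  }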

Type-state sampling is performed not at the method-level, but at the object granularity (steal a bit in object header – set a bit to track an object).

Heap-profiling asserts, e.g. “is_shared(Object o)” asks the GC “is Object o shared or am I the only Thread with a ptr to o?”  Requires a full heap traversal.  Put it in the Overhead Manager, which throttles the full-GC rate to within budget.

Experiments: it works.  Overhead IS kept within budget for these large apps & complex tests.  Able to find several bugs in interesting large apps.

========
Contention-Aware Scheduler: Unlocking Execution Performance in Multithreaded Java Programs

Have a global per-thread lock-held count.  If it’s above zero, hint to the scheduler to avoid losing the quantum; allow up to 5 quanta to release all locks.  5 quanta is typically enough to release most Java locks.  Also – raise the priority of threads holding locks; more locks held ==> raise priority more.

ALSO – if T1 & T2 are sharing the same lock/object then schedule them on the same processor.  Actually they limit this heuristic to simply enqueuing T1 & T2 on the same runqueue (if all processors’ queues are already busy).  (You don’t want to deny a runnable T2 a shot at an idle CPU, but if all CPUs are already time-slicing…)

Thread Clustering: record contended sync events in a per-thread ring buffer.  Each time the ring buffer wraps, summarize the per-thread per-lock counts and then compare them to other threads.  Look for affinities; create clusters of threads sharing locks.  Migrate whole thread clusters onto a single CPU’s run-queue.  Increasing the size of the per-thread ring buffer lowers the rate at which clusters are recomputed, so you can adjust the overhead; they use 2K and report 2-3% overhead (but is that per-thread or only-when-blocked, etc.?).  Load-balance threads not involved in clusters.

Bench’d on SpecJAppServer2005 and ECPerf on JBoss4 (and others).  Run on a 16-way AMD (8x dual).  Seeing a 20% to 40% contention reduction (what’s that?) when using both scheduling notions (clusters of related threads, plus lock-aware quanta).  Seeing overall perf improvements of 5 to 15% on these big app servers.  Also seeing better scaling (up to their 16-way box), but not hugely better scaling.  Azul clearly gets better scaling here – but we might get even better with this kind of lock-contention-aware scheduling.

It was pointed out that since threads sharing locks are now on the same CPU, the speedup might really be due to cache locality of other shared objects (not just the locking ones).

We might be able to do better by tracking the L2-to-L2 miss rate and associating the misses with the threads on the CPUs at the time – and come up with some sort of “these 2 threads are talking to each other” metric.

========
Dynamic Optimization for Efficient Strong Atomicity
(Yet Another STM Paper; can you tell I’m not fond of STM?)

Getting Strong Atomicity is expensive (well maybe or maybe not: see the Clojure model).

Looking at whole-program analysis to discover objects NOT being accessed inside transactions – plus some dynamic analysis to find that most objects do not participate in a transaction.  Then the STM does not need to annotate/track accesses to these objects. 

NOTE that Clojure gets this “for free”.

They manage to get performance withOUT any XTN’s (e.g. single-threaded SpecJVM98) down to only 10% slower.  It’s very complex – involving compiling with “phantom barriers” around most memory updates, then stop-the-world code-patching to install barriers as they discover objects being used in XTN’s, etc.

========

Design and Implementation of Transactional Constructs for C/C++

The next talk is about Intel adding STM into C/C++ code – alas, the new C/C++ memory model gives no semantics to racy programs, and all large multi-threaded programs are racy.  Indeed the authors had to hack the handful of XTN-friendly programs they had access to in order to remove the races.

========
Renaming for Java

“Isn’t this solved already?”

All major Java IDE’s implement renaming for Java; a very basic and very important refactoring.  But all these IDE’s have bugs.  He gives short demos & spares no IDEs.  Basically: renaming is hard.  The lookup rules are complex.  Gives a nice javac/JAST hack for doing it right  (and sometimes you *must* give up, the renaming rules are so complex).

========
Java Performance Evaluation through Rigorous Replay Compilation

Shows nice violin plots of 30 measurements of SpecJVM98 & DaCapo.  About 10% variation overall, but more for e.g. mtrt.  It’s Standard Issue: samplers & tick counts (and scheduler variation) cause JIT compilations to occur at different times & places, so the basic code-gen is different.  After this variant JIT compilation, performance from run-to-run in the same VM is reliably faster or slower.

Attempts to remove variation by fixing the compilation load.  Replay compilation: 1st run, record compilation decisions.  Later runs: read in the “replay” and compile exactly the same way (a la Marty’s FAM).  Actually, run the same benchmark over & over; during the 1st run do all the compilation (because you can’t compile until classes load, and loading requires execution).  Later runs in the same VM are no longer allowed to JIT.

Issues with choosing a good compilation plan (pick the plan giving the best score, or the majority plan, or…) move the non-determinism from runtime to compilation-plan-choosing time.

Show significant differences amongst compile plans (yes, well experienced by me; often I get “good” runs and “bad” runs based on stupid compiler decisions). 

Run ‘q’ benchmark iterations for performance, and also ‘p’ VM invocations yielding ‘p’ compilation plans plus the statistical variation during the ‘q’ runs-per-plan.  End up with p x q measurements (bleah!  Lots and lots of runs to control for variations!). 

Show violin plots for 10 plans for jython.  A lot of variation is observed for most benchmarks; i.e. “the plan matters”.  Compile plans have little overlap on which methods are the ‘root’ of compilations.

Really nice statistical analysis of the situation.  They can show that picking different plans can significantly change the results of apparently unrelated experiments (i.e., which GC is better).  They can show that about 10% of compile-plan choices on jython will cause conflicting (“lying”) results about which GC is better.  I.e., if I pick plan A then I can show MarkSweep is better than GenMS, but if I picked plan B then instead GenMS appears better than MarkSweep.

Can do paired measurements: run GCxxx and GCyyy with plan A & diff the results.  Diff the results for plan B, etc…  Now I can see if picking a plan makes a difference. Show that it does about 85% of the time. 

Really shows: you should pick more compile plans before running the benchmark more iterations.  I.e., if my time budget requires me to pick between running the benchmark in the same harness 10 times vs. 10 from-scratch runs, I need to do the from-scratch runs because this will capture the non-determinism in the compilation plan.

========
Analysis & Reduction of Memory Inefficiencies in Java Strings

Trade6 – 40% of the heap is strings.  Many dup’d strings & many unused literals.  Unused literals are due to error messages for error paths never taken.  (But I suspect unused literals are not important for Azul, because their count is fixed with the application and does not scale with heap size.)  For Trade6 – about 1/2 of heap strings are wasted – 1/4 are dups and 1/4 are wasted in fragmentation.  Azul does not suffer the fragmentation because we don’t share string bodies.

He proposes interning strings during GC (but this changes semantics ever so slightly – things that were not “==” before become “==” afterwards).  Have a “tenured string” hash table – long-lived strings also go into this table; then young-gen strings can be checked against the table by GC.  Most of the experts in the audience want to claim that the change is ok – no self-respecting Java program will be broken by this change (myself included, but I don’t want to put words in other peoples’ mouths).
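An application-level sketch of the dedup idea (hypothetical names; the paper does this inside the GC on tenured strings, which is what makes it cheap): a weak canonicalization table keyed by content, so duplicate strings collapse to one body without the table itself keeping dead strings alive:

  import java.lang.ref.WeakReference;
  import java.util.Map;
  import java.util.WeakHashMap;

  // Hypothetical content-based canonicalization table.  Weak keys plus weak
  // values so the table never keeps otherwise-dead strings alive.
  final class StringCanonicalizer {
    private final Map<String, WeakReference<String>> canon = new WeakHashMap<>();

    synchronized String dedup(String s) {
      WeakReference<String> ref = canon.get(s);
      String prior = (ref == null) ? null : ref.get();
      if (prior != null) return prior;          // reuse the existing copy
      canon.put(s, new WeakReference<>(s));     // first time we've seen this content
      return s;
    }
  }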

Note that Azul has already rewritten the String implementation to remove the offset & count fields and we don’t share String bodies.  This hack LOWERED our average String footprint by maybe 10% and brought us a similar speedup (10% smaller heap footprint means 10% smaller working set means 10% more effective D-cache usage, etc).

They also hacked the VM to deal with the large count of unused literals.  Make a special version of String which holds the constant-pool index and a null char array.  When exploding on the null-char-array access they look for this special String & lazily make the body as needed.  Removes ~20,000 string literals from Trade6.  But note he is using tiny heaps – and the literals will not scale with heap-size, but the dup’d strings will.

When run on SpecJVM98, DB becomes vastly faster but no change on the other benchmarks.  Reduces live heap on DaCapo by 10% of heap.  For Trade6 & Tuscany reduce live heap by 20%.

========
Analyzing the Performance of Code-Copying Virtual Machines

Interesting hack to speed up interpreter-only languages.

========
Performance of translated COBOL programs on a JVM

Some things translate nicely; you can make all COBOL variables Java instance variables for thread-safety.  COBOL divisions become Java classes, etc.  COBOL’s “picture” clause specifies the amount of storage to be allocated, but mostly has a direct translation to e.g. Java’s BigDecimal class or fixed-length Strings.

You can specify arrays-of-structs in COBOL; again it’s fairly easy to make a Java struct array and fill in all the structures (but with lots of wasted space for Java object headers).  They’re actually seeing lots of wastage – he gives a 7x space-blowup example.  Seeing vast wastage in practice.

Can ‘#include’ COBOL header files, often done for just one or two fields in the header.  But must eagerly construct all defined storage space. 

COBOL subroutines include “open” and “closed” (normal) subroutines; open subroutines tend to be translated as ending with exception throws – the result is that most subroutines end in an exception throw.

Opt 1: share common init’d String, essentially interning all Strings of blanks. 

Opt 2: Same thing for BigDecimal for value zero but with different scale/decimal-point values.  Instead just return a common cached version.

Opt 3: All internal fields of structs & arrays are lazily constructed.  Use getters/setters, and getters lazily convert nulls to real storage only if asked for – for each array element or field value.
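Roughly like this – a hypothetical translated-record sketch, not the actual translator output: fields stay null until the first get, so a ‘#include’d header with hundreds of unused fields costs almost nothing:

  import java.math.BigDecimal;

  // Hypothetical sketch of Opt 3: a translated COBOL record whose fields are
  // only materialized when a getter actually touches them.
  final class CustomerRecord {
    private BigDecimal balance;      // null until first access
    private String[] addressLines;   // ditto for a translated array-of-struct

    BigDecimal getBalance() {
      if (balance == null) balance = BigDecimal.ZERO;   // lazily create default storage
      return balance;
    }

    void setBalance(BigDecimal b) { balance = b; }

    String getAddressLine(int i) {
      if (addressLines == null) addressLines = new String[5];   // whole array on demand
      if (addressLines[i] == null) addressLines[i] = "";        // per-element default (blanks in COBOL)
      return addressLines[i];
    }
  }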

Opt 4: Remove extra throws; notice when calls just return normally and allow callee to just return.

Opt 5: The translator by default uses a lot of reflection; they now make class-specific versions of many things to avoid reflection.

Opt 6: Lots of time/date String manipulation.  So made some custom date/string conversion functions to match the specific COBOL requirements instead of using the default Java Date-conversion libs.

Overall, seeing 10-20% perf improvement, with nearly 75% of objects no longer being allocated.