Unsafe & CompareAndSwap

 


A Short and Sweet (and Deep) Brain Dump –


About Unsafe and CompareAndSwap (and NonBlockingHashMap).


 


A reader asked me: where are the volatile-like orderings guaranteed in NBHM?


 


Jargon: 



  1. CAS – Compare-And-Swap instruction (or on IBM Power chips: Load-Linked / Store-Conditional).  The unit of Atomic Update on all modern CPUs, required for any sort of multi-cpu programming.

  2. Unsafe – Java-level access to raw memory.  In old-school speak: “peek” and “poke”.  (And when I say “old-school” I mean the talk elite hackers used BEFORE “leet speak” came about; I suppose this means I’m older than dirt…).  This bypasses all the managed runtime safety guarantees, and thus isn’t available to normal Java programs.  You can get it e.g. by hacking the JVM’s boot-class-path.

  3. NBHM – Non-Blocking Hash Map.  A very high performance multi-threaded-safe non-blocking hash map.  See this SourceForge link for the code.  I’ll mention KEY/VAL below to mean any key/value pair in the mapping.

 


Annoying horizontal lines are meant to give you time to chew&swallow, before moving on to the next byte…


 




I’m using sun.misc.Unsafe.compareAndSwapObject.
It makes no guarantees about volatile ordering.  That is, the *docs* are silent.
Its implementation (at the Java level) is a native call.
Like all native calls, the *docs* are silent as to the exact semantics.
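
For the curious, here is the shape of the thing – a minimal sketch (NOT the NBHM source) of grabbing Unsafe via the usual reflection back-door and CAS’ing an object field with it:

    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    class CasSketch {
      // Normal code can't call Unsafe.getUnsafe(); the reflection hack (or a
      // boot-class-path hack) is the standard workaround.
      private static final Unsafe UNSAFE;
      private static final long VAL_OFFSET;
      static {
        try {
          Field f = Unsafe.class.getDeclaredField("theUnsafe");
          f.setAccessible(true);
          UNSAFE = (Unsafe) f.get(null);
          VAL_OFFSET = UNSAFE.objectFieldOffset(CasSketch.class.getDeclaredField("_val"));
        } catch (Exception e) { throw new Error(e); }
      }

      Object _val;                  // note: NOT declared volatile

      boolean casVal(Object expect, Object update) {
        // Atomic compare-and-swap of _val.  The *docs* say nothing about what
        // memory orderings (if any) come along for the ride.
        return UNSAFE.compareAndSwapObject(this, VAL_OFFSET, expect, update);
      }
    }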


I’m highlighting the word “docs” here because there are a ton of actual requirements demanded by the Real World.



The typical native call implementation is a CAS instruction on X86/Sparc/Azul/IA64, or a LL/SC on IBM Power.
Sometimes the JIT’s inline the native call to a hardware CAS instruction.
The hardware provides volatile-like semantics for CAS ops on X86/Sparc. 
The hardware does NOT provide volatile-like semantics for CAS on Azul or IBM Power.  I don’t know about IA64.
Some JITs on some platforms may inline additional memory fences, and thus provide additional guarantees.


HotSpot, in the C2 server compiler, includes extra fences.
sun.misc.Unsafe, since it has no spec, is free to ignore the volatile-ness of anything and everything.
It is, after all, Unsafe.

 





java.util.concurrent.Atomic.* uses sun.misc.Unsafe.*.
sun.misc.Unsafe does not make any volatile guarantees.
java.util.concurrent.Atomic.* DOES make volatile guarantees (and uses a volatile variable).
Hence, for a JIT’d inlined version of s.m.U.compareAndSwap to be the proper implementation for any j.u.c.A.CAS call, s.m.U.compareAndSwap needs to include some volatile-smarts.  This happens “for free” on X86 or Sparc, because the X86 hardware CAS op includes volatile-like orderings.  This is definitely not “for free” on an Azul box because our CAS does NOT provide any orderings.
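
For contrast, here is the user-visible j.u.c.Atomic contract in sketch form (usage only, not the JDK internals):

    import java.util.concurrent.atomic.AtomicReference;

    class AtomicSketch {
      // The value inside AtomicReference is a volatile field, and compareAndSet
      // is specified to have the memory effects of both a volatile read and a
      // volatile write - regardless of what the underlying Unsafe call promises.
      private final AtomicReference<Object> _ref = new AtomicReference<>();

      boolean publish(Object expect, Object update) {
        return _ref.compareAndSet(expect, update);
      }
    }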

 





Frequently, I have uses for CAS that do not require orderings and where adding orderings would entail excessive cost.  Several of the CAS ops in NBHM do not require orderings; nor do any of the scalable counters (including the CAT counter).  Hence for these uses the Azul version of s.m.U.* does a plain Azul CAS op with no orderings.  The same could be true for e.g. an IBM Power implementation, or for any hardware for which scaling to lots of CPUs is hard and unneeded fencing would slow things down unnecessarily.
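
To give a flavor of the no-orderings-needed case, here is a minimal striped-counter sketch (emphatically NOT the CAT counter code), using weakCompareAndSet – which is spec’d to make no ordering guarantees:

    import java.util.concurrent.atomic.AtomicLongArray;

    class StripedCounterSketch {
      private final AtomicLongArray _cnts = new AtomicLongArray(64); // one stripe per likely CPU

      void increment() {
        int i = (int) Thread.currentThread().getId() & 63;  // pick a stripe
        long c;
        do {
          c = _cnts.get(i);
          // weakCompareAndSet may fail spuriously and provides no ordering
          // guarantees - fine here, since nobody reads a stats counter
          // expecting it to publish other data.
        } while (!_cnts.weakCompareAndSet(i, c, c + 1));
      }

      long estimate() {            // a racy sum - fine for a statistical counter
        long sum = 0;
        for (int i = 0; i < _cnts.length(); i++) sum += _cnts.get(i);
        return sum;
      }
    }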

Since Azul’s CAS op (hence our s.m.U.* implementation) does not force orderings, our version of the j.u.c.A.CAS ops includes extra fences to force volatile orderings (both in the native code implementation and the JIT’d inlined code). 

 





So back to the original question about NBHM orderings….
On the PUT (i.e. writer) side:


  – if a KEY is newly inserted I may not end up writing to a volatile, but I will do a CAS.  On X86/Sparc platforms the CAS includes the needed orderings.  On Azul, our simple store-buffers have the same effect… but it’s definitely not a happy situation.  There’s no Java-level guarantee (just a guarantee on X86, or with C2/HotSpot).  Technically a bug on, e.g., Power platforms.  Thanks for spotting this.


 – if the KEY already exists, and only the VAL is written – the same arguments apply.  CAS suffices on X86, and Azul’s simple processors suffice.


 


On the GET side:


 – I explicitly do a volatile-read before every key-compare (even if the keys are pointer-equivalent), and a key-compare is needed to get a “hit”.  Hence there is a volatile-read before any GET call returns anything.
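
In shape it looks like this – a much-simplified sketch, not the actual NBHM source (the real code works on a flat key/value array via Unsafe and reprobes on a miss):

    import java.util.concurrent.atomic.AtomicReferenceArray;

    class GetSketch {
      // Keys and values in alternating slots; AtomicReferenceArray.get() is a
      // volatile read, and both reads happen BEFORE the key-compare that
      // decides a "hit".
      private final AtomicReferenceArray<Object> _kvs = new AtomicReferenceArray<>(64);

      Object get_slot(Object key, int idx) {
        Object K = _kvs.get(2 * idx);      // volatile-read the key slot
        Object V = _kvs.get(2 * idx + 1);  // volatile-read the value slot
        if (K == key || (K != null && K.equals(key)))
          return V;                        // a hit - the volatile reads already happened
        return null;                       // a miss - the real code reprobes
      }
    }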


 


Thus for NBHM (except for non-HotSpot JVMs on Power), there is a volatile-write before any Put and a volatile-read before any Get.


 



 


Cliff


 

Tiered Compilation

Gadzooks, time flies!  Been too long since I blogged…  no time for something deep so here’s another quicky brain dump.

Azul Systems has been running HotSpot with a Tiered Compilation system since 2007 in our production sites, and I thought I’d blog a little about how that works.

Jargon terms first:

•  C0 – a mythical bytecode-straight-to-asm dumb translator.  Typically used instead of an Interpreter.  HotSpot uses an Interpreter and so does Azul’s VM (and so we do not have a C0).  Our Interpreter is structured differently from the HotSpot norm in a bunch of interesting ways (but that’s for another blog).

•  C1 – also known as the client compiler.  A fast & dumb compiler.  You get this compiler from a normal HotSpot with the “-client” flag.  Code quality suffers in exchange for faster compile times.  This makes for quicker startup times in exchange for lower peak throughput.  The client compiler typically uses less memory to compile than the server compiler, and typically makes bulkier code per-line-of-Java compiled… BUT: it also typically inlines much less aggressively than the server compiler and so does not replicate so much code.  Code frames are typically fairly large.  Performance is typically roughly 5x to 10x the interpreter speed.

•  C2 – also known as the server compiler.  A slow & smart compiler, with opposite tradeoffs from C1: slower compiles, slower startup, better code quality, higher peak throughput (after the slow compiles).  C2 typically uses more memory to compile, makes much denser code – but inlines much more aggressively (so ends up compiling a lot more lines of Java code).  Code frames are typically very dense (C2 uses a graph-coloring allocator even for stack frames).  Performance is roughly 30% (on X86) to 50% (on RISCs like Sparc & Azul’s Vega chips) better than C1.  Mileage varies of course, according to which benchmark and which rev of which compiler is being used.

•  Tiered compilation.  BOTH compilers are used; C1 for fast startup and C2 for peak performance.  Obtained on Azul’s VMs with any of the “-tiered”, “-client” or “-server” flags, and is also the default.  (We made the “-client” and “-server” flags turn on Tiered compilation because so many customers have either flag baked into their launch scripts, which would otherwise immediately deny them the benefit of tiered compilation.  We use other flags for turning off either compiler).

•  I2C adapter.  A tiny snippet of code that shuffles arguments from the Interpreter’s calling convention (which maps pretty directly to the Java Virtual Machine calling stack) to a compiled-code convention.  The compiled-code convention is machine specific, and generally passes lots of arguments in registers.  I2C adapters are generated lazily based on need, and you need a different one for each unique argument signature.  In a typical largish program you have several hundred unique argument signatures so you get a few hundred I2C adapters.  On Azul/TXU we pass the first 10 arguments in registers.  On X86 we match the standard X86-64 ABI and pass the first 6 arguments in registers.  Arguments that do not fit in registers are passed on the stack, and are shuffled to match how the compiled convention would normally make a call from compiled-code-to-compiled-code.  i.e., when JIT’d code calls JIT’d code, we use the best possible calling convention for speed.  The Interpreter finds this convention hard to work with (it uses a classic stack convention for all arguments).  The I2C adapter bridges the gap.

•  C2I adapter.  In contrast to an I2C adapter, this one allows compiled code to call into the Interpreter by doing the reverse argument shuffle.  You might think that once code is JIT’d you’d never get back to the Interpreter, but actually code is lazily JIT’d as it gets hot and programs have large complex control flow paths.  It’s common enough for hot code to call cold (or luke-warm) code on various off-side paths.  One other refinement: both I2C and C2I adapters are true in-place argument shuffles.  No stack-space is consumed by invoking back-to-back I2C then C2I adapters.

——————-

 

Sun’s classic HotSpot uses the Interpreter to do profiling.  Sun’s Interpreter gathers stats on call-site frequency and branch targets, on whether or not various Java safety checks have ever failed, etc.  Azul uses our C1 compiler to do the same job for three main reasons:

 

  1. We get to gather stats about 5x to 10x faster than using the Interpreter, because C1 code is so much faster than interpreting.
  2. C1 does some modest inlining; each inlined copy has its own private profiling data.  When C2 later comes along to JIT this code, if it also does at least this much inlining then it can use this private-per-inlining profiling data.
  3. The actual engineering is much simpler: when we make the C1 code we pepper it with counter increments, incrementing counters in a large profile-data structure.  The offset into this profile structure is a compile-time constant so we can issue plain load/inc/store ops interleaved in the generated C1 code.  For the Interpreter implementation, the offset into the profile-data structure varies with the exact bytecode being profiled.  I.e., as the bytecode pointer in the Interpreter advances, so does the matching profile-data structure… except that it’s not a 1-to-1 mapping between bytecodes and counters.  Most bytecodes need no profiling (e.g. ‘swap‘ or ‘aload0‘) and some bytecodes need lots and lots of profiling (e.g. invokevirtual).  So if you want to scale the bytecode pointer by a constant to get the profile counter pointer – you need to make all bytecodes have a fixed size profile area, but that area needs to be large (invokevirtual) yet most bytecodes (aload0) will ignore it.  It becomes a tremendous waste of space.  So instead Sun engineered a complex variable-sized mapping from the Interpreter’s bytecode pointer to the profile data structure…and debugged it for years and years.   JIT’ing the variable-sized offset into C1 is way the heck simpler.
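
In spirit, the counter bumps C1 emits amount to no more than this (a sketch; the real profile structure is per-method, with per-bytecode slots sized by bytecode kind):

    // Sketch of what the C1-JIT'd profiling code boils down to.  The offsets
    // into the counts array are compile-time constants baked into the JIT'd
    // code, so each profiled bytecode is a plain load/increment/store.
    class ProfileDataSketch {
      final int[] counts = new int[256];  // one flat counter area per method (size illustrative)

      // At an 'if' bytecode whose taken-branch counter lives at (say) offset 17:
      void profileBranchTaken() { counts[17]++; }

      // At an invokevirtual whose receiver-type counters start at (say) offset 42:
      void profileReceiver(int receiverTypeIndex) { counts[42 + receiverTypeIndex]++; }
    }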

———————-

 

Ok, so now we know enough to be dangerous.  Let’s apply our knowledge to a mythical 2-function-call (plus main) program, where main calls A in a loop and A calls B in a loop.  Each invoke of A calls B 7 times.  B loops internally 10 times, but has no more calls (for simplicity).  Furthermore we’ll assume no inlining (suppose A and B are each very large methods), and that eventually we want both A and B to be JIT’d with C2, using C1-generated profiling data.

 

    static void main() { while(true) { A(); } }
    static void A() { for( int i=0; i<7; i++ ) { blah; B(); blahblah; } }
    static void B() { for( int j=0; j<10; j++ ) { if(blah) { blah; } else { blahblah; } } }

Our program starts out interpreted, with main calling A.  Turns out our Interpreter DOES do some limited profiling – just enough to trigger a C1 compile when needed.  We gather function invocation counts and backwards-branch counts; both are gathered on a per-method basis without regard to bytecodes.  So when main calls A the first time (and A calls B 7 times per call to A), we have these counts:

 

    Interpreter Counts: A (1 invokes, 7 backedges), B (7 invokes, 70 backedges).

 

The compile threshold for C1 is 100 counts; we’ll trigger it on either an invoke or a backedge, any time the sum of the two counters crosses the 100 threshold.  So main calls A for the second time, and about mid-way through that execution of A, on some invoke of B, we have:

 

    Interpreter Counts: A (2 invokes, 10 backedges), B (10 invokes, 90 backedges).
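
B’s total count has just hit 100 here; the Interpreter’s trigger check amounts to something like this (a sketch with made-up names; the real flags and thresholds differ):

    // Per-method Interpreter counts and the C1-compile trigger (illustrative).
    class MethodCountsSketch {
      static final int C1_THRESHOLD = 100;   // the threshold used in this example
      int invokes, backedges;

      void onInvoke()   { invokes++;   maybeCompile(); }
      void onBackedge() { backedges++; maybeCompile(); }

      private void maybeCompile() {
        if (invokes + backedges >= C1_THRESHOLD) {
          // Real JVM: enqueue a C1 compile task, then move the counts "off to
          // the side" so we don't keep re-triggering while the compile runs.
          invokes = 0;  backedges = 0;       // stand-in for that shuffle
        }
      }
    }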

 

The Interpreter now calls into the JVM runtime, the JVM decides that B has 100 total counts, does not have any C1 or C2 code handy, nor is a C1 (nor C2) compile in-progress, and that furthermore method A is too large to guarantee that it will inline B – and it’s method B that needs to be JIT’d.  So the JVM fires off a C1 JIT of method B.  This JIT’ing hits a work-queue of to-be-compiled tasks, and a spare C1 compiler thread picks it up and starts JIT’ing.  Meanwhile, the JVM returns back to the normal running Java thread and wants to continue execution… but it doesn’t want to keep triggering C1 compiles – so the existing counts are moved “off to the side”; not forgotten but not triggering more calls back into the JVM.  In our example, let’s assume that the C1 code isn’t ready until we return all the way back to main.  Our counts are now:

 

    Interpreter Counts: A (2 invokes, 14 backedges), B (4 invokes, 50 backedges, old counts: 10+90).

 

You’ll notice that the counts for B are now split into “current counts” and “old counts”.  Generally we just sum them all together, but the Interpreter just looks at the “current counts” to avoid endlessly calling back into the JVM only to discover that a C1 compile is in-progress.  Meanwhile the program executes, and the JVM installs the C1 code, and sets a code-pointer up that the Interpreter inspects each time it makes a function call (e.g. invokevirtual).  So let’s continue our example (with main calling A again) until just prior to the next call to B (so one more invoke of A but no backedges nor invokes of B).

 

    Interpreter Counts: A (3 invokes, 14 backedges), B (4 invokes, 50 backedges, old counts: 10+90). 

    C1 – B Counts:  0 invokes, 0 backedges, plus lots of other bytecode counters

 

At this point the Interpreter has called B via the normal invokevirtual bytecode – and is now re-entering the Interpreter for a go at method B… and spots that B has JIT’d code ready.  There is the start of an Interpreter frame for method B on the stack.  Instead of interpreting B, we jump to an I2C adapter (which is lazily made so might have to be made *right now*) which will jump to the freshly made code.  At this point we’re executing our C1 code for B.  This eventually returns back to the interpreter’s execution of A (plain old function return, with a plain old stack frame pop, no C2I adapter), and A continues until it returns to main.  Our counts are now:

 

    Interpreter Counts: A (3 invokes, 21 backedges), B (4 invokes, 50 backedges, old counts: 10+90). 

    C1 – B Counts:  7 invokes, 70 backedges, plus lots of other bytecode counters

 

Notice that the Interpreter counts for B have stalled; we’re counting in C1 JIT’d code now.  We continue until about midway through the dozen’th invoke of A, at a backedge on the loop in A:

 

    Interpreter Counts: A (13 invokes, 86 backedges), B (4 invokes, 50 backedges, old counts: 10+90). 

    C1 – B Counts:  72 invokes, 720 backedges, plus lots of other bytecode counters

 

As we take the backedge and bump the backedge counter to 87, the Interpreter notices A crosses the magic total-count threshold of 100.  So again we bounce in and out of the JVM, triggering a C1 JIT of A.  Again for simplicity, let’s assume it’s not ready until we return back to main and main again invokes A:

 

    Interpreter Counts: A (13 invokes, 91 backedges), B (4 invokes, 50 backedges, old counts: 10+90). 

    C1 – B Counts:  77 invokes, 770 backedges, plus lots of other bytecode counters

 

Now we start running A in C1 code also, and basically the whole program (except for a tiny bit in main) is running at C1 speeds.  Before executing C1-for-A the first time we again shuffle our Interpreter counts around:

 

    Interpreter Counts: A (0 invokes, 0 backedges, old: 13+91), B (4 invokes, 50 backedges, old: 10+90).

    C1 – A Counts:   0 invokes,  0 backedges, plus lots of other bytecode counters

    C1 – B Counts:  77 invokes, 770 backedges, plus lots of other bytecode counters

 

So now C1-for-A executes and on the first time through it hits a call-site that is not resolved.  To match Java semantics, all call-sites are resolved lazily, as late as possible (for C++ programmers: we essentially do dynamic linking always – but with zero overhead once the link is established).  This call-site is wired to trampoline into the JVM for resolution; and the JVM will figure out that it’s a call to B.  The trampoline will have saved the return address (back to method A) in a place where the JVM can find it, and the JVM will use this return address to find the unresolved call site – and the actual call instruction.  The call instruction is patched to now call B.  Yes, this is classic self-modifying code at work: the call instruction USED to target a trampoline and NOW targets the “best target” for method B.  Since B has been JIT’d with C1 code, the call is patched to the C1-for-B code.

 

    OLD: …lots of C1-for-A-code… call unresolved ….more C1-for-A-code…

    NEW: …lots of C1-for-A-code… call C1-for-B ….more C1-for-A-code…

 

Of course, machine instructions are nothing more than a little bit of data in memory, so the patching isn’t much more than a write to memory.  It has to be atomic (because other threads can be running the same C1-for-A code and would see the call mid-patch if that were possible), and so it must not span an ICache-line boundary (only an issue for X86), and is updated with a CAS op.
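
Concretely, on X86 a near call is the 1-byte 0xE8 opcode followed by a 4-byte displacement, and the patch is a single atomic write of those 4 bytes – something like this sketch (illustrative arithmetic only, not HotSpot’s code):

    // Patching a resolved call-site (illustrative sketch).  We overwrite just
    // the 4-byte displacement, atomically, so racing threads see either the
    // old or the new target - never a half-written one.
    class CallSitePatchSketch {
      static void patchCall(long callInstrAddr, int oldDisp, int newDisp) {
        long dispAddr = callInstrAddr + 1;        // skip the 0xE8 call opcode byte
        // The 4 displacement bytes must not straddle a cache-line boundary
        // (assuming 64-byte lines) or the store cannot be atomic on X86.
        assert (dispAddr & 63) <= 60 : "displacement spans an ICache line";
        // ...CAS the 4 bytes at dispAddr from oldDisp to newDisp (e.g. via
        // Unsafe with a null base / absolute address).
      }
    }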

 

Back to our example, the C1-for-A code finally ends up calling C1-for-B code.  After 1 invoke of the shiny new C1-for-A code the counts are:

 

    Interpreter Counts: A (0 invokes, 0 backedges, old: 13+91), B (4 invokes, 50 backedges, old: 10+90). 

    C1 – A Counts:   1 invokes,  7 backedges, plus lots of other bytecode counters 

    C1 – B Counts:  84 invokes, 840 backedges, plus lots of other bytecode counters

 

Much later the counts are:

 

    Interpreter Counts: A (0 invokes, 0 backedges, old: 13+91), B (4 invokes, 50 backedges, old: 10+90). 

    C1 – A Counts:  119 invokes,  832 backedges, plus lots of other bytecode counters 

    C1 – B Counts:  909 invokes, 9090 backedges, plus lots of other bytecode counters

 

and on the next invoke of C1-for-B, we trigger the next threshold: 10000 counts triggers a C2 compile.  Same story as before: the C1-for-B JIT’d code makes a trampoline call into the JVM, the JVM decides it’s time for a C2 compile of B (and not A).  Method B gets JIT’d by C2, using the profiling done on the last 10000 or so trips around the loop in the C1-for-B code.  This takes a while, so we go back to running in the C1 code until the C2 code is ready.  Again the counts are shuffled to avoid calling into the JVM until we are ready:

 

    Interpreter Counts: A (0 invokes, 0 backedges, old: 13+91), B (4 invokes, 50 backedges, old: 10+90). 

    C1 – A Counts:  119 invokes,  832 backedges, plus lots of other bytecode counters 

    C1 – B Counts:  0 invokes, 0 backedges, old: 909+9090, plus lots of other bytecode counters

 

Let’s assume we run on until here:

 

    Interpreter Counts: A (0 invokes, 0 backedges, old: 13+91), B (4 invokes, 50 backedges, old: 10+90). 

    C1 – A Counts:  200 invokes,  1400 backedges, plus lots of other bytecode counters 

    C1 – B Counts:  414 invokes, 4140 backedges, old: 909+9090, plus lots of other bytecode counters 

    C2 – B Code

 

Now the C1 code doesn’t check for the presence of C2 code directly, it just looks at the counts.  As part of installing the C2 code we reset the C1 counts to trigger a call back into the JVM:

 

    Interpreter Counts: A (0 invokes, 0 backedges, old: 13+91), B (4 invokes, 50 backedges, old: 10+90). 

    C1 – A Counts:  200 invokes,  1400 backedges, plus lots of other bytecode counters 

    C1 – B Counts:  1323 invokes, 13230 backedges, old: 0+9090, plus lots of other bytecode counters 

    C2 – B Code

 

The next call to C1-for-B from C1-for-A triggers the overflow check, which trampolines into the JVM.  The JVM finds the C2 code is ready and uses the trampoline to again find the call-site that used to call C1-for-B and patches it yet again:

 

    OLD: …lots of C1-for-A-code… call C1-for-B ….more C1-for-A-code…

    NEW: …lots of C1-for-A-code… call C2-for-B ….more C1-for-A-code…

 

Now as we run along, the Interpreter is executing main, which calls C1-for-A in a tight loop (via an I2C adapter), and C1-for-A both profiles and calls C2-for-B directly (no adapter).

 

Eventually, C1-for-A will trigger a C2 compile of A, which will get installed.  This new C2-for-A code will have its own unresolved call-site which will eventually resolve to the C2-for-B code, and the system will settle down to running all C2 code (except for a tiny bit of the Interpreter).  The C2 methods have no profiling overheads, but are JIT’d with good profiling information.

 

    Interpreter Counts: A (0 invokes, 0 backedges, old: 13+91), B (4 invokes, 50 backedges, old: 10+90). 

    C1 – A Counts:  1300 invokes,  13000 backedges, plus lots of other bytecode counters 

    C1 – B Counts:  1323 invokes, 13230 backedges, old: 0+9090, plus lots of other bytecode counters 

    C2 – A Code, calls C2-for-B internally 

    C2 – B Code

 

———————-

 

That’s the Grand 1st-order bit for Tiered Compilation.  Within this framework there are a lot of refinements.  I’ll add a little more detail, but then I’ll have to stop lest this blog go on forever…

 

Suppose method B goes through a phase-shift, and starts executing code it’s never executed before.  This code will have zero counts in the C1-for-B generated profile.  In the C2 JIT’d code, code with zero counts gets re-directed back into the Interpreter, generally at the branch bytecode which has never gone down that path before.  Now that we are taking this new path, we’d like to profile a bit more, then re-JIT with C2.

 

This transition from C2 to the Interpreter is called an “uncommon trap” in HotSpot jargon (and is very similar to a deoptimization event).  I’ll skip the gory details on how the transition works, but basically the running C2 frame is replaced with the equivalent (inlined) stack of Interpreter frames.  Also we bump the profile counters for this new path (since the Interpreter does not bump path-specific counts).  A call-stack at this point in time might look like:

 

   Interpreter-frame (running main)

   C2-for-A frame (running A)

   Interpreter-frame (running B down some new code path).

 

The Interpreter’s counts for B get shuffled back, so that the next invoke of B will trigger the JVM:

 

    Interpreter Counts: A (0 invokes, 0 backedges, old: 13+91), B (14 invokes, 140 backedges, old: 0+0).

 

The JVM will, in turn, decide: B is hot enough to be profiled and C1-for-B already exists, but no suitable C2-for-B exists (the existing one has been declared stale) and so the call-site in C2-for-A gets repatched:

 

    OLD: …lots of C2-for-A-code… call C2-for-B ….more C2-for-A-code…

    NEW: …lots of C2-for-A-code… call C1-for-B ….more C2-for-A-code…

 

Now the C1-for-B code gets to profile the new paths, and eventually it will trigger a new C2-for-B compile which will use the new profile data.

 

———————-

 

I hope this has given you a good flavor of how tiered compilation works!

 

Cliff

 

Part 2: Lisbon to San Francisco, via Toronto – plus ISMM and PLDI

2010 ISMM & PLDI in Toronto, CA

Skip to the techy parts…



 


As mentioned in my last blog, I am flying into Toronto from Lisbon, Portugal.  There is no direct flight, so I’m taking US Airways and connecting through Philadelphia.  The flight over is fairly easy and in daylight the whole way so no chance I get any sleep – no problem, the quickest way for me to kick jet-lag is to stay up late anyways (not that I get over jet-lag very rapidly in any case).


 


We arrive on time, but getting through customs is *much* slower than I expected.  I mean, every step moved along fairly well – but there were a *lot* of steps.  I probably showed my passport to more than 10 different people as I moved through the various queues.


 


Long story short, I made my gate with maybe 5 minutes to spare – before boarding should begin.  So really with plenty of time… except as I arrive at the gate they are announcing that my flight from Philly to Toronto has been canceled.  Ugh… deja-vu – I head over to the service desk, *again*.  I wait in line for a new flight *again*.  This time it’s fairly quick, 1/2hr at most and my turn arises.  The 5:30 flight is already booked, so I’m on the 8:30 flight.  Much better than last time – now I’m only being delayed 5 hrs instead of 24.


 


I settle down to do some email… and discover that my laptop charger is fried.  Apparently taking it to Europe was a 1-way trip.  It worked fine with 120V before I left, it worked fine with 240V overseas, but it’s not working now!  Of course the battery is dead, and I’d like to check email, finish blogging, skype people at work, etc, etc… but no-go.  So *now* I really got some time to spend.  I walk the airport to confirm they do not have a ‘tronics shop that carries a laptop charger (phone chargers are in every little tiny newspaper store but nothing for a dead laptop).  I end up catching up on some reading I’ve been meaning to do.*


 


  *footnote: I end up reading “Conversations with God (an uncommon dialog)”.  I agree with most of what the author says, but certainly not all.


 


Around 8pm, I notice the airport’s gate screen reporting that my 8:30 flight is delayed to 8:50pm.  Then it’s canceled.  Then it’s on at 11:00pm.  Then 9:30pm.  Then 9:50, then 10:10 and finally 10:15.  I am *not* making this up!  Over the course of a couple of hours US Airways changed the time on my flight a half-dozen times!


 


But finally the plane arrives, and we get boarded.  The flight to Toronto is easy (if 8 hours delayed).  I finally get to my hotel around midnight.  The room is plain but serviceable and I can finally hit the sack.  The next day I shop around and finally pay $130 for a laptop charger – for a laptop whose value cannot exceed $50 anymore – but I’m stuck!  I need the laptop for the conference!  Indeed this blog is being written right now on the laptop with my shiny new charger.  I found two food courts closed on a Friday evening at 6pm in the downtown area; rather shocking to my American sensibilities.  Fortunately the street vendors make a really nice hotdog.  Upon returning from the shopping trip, my hotel door failed (battery in the card reader died) so the hotel upgraded me to a nicer room higher up.  There was a large & loud party outside my *old* room when I went back to it (to gather my shirts hanging forgotten in the closet) – I had to ask the dozen ladies in high heels to make room in the hallway.  Later, the same party made it up to my new floor and around 1am I had to ask them to lower the volume.


 


[postlude: Saturday night: the party continues.  The first party-er arrived at 1am and banged on their door across the hall.  Of course nobody was in the room yet, so they called their friends & held a loud phone conversation about getting a room key.  10 minutes later the main party arrived and ran on until 2 or 3am, with about 20 people in 2 rooms and mostly spilled out into the hallway.  Sunday night they went home and I finally got some sleep.  Monday I’m dragging all day; I crash at 9:30pm and Tuesday I’m all chipper again.]


 


Somewhere along the way I figure out how to test C2 inlining heuristics in a production setting: I can write microbenchmarks with well controlled call access patterns.  If properly inlined, C2 will be able to discover a key constant which will make some outer loop infinitely fast; if not then not.  So a slow loop means: not the expected inlining which I can detect with basic timing.  Needs more work before I can put it into a production testing situation, but I got a prototype working.


 


Tuesday my flight leaves at 5pm, and I’m getting conflicting info (between the front desk and the concierge) on when I should leave for an international flight from Toronto.  We settle on leaving the hotel at 2pm; time enough for me to catch the morning sessions and Cormac’s talk at 1:30.  The flight home is easy and no mistakes – that it’s Air Canada instead of US Airways might have something to do with it.  There’s also a power outlet near my seat, so I’m busy blogging again.





Ok, On to the Good Stuff!


 


As usual for my conference reports I brutally summarize a year of somebody else’s hard work in a short paragraph; I skip some sessions and talks; I blend ISMM and PLDI sessions (and I gate-crash some LFX sessions).  Caveat Emptor.



My winners:

Evaluating the accuracy of Java Profilers– all your profilers are belong to us.  I’ve fought this fight for years and years.  I “got it right” at Azul.  I hope to bring better (heck, plain not-busted) profilers to the masses someday.
Schism: (Fil Pizlo) Real Time GC– Best real-time GC ever… as of 2010…
Memory, an Elusive Abstraction– hysterical talk moving us from before the “Era of Vagueness” till past the “Era of Incorrect Specifications” (as in: no, the hardware doesn’t really do that)


Adversarial Memory for Detecting Destructive Races – really?  I can see *that* value?


GC

Heap Shape & Scalability– Yes!  I can in fact parallel-mark most heaps with 100 cpus.   😉
Concurrent, Parallel *Real Time* GC – not a bad RT/GC just not as good as Schism
Optimizations in a Private Nursery-based GC– All your cache-locality belong to us.
Economics of Garbage Collection– Fascinating attempt to cross-discipline concepts
The Locality of Concurrent Write-Barriers– Yeah!!!  Tony Hosking renames all the write barriers to sane names!!!   Boo!!!!  He fails to tell anybody that’s what he’s doing.   🙂
Detecting Inefficiently-Used Containers to Avoid Bloat– how to find HashMaps with 1 element
Detecting Low-Utility Data Structures– how to find HashMaps with 1 ‘get’ call


Safety

CETs: Compiler Enforced Temporal Safety for C– capability pointers
Parallel Checking of Expressive Heap Assertions– run this assert as you walk the heap 
Chess–  model checking for the masses.  Alas the masses need some education about model checking…
Breadcrumbs: Efficient Context Construction for Dynamic Bug Detection– seen it before, the latest invocation still looks good but I still say don’t decode, just cache
Detecting JNI bugs– it’s too easy, should be on by default
Verve– proven OS; very light-weight way to write totally type-checkable X86


Experience

JavaScript: The Bad Parts– scary JS; but well experienced by me.  Alas I see Giant programs slowly running with JS.
Keynote: Susan Eggers, “Saving the Bacon”– classic old-school compiler tech applied to FPGA’s

misc
Speculative Parallelization Using State Separation and Multiple Value Prediction


 


 



CETs: Compiler Enforced Temporal Safety for C


Fix buffer overflows & dangling pointers.


 


Buffer overruns have been done a bunch before.


 


Add meta-data to the side, no change to the object data layout so works with legacy code.  Check each ptr-deref.  After some (aggressive) static compilation, get a 50% slowdown.


 


Compare to valgrind/purify.  They use external bit-map metadata to track the free/not-free status of every byte.  But if memory is freed and then re-alloc’d, a dangling pointer might operate on the wrong kind of memory.


 


Instead use an identifier-based approach (a capability).  Every malloc produces a new ‘key’; must have proper key to access memory.  So ptrs have a ‘key’ with them.  Must copy the key/lock pair with each ptr-copy.  This requires source-code modifications to help with the new ‘fat’ ptrs.


 


CETS: keep the extra key/lock data off to the side in a separate meta-data space.  Memory-based ptrs use their address to key into the meta-data.  Register-based ptrs copy the key/lock into registers (i.e., the compiler ‘inflates’ the ptr into a fat-ptr when loading a ptr from memory).


 


So memory layout is unchanged, and handles re-allocs.


Extend for stack objects in the obvious way.


 


Rest of talk devoted to the (fairly obvious) careful engineering to make it run fast.  Hooking into LLVM compilation system.  ‘Hook’ into the optimizer before register allocation, so spills do not get annotated.  Also no need to check most stack refs (compiler temps & locals).  Also do redundant-check elimination.  Remove perhaps 70% of static checks.


 


Runtime overhead is about 50%.  Overhead is mostly due to shadow-memory accesses & register pressure.



Cliff Notes: *Might* work on HotSpot.  That’s a very strong statement despite the very strong disclaimer of “might”.  Almost no memory safety tools work on HotSpot.  We’ve tried a *lot* of them over the years.



Cliff Notes: Not well defined in multi-threaded apps; data-races are not properly handlable in any case?



Cliff Notes: He’s keeping keys unique.  Could also just never free virtual memory hence have all ptrs unique (want to free physical memory tho).  Compared to EFence – might work – but for high-allocation rate (of small objects) need a better OS mechanism for alloc’ing and free’ing memory.  mmap is too slow.


 



Parallel Checking of Expressive Heap Assertions


Checking whole-heap assertions in parallel.


“Unrestricted use of aliasing is Evil” and


“Unrestricted use of aliasing in the presence of concurrency is the Ultimate Evil”


 


Presents a classic data-race-style bug.  My classic example is one thread setting a ptr to null, and another trying to load through it (after testing that it’s not-null).  His example is from Swing with one thread calling “dispose” and another thread trying to render the object (after checking that it has not been disposed yet).  He’s got a few more real-life examples, but they all have that flavor: two reads with a (rarely) intervening write.
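
My classic example, in code form (a sketch):

    // The classic check-then-act race (sketch).  Thread 1 nulls the field;
    // Thread 2 tests and then dereferences - and the null store can land in
    // between the two reads.
    class RaceSketch {
      static volatile Object _ptr = new Object();

      static void thread1() { _ptr = null; }

      static void thread2() {
        if (_ptr != null)          // read #1: the check...
          _ptr.hashCode();         // read #2: ...the act - can still see null and NPE
      }
    }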


 


So want to check invariant: “Object foo is only reachable by one thread (except for e.g. the DB manager)”.


 


Expressiveness: is object shared?  is object reachable?  by avoiding some paths?  by only 1 object (after ignoring some certain other objects)?  etc.


 


Use Java Modeling Language (JML) to describe various pre-/post- conditions.  These are checked every whole-heap GC cycle.  The JML Compiler is fragile right now; it sometimes exits silently or does not handle some queries.


 


This work: make JMLC robust for some common queries, modify the JVM/JML combo so the common queries are also efficient.  Also hard to use: must modify source code sometimes.  User must add probes “where it makes sense”.


 


How to piggy-back on GC?


 


Implemented in IBM’s QVM (IBM J9 production JVM).


 


Can do a portable version using JVMTI but slowly.


 


Performance of the added probes appeared to be binary: either very little slow-down, or huge slow-downs.


 


Found a number of broken idioms in a bunch of real apps.  Showed a dozen apps; 3 have no bugs, the remaining average about 3 bugs each.  Running the queries in parallel helped a few apps but mostly didn’t help.




Cliff Notes:


He’s assuming a snap-shot-at-the-beginning.  So the answers are “atomically correct” from the heap-shape at the start of GC, but the heap will have changed so you do not get the witness.  Azul of course uses an eager-update GC.


 



Heap Shape & Scalability


Can we use many-core to trace large heaps?


 


Really: does load-balancing on 1000 cores work?



Cliff Notes: Azul sees 100 cores usefully marking 300+ Gig heaps.


 


How deep are object graphs in real programs?  Could be you make a singly-linked list with 4 billion objects, but how deep in real life?  Limited to dacapo, specjvm98, specjbb.  SpecJBB is boring: always depth 24.  But other programs see depths from 1K to nearly 20K (pmd in Dacapo).  Other benchmarks have a max depth less than 128.


 


Deep & Narrow object graphs are Evil.  So it’s not just deep, but if it’s narrow you are capped at the number of parallel CPUs.  Want wide bushy graphs.


 


Wow… object-shape-depth graph.  At shallow depth for jython or xalan, graph is “wide” – 10000 at depth 1.  Somewhere between depth 10 and 50 suddenly graph goes from very wide to very skinny.  Some linked-list makes it out to 10000 but *most* objects are in the very wide/fat/shallow part.


 


Did idealized traversals on different heaps.  Assume perfect load-balancing, assuming perfect memory latency.  BFS traversal.  Count objects available to be scanned & also processor busy/free cpu slots.  Report utilization for different numbers of processors.


 


See that up to 8 processors, utilization is typically very good.  Most benchmarks are still good at 100 cpus.  But 5 benchmarks have a very long tail and utilization drops way off.  Also, many GC cycles do not see the very bad graph shape.


 


Can we help the 5 bad benchmark/heap-shapes?  (1) – Add shortcuts.  (2) – Speculative roots.


 


(1) Add shortcut links.  Invisible to program, used by the tracer.  Compute the ITU both before & after adding links.  Devise a strategy for adding links: look for long-skinny chains.  Ratio of depth-vs-size(# nodes).  Need no more than 1/10th of 1% of new edges.


 


(2) Speculative.  As discussed at Azul before (patent?).  Pick some random starting points and mark.  Later see if this whole trace gets live; if not then the work is wasted.  If found, then make the whole trace live.  Some random dead stuff will accidentally get marked live.


 


Amount of floating garbage turns out to be fairly large.  Hard to find good roots as tracing gets close to finishing.  As the heap gets more marked, it is hard to find good un-marked roots.  Very inefficient use of the extra CPUs.  Only a modest improvement to the ITU score.  I get cited a few times for this idea in the slides; too bad it turns out to suck.


 



Concurrent, Parallel *Real Time* GC


 


Multi-core scheduler hacks


CPU-local GC data structures


Lock-free algorithms for GC


Non-moving collector.


 


Cool hack: require at GC times that all roots are in the heap & reachable from some common place.  So no stack scanning.  But more spill code around call sites: must squirrel away roots in a scannable place.  Imagine if every other stack was only allowed to hold oops.  Then stack scanning is as fast as scanning a large array.


 


Obvious hack for large object-ref arrays: array-lets.  Per-object scan time is now bounded.


 


Scheduler: need to do what Azul did for co-op scheduling for GC.  Need to halt all threads at safepoints.  Actually doing totally co-op scheduling!!!  NOT pre-emptive!!!  And these points are GC points as well.  Points auto-inserted by the compiler.  Hmmm… I like this: the co-op pre-emption points are also GC safepoints.


 


Need another CAS/step in the marking of each object to help with the load-balancing algorithm.  Also impacted by heap-shape.  Pure linked-list heaps cannot be marked in parallel.  



Cliff Notes:  For his real-time apps, depth is usually fairly small.


 


Then pretty standard parallel lock-free sweep phase.  Similar to NBHM copy ops: CAS to ‘own’ a chunk to copy.  Allow extra ‘owns’.  Do stuff like striped local counters for free-memory, CAS to the global counter when the local counters/memory/lists run out.  


 



Optimizations in a Private Nursery-based GC


 


New functional scalable language.  Need a better GC.


 


Seeing scores peak for high-allocation-rate stuff at 4 to 8 cores; the limiter is memory-bus bandwidth for new allocation.


 


Using a thread-local private nursery.  Shockingly close to Azul’s SBA.  When nursery gets full do a thread-local GC.  Typically see 3% survival rate.  Move the objects into the public heap.  Otherwise keeping the same physical address for the private nursery, so that the PN stays hot in your cache.



Cliff Notes: Really it’s like Azul’s stack-based allocation (SBA), done again for bandwidth reasons.  Our caches are not big enough to make it pay off, and our default GC is really good, and we have lots of bandwidth so we never bothered to make our stuff production-ready.  Naturally there’s no good reference for this (I’ve presented a handful of times on this, so the slides are on the web somewhere).


 


How does he handle ‘escapes’?  Maybe classic store-barrier card-mark?   Yes – it’s a classic 3-gen system: classic stack-vars, the private-nursery (SBA area) and the shared public heap.


 


Write-barrier for escapes.  If the object has no-refs and is immutable, then we dup the object (FAIL!?!? for “==”).  Otherwise do a PN collection (Azul: thread-local stack-walk but wasn’t doing a GC).


 


PN size is important.  Too small, and endless GC cycles.  Too large, and the PN gets kicked out of the cache by the mutator.  Dynamically tweak the size.  Can approximate cache behavior by estimating the byte-allocation rate (from the size of the PN and the time since the last PN GC).


 


A lot of time is spent stack-walking.  Expecting 10000x more stack-walks than normal GCs.  Can do trimming of previously-walked frames.  Using stubs to watermark stacks.


 


Paying a 20% throughput loss, max pause time <5ms for a 24-cpu raytracer.

Cliff Notes:  Why the throughput lossage?  Expensive write-barrier?  High rate of thread-local GCs?


 



Speculative Parallelization Using State Separation and Multiple Value Prediction


 


– Execute loops iterations in parallel 


– Ignore cross-loop dependencies 


– Software thread-level spec 


– Complexities: determine frequency of speculation vs fail; user hints; recovery; user-defined rules; user-defined roll-back code, etc.


 


Main thread runs in “non-speculative” space or D-space (very helpful name! Not!)  At loops, launch speculative threads; copy their “speculative space” or P-space.  Also add a control-space (C-space) for each thread.  When thread finishes an iteration, use the C-space to confirm if speculation is still good.


 


Cross-iter dependencies have to be rare, also failures have to be rare. 


Does not quantify “rare”.


 


Later spec threads can rely on earlier spec threads – but such data is not available (yet).  So predict it.  Each iteration is predicting the results of the prior (speculative) iteration.  Attempt to “slice” prior iterations to feed the prediction logic.  Later iterations do more work (must execute the slice) but still less work than a full loop iteration.


 


Also willing to speculate lots of threads for each “reasonable” set of predictable inputs to the next iteration, based on a prior profiling run.


 


On an 8-core machine speedup varies from nothing to 1.5x.  i.e., not a whole lot of speedup for an 8-core machine.  It’s log-scale speedup (for the wrong direction of log-scale).  It’s not much speedup for lots of cores.


 



The Locality of Concurrent Write-Barriers


 


Write barriers cost locality.


 


Ugh – Tony Hosking changed jargon names for GC write-barriers ON PURPOSE.  The names are better than the “Dijkstra” barrier or the “Brookes” barrier but he needs to give us more warning!  I’ve no idea what the names mean until 1/2-way through the talk.


 


Folklore: “Deletion” barriers have bad cache behavior


 


Gives examples of 3 different write-barriers. 


Two are Insertion barriers for incremental-update. 


One is the Deletion barrier for snap-shot-at-beginning.


 


Shows the classic scan-once, then scan-again (after 1st marking) and maybe scan-again.  It’s the classic race with mutators preventing a GC cycle from ending: Azul’s GC wins again (deterministic 1-shot marking, no mutator race for us).


 


Try to measure the cost of various barriers.  Removed as many JVM-specific costs as they can; remove JIT cost, other GC cost, etc.  As the heap size grows (so GC happens less and less), the difference in write-barriers seems reduced.


 


Objects “noticed” by a mutator are going to be in the cache.  Objects that have to be touched by the write-barrier won’t have a locality problem if the mutator is going to “notice” them anyways.  Turns out the distance between needing the object for the GC write-barrier and needing it for the mutator tends to be very small.  So the objects are not being evicted.  Hence they have good locality.  Furthermore the “deletion” style barrier had the best cache behavior.


 



Memory, an Elusive Abstraction


 


Keynote: Peter Sewell


 


BTW, a hysterical talk.  And since it’s a keynote there’s no paper.  You’re stuck with my writeup.


 


Golden Age of memory: 1945-1972. 


  Memory was an Array of Values 


  Finite function from address/index/X to Values/Y.


 


Multi-processors appeared, and Leslie Lamport added this: memory appears “as if” everything happened in some serialized order.  This was true in the 60’s.  But not true as of 1972, IBM 370/158MP.  But recently it’s mass-market: core-duo.


 


But modern hardware & compilers provide only “relaxed” or “weakly consistent” memory models.  Gives the simplest data-race possible.  Expect to read only 0/1 or 1/1 or 1/0, but not 0/0, given sequential semantics.  But it fails (reads 0/0) on a Core 2 Duo about 630 times out of 100000 runs.
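
The litmus test he showed is the classic store-buffering example; in Java-ish form (plain non-volatile fields, a sketch):

    // Store-buffering litmus test (sketch).  Under sequentially consistent
    // semantics the result r1/r2 can be 0/1, 1/0 or 1/1 - never 0/0.  On real
    // X86, with each store parked in a store buffer while the other thread's
    // load runs, 0/0 does show up.
    class LitmusSketch {
      static int x, y;             // plain, non-volatile
      static int r1, r2;

      static void thread1() { x = 1; r1 = y; }
      static void thread2() { y = 1; r2 = x; }
    }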


 


Real processors & languages are complex & subtle. 


Most research is done on idealized models or simplified hardware.


 


Industrial processor & specs: we’ve looked at X86, Power, Arm, Java and C++ specs. *They are *all* flawed *.  Good thing he didn’t look at ours.   🙂


 


“The Era of Vagueness”.  Pre-History: before Aug 2007.  None of the memory manuals available from the hardware vendors could be made sense of.


 


“The Era of Causality”.  White papers with “principles” and 10 litmus tests.  “P1.  Loads shall not reorder with older loads”, etc.  And then reading between the lines you “invent” a hypothetical micro-architecture that meets all the principles & litmus tests.  But then, exploring this you can find more vagueness, it’s just a little further along.  And then they discover all kinds of bugs & ambiguities in the spec.  And it’s unsound: real hardware does things explicitly forbidden by the spec – although failures might be on the order of 1 per every million runs of the litmus test.


 


Nov 2008-now: closed holes in the spec by allowing more behavior.


 


Really: it’s smart people labouring under an impossible situation.  Architects need to reveal enough for effective programming, but not reveal any IP, and not constrain future products.


 


So finally come up with the “x86-TSO” Abstract Machine.  Bits & parts that taken together give a clean description of what a real X86 allows – as a spec.  Wow: he gives a pretty clear description.


 


Zounds – no program is data-race-free at the spin-lock level, because the spin-lock itself is a big data-race.  So all those DRF program theorems are utterly useless.  New results: a particular linux kernel spin-lock is OK against this X86-TSO model.  But no other locking paradigm has been proven.  All your Data-Race-Free-Memory-Models are Belong To Us.


 



Chess


– Tom Ball


 


I skipped out of the ISMM Sunday morning session to visit the LFX session.  LFX is attempting to bring practitioners “out of the woodwork” and to get people who have built large practical systems to talk about their experiences.


 


Chess – it’s a testing tool, a testing methodology, a platform for research and teaching.  


 


Chess is a user-mode scheduler.  It “hijacks” scheduling from the OS.  It can take a different thread schedule each time, it can replay schedules.  


 


Common usage: fork/join unit test.  You start 2 threads, launch them, join them, then assert for something.  Chess then allows all possible interesting interleavings to be checked.  


 


Interesting Observation: For most bugs, there exists a short-running scenario with only a few threads that can find it.


 


Unit tests: better coverage of schedules, easier debugging, regression, etc. 


Stateless state-space exploration. 


No change to OS scheduler 


Ability to enumerate all/only feasible OS schedules 


Uses sync-points – schedule flips where threads communicate + race-detection 


Serialize concurrent behavior 


Has a suite of search/reduction strategies


 


Don’t capture program states.  Not needed for termination, and it’s very expensive because states can be huge.  At the cost of perhaps revisiting states.  Use some partial-order reduction to avoid revisiting too many similar states.


 


Example: insert schedule points before each lock/unlock.  Then the Chess scheduler might use random-sleeps there: 


    T1:                   T2:
      lock                  lock
        bal += x              t = bal
      unlock                unlock
                            lock
                              bal = t - x;
                            unlock


 


Improvement 1: understand happens-before relations.  If T1 is stalled even a little, then T2 will hold the lock for awhile, and T1 cannot run in any schedule or for any random sleep.


 


Improvement 2: understand lock/unlock and run/wait queues


 


Chess converts partial schedule orders into total orders.  As if you have a uniprocessor.  Chess makes scheduling deterministic, but does NOT control e.g. input non-determinism.  So network packet-arrival order is not captured.  Nor is variation of user action.


 


Fast-Mode: just this: introduce scheduling points before sync, volatiles and interlocked operations.  Finds many bugs in practice.


 


Data-race mode: Find data-races.  Introduce schedule points before racing accesses.  Eventually you’ll capture ALL SC executions.  A bit more expensive than fast-mode.


 


Learning from Experience: “User forum”, Champions


 


Feedback: “CHESS doesn’t Scale” – humm, we managed to boot/shutdown the Singularity OS with Chess and found bugs, so what does “Scale” mean? What they usually mean: “It needs to run my tests lots and lots and each of my tests is slow” or “gosh the state-space is huge”.  So there was this kind of learning curve for most CHESS users, that time spent is “time-to-run-one-test times #_schedules”.


 


Feedback: “CHESS Isn’t Push Button”.  Push-button techniques are very valuable but Chess isn’t push-button. 



Cliff Notes:  Check out ‘CUZZ’ – ASPLOS 2010 – push button run giant programs with weird interleavings.


 


Feedback: “CHESS Doesn’t Find this Bug”.  Added: “Warning: running CHESS without race detection turned on can miss bugs”.  And also: turn race-detection on for a few executions.


 


Feedback: “CHESS Can’t Avoid Finding Bugs”.  After finding a bug, would like CHESS to quit finding the same bug and go on to looking for more bugs.  Add in controls for CHESS to avoid doing pre-emption in particular places.  So can do “preemption sealing” of method M and now M runs atomically inside of Chess.


 


Feedback: Confusing error messages.  


 


Feedback: Chess isn’t real-time.  Chess treats “wait()” calls with timeouts as always doing the timeout.  “It’s a rainy day and …” you’re paging to death or Outlook is hanging or whatever… so the wait times out.  Now what?


 


Lots of great success quotes… but boring to read.


 


“Learning by F(l)ailing with PFX”. 


High level tips: 


– Proper expectation setting 


– good methodology & good default behavior 


– good warnings & messages 


– cultivate champions 


  – listen to them and learn!


 



Economics of Garbage Collection


Jeremy Singer & Richard Jones


 


allocation elasticity – applied to control heap growth


 


The “allocation curve” – with a large heap size you do few collections, with a smaller heap you do more collections.  There are 2 interesting points: heap so big you do ZERO collections, and the other end: the minimum required heap to let the program run at all.


 


“Law of Demand” – when something is more expensive, less is required.  Demand-Curve: curve showing that as something gets cheaper, people will buy more.


 


Price Elasticity of Demand – for a particular good, this is a measure of the responsiveness of the quantity to a change in price.  Independent of units: % change in quantity for a 1% change in price. 


 Tends to be negative.


 


Inelastic goods, e.g. a life-saving drug or petrol.  People will pay any price.  Maximum Elastic goods: fixed price or nobody bothers.  Goods with lots of choice, e.g. toilet paper.


 


   Elasticity = (dQ*P) / (dP*Q)


 


Price ==> heap size


Quantity ==> number of GCs


 


Excise tax or “duty” – similar to a sales tax; causes a vertical shift in the demand curve.  In GC, the object header increases the heap size – similar to increasing the “tax” on the heap.


 


The difference in curves, the vertical shift, varies by benchmark.  It varies because of the different mean object size.


 


  Allocation Elasticity = (d(#GCs) / d(heapsize)) * (heapsize/#GCs)


 


Java heap growth is managed by various heuristics.  


 


Then looks at Jikes’ heuristics for heap size.


At program start user specifies a “target elasticity value”.


At each GC, JVM computes ‘current_elasticity’.


If this current elasticity is smaller than the targetE then grow (shrink?) the heap until we match the elasticity.
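
As I understood it, the heuristic is roughly this control loop (my reconstruction with made-up names – not the Jikes RVM source):

    // Elasticity-driven heap sizing, as described in the talk (illustrative).
    class ElasticHeapSizerSketch {
      final double targetElasticity;     // user-specified at startup
      double prevHeapSize, prevGcCount;

      ElasticHeapSizerSketch(double target, double initialHeap) {
        targetElasticity = target;  prevHeapSize = initialHeap;
      }

      double newHeapSizeAfterGc(double heapSize, double gcCount) {
        // current elasticity = (d#GCs / dHeapSize) * (heapSize / #GCs)
        double dQ = gcCount - prevGcCount;
        double dP = heapSize - prevHeapSize;
        double current = (dP == 0 || gcCount == 0) ? 0 : (dQ / dP) * (heapSize / gcCount);
        prevGcCount = gcCount;  prevHeapSize = heapSize;
        // Per the talk: if the current elasticity is below the target, adjust
        // the heap (grow - or shrink? - the talk left me unsure) toward a match.
        return (current < targetElasticity) ? heapSize * 1.1 : heapSize;  // 10% step is made up
      }
    }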


 



JavaScript: The Bad Parts


JavaScript, very dynamic, widely used, very expressive


 


Cost: performance, assurance, maintainability 


Optimizations & analysis depends on assumptions


 


*How is dynamism actually used & needed?*


 


Instrument webkit/JS interpreter.  Record traces.  Async filter the traces.  Save them offline & execute the traces abstractly offline.  Looked at top 100 sites (an interestingly large fraction are porno sites).  Got 8GB of JS from the sites, distilled to a 500MB database, down to 33 charts.


 


Results – JavaScript usage varies widely.  Some sites use nearly 100% Date objects, some only ‘new’, some only arrays, some only mess with the DOM, etc.


 


Assumptions: 


– Program size is modest 


– Exec time is dominated by hot loops, so tracing & JIT matters 


– Call-site dynamism is low.  All use inline-caching. 


– Declared FCN signatures are meaningful 


– Properties are added at object init time.  Object ‘shape’ is defined all-at-once. 


– Properties are rarely deleted.  ‘Shape caching’ works. 


– Prototype hierarchy is invariant 


– eval is infrequent and harmless 


– industry benchmarks are representative


 


Reality


 


– Program size is NOT modest: 81K, 1.6Megs, 2Megs, etc.   


– Exec time is dominated by hot loops: 90% mark uses 0.6% of code (so yes hot   loop), or 7% or 15% or more.  i.e., some sites have no hot loops


– Call-site dynamism is low: megamorphic call sites are common; in some places   more than 1000 targets.  Also, the count of megamorphic sites is very large.  I think he missed the point though: the interesting number isn’t the number of targets for the most megamorphic site because anything beyond 1 has the same cost.  The interesting number is the %-tage of dynamic sites where an inline-cache “works”.


– Object lifetimes: mostly add at the beginning, OR add fields throughout the lifetime.  Constructors are mostly monomorphic.  But a few return 100’s of unique types.  And some sites add fields to objects forever, or delete fields shortly after construction.


– Eval: used a *lot*.  Runs arbitrary code.  Could be serialization, or JSON, or trivial.  Or making new objects or recursively calling eval or hot looping.


– Industry benchmarks: NOT representative.


 


New plan: write better benchmarks


 



Breadcrumbs: Efficient Context Construction for Dynamic Bug Detection


Sam Guyer


 


Gives a sample data-race bug.  Could detect w/vector clocks or other info.  How do we record something?  What do we record?




Cliff Notes: So it’s the olde compact/cheap calling context.  I’ve written on how to do this before: pass in to each call a hash of the current context.  Hash in the PC inside this fcn whenever you want a new context.  And then return back as an extra return value the current hash again.  See prior Probabilistic Calling Context work.  Very cheap to compute: 5%.  Claims no way to decode a PCC???  But it’s obvious… don’t decode: cache.  But I want to record probably 100x fewer samples than Sam does.


 


How do you invert the hash back to the callsite?  Key: hash function f is invertible.  They use ((3*hash + pc) & 0xffffffff).  Problem: cannot search all the possible call paths going backwards.  So each step is invertible, but it is unclear what the next step is (going up the call stack).
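
The per-call-site update is just this (a sketch of the hash from the talk / the earlier PCC work):

    // Probabilistic-calling-context hash update (sketch).  Each call site folds
    // its PC into the caller's running context value:
    //   hash' = (3*hash + pc) & 0xffffffff
    // One multiply-add per call; each step is invertible given the pc, but
    // going backwards you don't know which caller's pc to invert with.
    class PccSketch {
      static int nextContext(int hash, int pc) {
        return 3 * hash + pc;   // 32-bit int arithmetic wraps, so the mask is implicit
      }
    }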


 


So next idea: keep a per-call-site hashtable of hashes.  But must dynamically maintain this list of hashes.  Issue: must insert into the hashtable on the fly, and this typically means millions of hash-table operations.  But some call sites are very, very hot and most are cold.


 


So recompile the hot fcns without this kind of hash table.  The performance hit starts at >2x.  If you throw out info above 100K executions then it’s only a 1.5x slowdown; at 10K executions only a 10% slowdown.  Now the search is a complex hybrid.  Sometimes you have a hash-set and it’s easy.  Sometimes you’ve not recorded the info, and must do a global search.


 


Issue: the ‘fan-in’ to a method is key to deciding if you need a hash-table or not.  Single-caller methods don’t need any help.  Same for a single-digit number of callers.  But popular methods with 1000 callers do need it.


 


So this technique lets you make any race detector context-sensitive.

Cliff Notes: I just assume you’d do a quick global hash-lookup and that would not be too bad.  Making the hash is cheap.  Doing a probe costs a little more, but all you need is to test whether or not you’ve seen this stack before, and only if you are about to record the hash.


 



Detecting JNI bugs


(also works with Python/C)


 


Give a state-machine description of allowed Java/JNI behaviors, and then modify the state machine at each JNI call.


 


  static jobject C_global;  /* C-side global */
  void Bug_producer(JNIEnv *env, jobject lref) { C_global = lref; /* stashes a local ref */ }


 


When returning to Java, the local reference is now invalid, but dangling in the C code.  Sounds like a job for EFence!!!  Later Java calls again into C:


 


  void Bug_consumer(JNIEnv *env) { (*env)->GetObjectClass(env, C_global); /* de-ref of the stale local ref: crash */ }


 


State machine: tracks e.g. a JNI local reference.  Notice that some local-ref is ‘acquired’ by C code at the JNI call.  On return, mark the local-ref as being in the ‘released’ state.  Then a later JNI call attempts to de-ref it, but the state is ‘released’ and not ‘acquired’, so trigger an assert.  
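
A toy sketch of that per-reference state machine (the real checker lives inside the JNI layer; the names below are made up):

  // Hypothetical per-local-reference checker; state names and API shape are assumptions.
  final class LocalRefChecker {
    enum State { ACQUIRED, RELEASED }
    private State state = State.ACQUIRED;       // set when C code receives the local ref

    void onReturnToJava() { state = State.RELEASED; }   // local refs die when the JNI call returns

    void onDeref() {                            // C code touches the ref again
      if (state != State.ACQUIRED)
        throw new IllegalStateException("JNI bug: use of stale local reference");
    }
  }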




Cliff Notes: Really just mapping state-machine states to language transitions???  This is too trivial… Has a list of 10 things they check.



Cliff Notes: Really pointing out we should run with -Xcheck:jni always on (unless we can show a performance hit), and further should do more JNI sanity checking – because it’s cheap.


 


Overhead: about 15% on Dacapo.  Actually, HotSpot is fairly robust here compared to J9, but he still finds a number of cases where HS crashes or has some other error, while he reliably throws a java-level exception.


 


Found a bunch of real-world bugs in e.g. eclipse, gnome, etc.


 



Keynote: Susan Eggers, “Saving the Bacon”


 


Dang, I was hoping for a celebrity David Bacon roast, but no go.  Instead it’s classic old-school compiler opts – but applied to FPGA’s.  Like a domain-specific language, “if the shoe fits” then you should wear it.  Auto-compile “interesting” C kernels down to an FPGA.


 


Superscalar – both “vertical” waste and “horizontal” waste – cannot issue all 4 slots, or take a long-latency cache-miss op. 


FGMT – fixes vertical waste 


CMP – fixes horizontal waste 


SMT – seeing huge speedups?  2x to 4x on a way-way.  Dang missed context… 


Alpha chips?  Off-the-shelf execution, no compiler tricks. 


SMT – hides latency in 1 thread by executing ops from another thread. 


 


SMT Compiler Strategy: seek to issue more ops instead of hide latency.  Seek to share data in caches instead of distribute data across unrelated caches. 


Cyclic data layout vs tiling.  For SMT, cyclic is better. 


SMT is also less sensitive to tile-size.


 


Azul: We went to full cores instead of SMT; it costs just as much silicon area but has no shared-resource constraints.


 


Heading for auto-compilation to FPGA’s.  “CHiMPS”.  Auto-compile C to FPGA.


 


Emphasis on ease-of-use over max performance.  CHiMPS auto-builds caches to help with the C memory model; but the caches are distributed across the FPGA and tend to make unique caches for each C structure needing caching.


 


CHiMPS is built from standard parts: C front-end, hardware generator back-end, feeding into a standard VHDL place&route for the FPGA.  Nice results for reasonable scientific kernels.


 



Verve


 


A fully verified (simple) OS kernel.  Looks like the same style of writing X86 assembly as the HotSpot assembler, but then verifying the result is type-safe throughout.  It’s a quite simple OS, but it’s also done with very lightweight annotations – looks much easier as a writing style.


 



Schism: (Fil Pizlo) Real Time GC


 


Real-time GC needs hard bounds on space AND time. 


Previous work either no space bound or major slowdown. 


Heap access is wait-free, no locks/spinning. 


Proven space bounds. 


Fastest RT/GC.


 


Compared to HS parallel collector.  It’s fast but not concurrent.  Java RTS: hard space, concurrent, wait-free – but 60% slow-down; log-time heap access.  Metronome: only a 30% slowdown, concurrent, wait-free BUT fragmentation kills it.  Schism: same speed as Metronome but de-frags. 


 


Defrag: STW is easy but causes pauses. 


Concurrent defrag: custom hardware (azul) – $$$.  Normal hardware – then take a throughput penalty during defrag.


 


Replication-based GC: but writes are not atomic; loss of coherence; works for immutable objects.  Doesn’t work in Java: most objects mutate. 


Siebert: alloc all objects in small chunks.  No frag issue.  Access cost is known statically.  Most objects need only 2 chunks.  But for arrays you make a tree; but log-time access.


 


Can we combine: replication-copying & fragmented-allocation?


 


Arraylet Spine: 1 level of indirection to arraylets.  But since the fragments are not moving, the SPINE is immutable!  Hence it can be moved with a replication-based GC.  No coherence issue, all accesses are statically known (and only 2 deep for arrays).  Other things do not need to move because of the fragmented allocation.
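
A rough sketch of arraylet access through a spine, assuming a fixed fragment size; note the spine is written only at construction time:

  // Hypothetical arraylet-backed int array; fragment size is an assumption.
  final class ArrayletInt {
    private static final int FRAG = 1024;        // elements per fragment (assumed)
    private final int[][] spine;                 // immutable once constructed

    ArrayletInt(int length) {
      spine = new int[(length + FRAG - 1) / FRAG][];
      for (int i = 0; i < spine.length; i++) spine[i] = new int[FRAG];
    }

    int  get(int i)        { return spine[i / FRAG][i % FRAG]; }   // always exactly 2 loads
    void set(int i, int v) { spine[i / FRAG][i % FRAG] = v; }      // fragments never move
  }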


 


Schism – plus other work to make a complete RTGC.   


Also use slack-based scheduling to let the GC run in the background. 


Use prior work to prove the GC is keeping up. 


Use Immix for better locality? 


Use (?) to get cheap/free stack scanning & concurrent on other CPU. 


 


Schism/A – completely deterministic by always using fragmentation allocation. 


Schism/B – use contiguous allocation when possible, so better throughput in practice & on average, but hard bounds of Schism/A. 


Schism/C –


 


Compared Schism vs HotSpot/ParGC.  Slightly faster than Metronome.  Much faster than Java RTS. 


Compared fragmentation: HotSpot/ParGC compacts, Metronome dies.


 


Now look at this on a real RTS – 40Mhz LEON rad-hard sparc & a RT OS.  Java/Jikes is about 40% slower than C on the CDx real-time benchmark, but completely deterministic.   


Root-scanning is independent of #of threads… because the RT threads all stop (and allow the GC to run) only at shallow stacks.



Cliff Notes:  Best RT GC ever… at least of 2010.


 



Detecting Inefficiently-Used Containers to Avoid Bloat


 


Java containers are routinely misused: either the container’s min size is much too large, or else the container has many more things Add’d than ever Get’d.


 


Many containers are required for general-purpose library calls, but most library call uses only need 1 or 2 elements – they would do better with an optimized form that does not need a general container.


 


Dynamic Analysis – used by a lot of tools.  Has a lot of false positives; does a symptom-to-cause diagnosis with lots of noisy heuristics.  Depends on specific runs.


 


Static – have to discern programmer intent.  Hard to determine frequency of operation or size of structures. 


Step1 – statically find all the Add/Get operations on all ADT 


Step2 – dynamically find Add/Get freq’s 


Step3 – report: Add only one or two (overkill structure), or Add>>Get (under-utilization)


 


Step1: ADD is really a “store op” that writes an “element” into a field inside of another object.  GET is really a “load op” that reads an “element”.  



Cliff Notes: now follows a fancy analysis to find any generic container kind of thingy.  Question: does he find more *interesting* containers than doing the obvious thing and hooking the JDK containers???


 


Step2: do both static & dynamic analysis for # of Add/Gets.  For static, use loop nesting depth


 


Example of overpopulated container: 


  List L = new ArrayList();
  while(*) {
    while(*) { L.add(x); }   // many adds per...
    L.get(0);                // ...one get
  }


 


Example of underpop’d container: 


  while(*) {
    List L = new ArrayList();
    L.add(…);
    L.add(…);
    …  L.get(…);
  }


 


Tested on 21 large apps; randomly checked 20 hits for false-positives, found none.


 



Detecting Low-Utility Data Structures


 


Bloat – lots of temp objects, or excessive data copies, or highly-stale objects, or under/over-used containers.  In other words: high cost and low benefit.


 


Example: boolean isResourceDir( s ) { List dirs = computeAllDirs(); return dirs.contains(s); } 


No need to compute all DIRS to determine if 1 is in it.  So “Blame the List”.


 


But better: compute cost/benefit ratio.


 


Cost: compute the total ops executed to produce the result.  Actually compute a DAG to avoid double-counting values.  Use dynamic slicing, but that is very expensive.


 


Benefit: how to compute it?   


So start with: if a heap-value is used to produce an output, then value is high 


If heap-value is NOT used, then value is 0. 


If heap-value is used to help make another high-value, then its value is high. 


Solved with taint-like tracking.


 


Able to find a bunch of cases in DaCapo; some small hacking on each benchmark gets speedups from 5% to 30% by using better data structure choices.


 



Evaluating the accuracy of Java Profilers


 


Test 4 profilers.  But they all disagree on the hot methods.   


At least 3 profilers must be wrong.  Which one is right?


 


How do we figure which one is right?  Timing data – so no “ground truth”.  So we have this idea of “actionable”; if acting on a profile yields results.


 


So if we *slow down* a hot method, then the whole program should get slower.  Inject a Fibonacci function into hot method M.  Tune this parameter of Fib counts and see if the whole program slows down, and ask the profiler: is M taking more time?
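
A minimal sketch of that injection idea (the knob and method names are made up):

  // Hypothetical probe: 'extraWork' and 'hotMethod' are made-up names.
  final class ProfilerProbe {
    static int extraWork = 0;                   // 0 = unmodified program; larger = more injected work

    static long fib(int n) { return n < 2 ? n : fib(n - 1) + fib(n - 2); }

    static void hotMethod() {
      if (extraWork > 0) fib(extraWork);        // injected, tunable slowdown
      // ... original hot-method body ...
    }
  }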


 


Most profiles are NOT ACTIONABLE: slowing down the hot method M causes NO change in the profile, but it DOES slow down the program.


 


Issue: these profilers sample at Safepoints.  Safepoints appear on loop back-edges and in method epilogs.  Compilers optimize them heavily.  In particular, HS removes all Safepoints on counted loops.


 


So these profilers are not sampling randomly. 


Make ‘tprof’ which really samples PC’s, and has to do mapping from PC’s to methods or bytecode addresses.



Cliff Notes:  I reviewed this, and immediately compared his results with Azul’s hi-res sampling profiling and got the same answers as ‘tprof’ (plus we give you call stacks).


 



Adversarial Memory for Detecting Destructive Races


Cormac at it again…


 


Back in the Stone Age: large sequential program, runs deterministically.  Program crashes, just re-run.  But now we have non-determinism: both scheduling non-determinism and also memory-model non-determinism.


 


So run a race-detector and come up with all the fields which are involved in data races.  But many data races are benign.  Is this data-race a real bug?  Or is it an intentional & benign race left in for performance?


 


So insert a ‘software write buffer’ for these fields.  Collect the old values written to each field.  Any racey read is then allowed to return any old value.  In fact, return the oldest possible value, except also never return the same value twice in a row; i.e., alternate between the last two values written.  In particular, return NULL and not-NULL (if these have been written in the past).
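
A hedged sketch of such a per-field write buffer, simplified to just alternating the last two written values (the real tool also consults vector clocks, as noted below):

  // Hypothetical adversarial cell for one racey field (simplified; not the actual tool).
  final class AdversarialCell<T> {
    private T last, previous;                   // two most recent writes (may include null)
    private boolean flip;

    synchronized void write(T v) { previous = last; last = v; }

    synchronized T raceyRead() {
      if (previous == null && last != null) return last;   // only one write seen so far
      flip = !flip;                                         // never return the same value twice in a row
      return flip ? previous : last;
    }
  }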


 


In practice, programs crash rapidly & reliably – if they have a destructive race.  Can annotate and give stack traces for the past racey writes.  Uses vector clocks to determine the set of writes that can be read for any given racey read.


 


For specjbb, specjvm/mtrt, lufact, sor, moldyn, tsp – basically 100% crashes.


 


For eclipse, “FastTrack” finds 27 race fields.  Then run Jumble manually once for each field; found 4 destructive races.


 




 


A talk on static scheduling for shared caches for multi-core systems.


He’s getting decent speedups: 10% to 30% across a bunch of SpecINT.


Looking for constructive cache usage opportunities.



Cliff Notes: this really has to be dynamic for biz-logic Java.  But the speedups are pretty decent, I’m just not into static pre-compilation.


 



I attended a handful more talks not worth mentioning…. and that’s all folks!


 


Cliff


 


 

Or How I Got An All-Expense-Paid Trip to Portugal and All You Got Was This Silly Blog – Part 1

I got an all-expense-paid vacation to Portugal, in exchange for giving a few talks and chatting with Joao Cachopo’s group about STMs.  His group has the only commercially successful use of an STM that I know about – the Fenix system.  The  Instituto Superior Técnico (biggest engineering faculty of Portugal) runs on it, including all student courses, grading, teachers, scheduling, etc.

Fenix ends up handling about 1000 transactions/minute in daily use, but during course sign-up periods can see 10000 transactions/minute.  A typically daily load is 750,000 read-only transactions (e.g. course queries), 15000 read/write transactions, 600 aborted transactions, and 10 with conflicts.  The typical transaction reads about 10000 transactified variables, but maybe 10 (successful) XTNs a day will read upwards of 1 million variables.

Fenix is basically 3 main parts:

– a web application, written in Java with transactional & persistent semantics,

– JVSTM, a read-friendly (MVCC) pure-Java STM

– a data modeling schema, which maps the interesting web data into pure Java objects which are otherwise transactified & persisted.

 


Or How Getting There is Half the Fun

 

….ahhh but I get ahead of myself.  The trip itself has been very interesting.  I’m leaving for nearly two weeks (one in Lisbon and one in Toronto at ISMM & PLDI), and so I’m checking a bag – contrary to nearly twenty years of personal flying habit.  Normally I can fit into a small carry-on and missed luggage isn’t an issue… but again I get ahead of myself.

Since I’m leaving bright and early out of SFO I manage to crash at a friends house in SF the night before.  We arrive at the airport plenty early, security is light and I breeze up to the gate… only to discover that the flight is delayed because of storms in Philadelphia. It’s a 1-hour delay and I had a 1.5 hour layover in Philly to make the international connection… 1/2 an hour is still enough time, but definitely getting tight.  I chill in SFO for a few hours.

The flight to Philly is easy, smooth and quiet.  Some clouds floating about when we finally get there but nothing big.  We get our gate info while still in the air, plus the passengers are told that a number of people have tight connections – and would those passengers with relaxed schedules please allow the others to exit first.  Despite all the announcements when we land a number of elderly people jump into the aisles, then wrestle veerrryyy slowly with their luggage… then slowly make their way over to the parking garage area – clearly in no rush.

Meanwhile, me and about 20 other passengers finally exit the plane and have to rush from the end of terminal C to the end of terminal A – it’s a good 1/2mile run – in about 15mins.  I keep telling myself that the flight to Lisbon is an 8 hour overnight flight, and there’s no reason it cannot wait a few minutes and make it up overnight.

As I arrive at the gate in terminal A I can hear the wailing of a woman about missing the plane; sure enough the gate is closed.  I check my watch: I’m no more than 5 minutes late, maybe only 2 or 3 late.  They must have slammed the gate shut literally in this woman’s face.  It’s unbelievable: people are puffing up behind me, the plane is still attached to the jetway – but the gate crew is firmly apologetic.  Rules-is-rules and they cannot open the gate.

In stunned silence I make my way over to the service desk… and I’m maybe person number 50 in line.  Slowly the line grows behind me: it probably peaks at 200 people and snakes down the aisle gate after gate.  The desk crew is overwhelmed; they handle about 1 passenger every 10 minutes…. a little math, 50x10min: I’m going to be here for hours and hours.

A young couple in line next to me strikes up a conversation with me and an elderly man (probably in his 80’s) traveling alone.  The old codger is spritely & opinionated, with a distinguished British accent.  We all take a liking to him immediately.  After a bit the young man takes a brief hike and returns with bagels for our little impromptu group (due to time zones, early rising, and late flights, I’ve not had a meal in about 10 hours and certainly am not getting one on my missed flight).  The couple came through Chicago, also delayed for weather in Philly (and Obama flying through Chicago).  The old fella came through LA; same sad story.  We are all going to Lisbon, we ran and ran and ran and missed the flight by mere minutes.  So we all stood in line and chatted about the stupidity of it all (they couldn’t hold the overnight 8-hr flight plane for a mere 5mins???), and the weather, and airport food, and all the other things perfect strangers can complain about when stuck together.

After about 2 hours, USAir management finally figures out that something is up with this huge line snaking through the airport (which by this late hour is nearly deserted) and bring in reinforcements for the desk crew.  They split us up in groups of 5 to 10 people to head off to other desks.  When my turn finally arrives I get the bad news:

– I’m re-booked on tomorrow’s flight at the same time, so I’ve got 24 hours (well, about 20 now) to kill in Philly.
– No food or hotel vouchers.  A “distressed rate” is the best they will do.  I’m just out $75 for the hotel, plus all meals.
– No way to get my luggage: it’s been screened for international travel and they cannot get it back from the belly of the airport.  So no change of clothing or toiletries or medicines for me.

I turn around and see that the young couple (Trent and Krystal) are in a similar state of shock.  We decide to go the same “distressed rate” hotel together and make a tour of Philly in the morning.  The hotel at least was courteous, the room was pretty nice for $75/night, and I got a courtesy toothbrush (best the hotel could do) for toiletries.  We had spotted a 24-hr diner on the shuttle ride to the hotel, so we decided to meet in the lobby after checkin.  After a quick game of Frogger across a 4 lane divided highway we get to the diner and finally all get the first real meal of the day… at about midnight Philly time.

Next day we set about on a tour of Philadelphia.  The weather is gorgeous and we have a great time.  We see the Liberty Bell, Independence Hall, the first Supreme Court courtroom, Betsy Ross House, and lots and lots of quaint American history.  We tour the last east-coast Liberty Ship (last year my son got a chance to serve on the last west-coast Liberty Ship and totally loved it, so now we can compare notes).  We take pictures (their camera; they’ve promised to send the pictures when they can, so I’m looking  forward to seeing them yet).

After a full day we make our way over to the airport plenty early: no missing THIS flight by mere minutes.  We caught up with the old codger at the gate.  Turns out that in his polite but persistent way he got the airline to pay for his hotel and his food AND upgrade him to first class.  Being elderly, charming and British has its advantages.  We only got upgrades to exit-room seats and at other ends of the plane.  We hug and say goodbye.  I didn’t get a chance to see them in Lisbon.

One final airline travel note: my luggage didn’t make this flight.  It made the original flight on time.  Turns out that after closing the gate they stopped the plane on the tarmac in order to load my luggage.  So they took my bags (and those of about 20 others) – but not me (us).  Way to go US Airways.

My host in Lisbon (called Lisboa by the natives) is Joao Cachopo, and he’s agreed to take me site-seeing on my first day there.  Jet-lag to Europe is always really rough on me, I didn’t sleep on the plane (it *was* a smooth flight, I just can’t sleep on planes).  So I’m pushing being awake 19-20hrs and it’s just morning in Lisbon.  I need to stay awake until at least 8pm tonight – another 10 hours at least. I start main-lining espresso shots (if you ask for “coffee” here you get espresso).

Joao takes me and his family on a tour around the Portuguese countryside.  It’s a great tour; the western-most point of Europe, full of crazy cliff’s (pun intended) overlooking the sea.  It’s also the weekend local motorcycle gathering point, similar to Hollister is in California – hundreds and hundreds of bikes.  We see old-town Sintra, including this charming castle.  Be sure to scroll down to the part about the secret tunnels under the waterfalls and the “Initiates Well”.  I’d say it was straight out of Hollywood except it predates Hollywood … so it’s more a case of Art Imitating Life.  We eat the local fish (ok, it’s cod – which is not local, but is very very popular in Portugal; they say it’s served in 1000 ways), sample the local sweets, walk the cobblestone streets and generally take in the lifestyle and scenery.

It’s a good thing we just did site-seeing on Sunday.  Joao and his charming wife did their best to engage me in conversation and are generally wonderful to me… but I’m exhausted.  I’m pretty much a total zombie.  I barely make it back to the hotel around 8pm before collapsing in a puddle.

 


On How I Learned Something New About STMs
 

10 hours of sleep and a shower later, and I’m much better on Monday morning.

As I mentioned, JVSTM is a read-friendly pure-Java MVCC-style STM.  After recording a version number, read-only transactions can execute entirely without further synchronization.  The version number gives the read-only transaction a snapshot of all of memory; the XTN proceeds “as if” it atomically read everything at the XTN start.  Hence at commit-time a read-only transaction can always commit.

XTNs with writes have it harder of course; they need to atomically verify and commit (atomic with respect to other XTNs with writes).  The easy answer is just to lock write-XTNs against each other.  This works great as long as there are relatively few write-XTNs.
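
A very rough sketch of that MVCC scheme (JVSTM-flavored names, but this is a simplification, not JVSTM’s actual code):

  import java.util.concurrent.atomic.AtomicLong;

  // Hypothetical transactional box with a version history.
  final class VBox<T> {
    static final AtomicLong GLOBAL_VERSION = new AtomicLong(0);

    private static final class Entry<V> {
      final long version; final V value; final Entry<V> older;
      Entry(long version, V value, Entry<V> older) { this.version = version; this.value = value; this.older = older; }
    }
    private volatile Entry<T> head;

    VBox(T initial) { head = new Entry<>(0, initial, null); }

    // Read-only transactions: read as-of the snapshot version recorded at txn start; no locking needed.
    T read(long snapshotVersion) {
      for (Entry<T> e = head; e != null; e = e.older)
        if (e.version <= snapshotVersion) return e.value;
      throw new IllegalStateException("no value visible at version " + snapshotVersion);
    }

    // Write transactions: called while holding the single commit lock, after validation.
    void commitWrite(T newValue, long newVersion) { head = new Entry<>(newVersion, newValue, head); }
  }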

We talk STM shop, and Joao gives me a 70-slide 1-hr presentation he’s whipped up while I slept.  That’s a lot of slides!  Apparently he’s well known for being able to do that…  The presentation is good and we get down to nuts and bolts.  Basically he’s found a really good fit between a large complex workload that’s mostly (but not totally) read-only, it’s real-world (and to my eye that general style of web app is popular the world over), and it’s in need of a better solution than J2EE.

Joao is using a Data Model.  The data model describes the entities and their relationships, names the shared variables in his concurrent programs (things not named as part of the Data Model do not get transactified), AND is used to describe the database (i.e., persistence). 

This cute trick means that in one swoop the program:
– documents its data structure (e.g. to the programmers)
– describes its data structure to the DB
– names the transactified variables

By exclusion then, stuff not in the Data Model is not transactified. The STM does not deal with it.  Similar to Clojure, this approach names all the shared-variables explicitly (instead of assuming everything is shared by default) and thus reduces the count of shared variables by an order of magnitude (or more).  And *that* trick means the STM can be *much* more performant.

Unlike Clojure, this approach is not a new language – it’s a style of naming the shared variables (using a Data Model).  It does not require every variable to be transactified.  Shared mutable state *outside* the STM is broken (because an STM abort & roll-back will not unwind unnamed variables).  Programmers *can* use VBoxes manually, to handle the case of shared mutable state outside any transaction.  I’m sure some purists do not like this – but I see this whole approach as a powerful and pragmatic solution, and one particularly well-suited to a broad class of modern problems.
 

This is the first time I’ve heard of an STM being used in a production setting – with success.  The system is clearly alive and well, the STM plays a crucial role in both the performance and ease-of-maintenance in the system.  I’ve talked with a number of people who tried really hard in the past to apply an STM to their particular situation, and for whatever the reason those systems eventually got declared failures.

Given the success of this ‘hammer’ in the Fenix system, Joao’s grad students have been busy trying to find more nails, with some success:
– speculative execution
– a memoizing system
– an on-the-fly software upgrade system
– a concurrent lock-free commit algorithm for more write-heavy loads

 

My more dedicated readers can probably detect a shift in my attitude towards STMs here.  To be fair, this Microsoft blog summarizes my attitude towards STMs for the past three years… and my opinion has mostly not changed.   In short: highly negative, and lacking a killer app.  I think Joao may have found the killer app for STMs.

———–

Joao ends up taking me out to dinner a number of times.  It’s very good to have a local host in a foreign land especially if you don’t speak a word of Portuguese!  The food varies from good to great (grilled seabass w/veggies, and another night some unpronounceable French chicken+figs+ cinnamon thing), and the company is excellent.  We manage a little more site-seeing (St Jorges castle).  I give 2 talks (both old ones, both well received).  My four days fly by but eventually I have to leave.

As of this writing I’m on a (very smooth again!) flight to Philadelphia (again), courtesy of US Airways (again) with a connecting flight to Toronto.  ISMM and PLDI are in Toronto this year and I was on both Program Committees.  Furthermore, Richard Jones has thrown out his back and needs somebody to chair his session.  So I’m duty bound to go in any case.

More later as I survive Toronto – but rest assured that getting *there* is yet another tale…

 

Cliff

 

 

Interesting conversation(s) with Doug Lea

I had a set of long email exchanges with Doug Lea, reprinted here (with permission).  Lightly edited for clarity and the rare NDA-only material.  I resorted several of the interleaved conversations to make them flow better – but the original emails are highly interwoven.

 


We are discussing the merits of a no-fence CAS instruction – and exposing this instruction at the Java level.  Since Intel does not have a no-fence CAS, this whole idea is a no-op for Intel.  Right now there is a weak spurious-fail-allowed CAS in the Unsafe to mimic load-linked/store-conditional ops, to get buy-in from IBM whose chips do LL/SC instead of CAS.  Doug wants the fence-less CAS to allow spurious failure, and I do not because it causes me to add retry loops in places.

Cliff: 

Our only ordering constraint is data-dependent loads are ordered. For volatile stores we have to st/st fence (and ld/st fence – one instruction for both).  For volatile loads we have to ld/ld & ld/st fence.

Doug:

Not sure I buy the completely fenceless concept.  Without at least a st/st, what keeps a compiler from pushing all the calls in a loop to after the loop? Or even before the loop?

Cliff:  
Are you asking: with a fence-less CAS, why cannot the compiler just push it around willy-nilly?

Example before optimizing:
  while( long_time ) {
    ...stuff...
    no_fence_CAS(_counter,_counter+1); // increment perf counter with no fence
  }

 

Example after optimizing:
  while( long_time ) {
    ...stuff...
  }

  no_fence_CAS(_counter,_counter+long_time); // increment perf counter with no fence

Which is a pretty reasonable hack for performance counter work.

 

My no-fence-CAS-for-perf counters can be done on Intel with a ‘locked:add’ – except that locked:add has more ordering behavior than perf counters need.

Doug

If you think this is OK, then I believe you.  But I was thinking of the 99% of usages where you’d like the counters to be updated inside the loop.  (If only because you lose big if that one trailing CAS fails).

Cliff

Perf counters, not ‘unique id generation’. Perf counters have the property that they need to be statistically ‘close’ to reality, but not spot-on.  If you just do a volatile-variable ‘vol++’ you drop lots of counts under high load.  Easily 99% of the counts.  So you need to do an atomic update, or intel locked::add.  But there’s no requirement for ordering or perfection even.  Dropping 1% of the adds is fine.  You just need to not drop 99% of the counts.  For setting one-shot monotonic flags, it’s really the case that I need ordering, but I’ll get ordering elsewhere.
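
A sketch of such a lossy counter (the j.u.c atomic here fences more than this use strictly needs; the point is the single, no-retry update attempt):

  import java.util.concurrent.atomic.AtomicLong;

  // Statistically-accurate counter: one CAS attempt, no retry loop.
  final class LossyCounter {
    private final AtomicLong count = new AtomicLong();

    void increment() {
      long c = count.get();
      count.weakCompareAndSetPlain(c, c + 1);   // Java 9+ relaxed form; a lost race just drops this count
    }

    long estimate() { return count.get(); }
  }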

Doug

OK. If using CAS, you need a loop, which is unmovable anyway so fences don’t buy you anything. So fenceless CAS is OK here. But so is fenceless weak CAS. Which still keeps me thinking that pairing fenceless with spurious-fail is fine? 

Cliff

Fencing  – Fences cost hugely more than a run-once or run-twice loop.  Any time I can say “no fencing please” I want to.

Spinning – Under high load, my perf counters will fail their CAS once – 2nd time it’s hot in cache.  So I *should* spin my perf-counters and not fence them, but I rarely do (*some* I do, depends on context).  In the miles and miles of JIT’d C1 code, all those perf counters I do not spin-loop, I just lose the counts.  Too much i-cache pressure, too much slow-down, not enough gain.  They are rarely *highly* contended (except in a context where I’ll shortly C2 JIT anyways). 

 

Hot-locks I always CAS-loop; the ‘loser’ of the lock has cycles to spare because he can’t get the lock anyways.  He might as well tough out a cache-miss and bump the counter proper before going to sleep.

Spurious Fail – Spurious fail costs me a loop anytime I need a guaranteed CAS result (so not perf counters) and a 1-shot no-fence CAS would have done the trick – which is mostly limited to 1-shot inits.  Turns out I have lots of those; every method that can be called from compiled code needs a “C2I” adapter – these are racily made in parallel from all threads hitting this path at once (can be hundreds if a whole thread pool switches to a new task flavor) and they all do a 1-shot no-fence CAS.   Losing C2I adapters are reclaimed by GC later.  After that CAS they do a volatile read of the same field and jump to the address loaded.

 

I do C2I adapters this way, plus MethodStubBlobs (replacing HotSpot’s ICHolders), plus I2C adapters, plus CodeProfiles, plus native wrappers, plus… in short, I got lots of 1-shot inits that I racily make in parallel, 1-shot CAS install, then re-read with a volatile read.  Since I’m generally only after the data-dependent results of the 1st read, my ‘volatile read’ fence can nearly always be a NOP (except perhaps on Alpha or IA64). 
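
A sketch of that one-shot racy-init pattern (names are made up, and AtomicReference gives stronger ordering than the no-fence CAS being argued for):

  import java.util.concurrent.atomic.AtomicReference;
  import java.util.function.Supplier;

  // Many threads may build a candidate in parallel; exactly one CAS wins; losers become garbage.
  final class OneShotInit<T> {
    private final AtomicReference<T> slot = new AtomicReference<>();

    T getOrInstall(Supplier<T> builder) {
      T v = slot.get();
      if (v != null) return v;
      slot.compareAndSet(null, builder.get());  // losing builds are reclaimed by GC later
      return slot.get();                        // re-read: everyone sees the single winner
    }
  }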

 

Now, this is stuff that’s all in Azul’s HotSpot and not written in Java code so maybe it can be taken with a grain of salt…. although NBHM clearly had places where a 1-shot no-fence no-spurious-fail CAS would do the trick, but I was stuck with either looping or fencing from the Unsafe spec.  I choose looping but I could never test the failure backedge because I never ran on a system where the hardware CAS would actually fail (i.e. I never ran on a Power system with LL/SC).

Doug

Then why not always JIT on CAS failure?

Cliff

’cause I’d have to use a CAS & test the result.  Right now I do a plain ld/add/st.  I’m talking C1 JIT’ed profiling here… once the counts hit 10000 I kick in C2.  If C1 is dropping counts for being highly racey & under load, I’ll still hit 10,000 real quick – I’ll just lose some sense of *relative* hotness between different leaky counters.

 

But for Java Atomic::addXXX flavors, yes the Right Thing To Do is to re-JIT on CAS on failure.  I’ll have to look and see what’s being emitted by C2.  I suspect the loop backedge is listed as being rare but not non-zero.  I probably should do a Heroic Optimization thingy and make the backedge zero until the CAS fails, then record the failure & re-JIT.  Not hard, but not non-zero.  Hummm…. it won’t really help the code either, because that re-try loop doesn’t hurt any other optimizations really.  It’s not like I’m really removing any interesting code from the processor’s hot-path, nor removing any optimization-blockers from the compiler.

Doug

*Spurious Fail* – I’m still skeptical about this, but I suppose it wouldn’t hurt to include compareAndSetInRelaxedOrder to calm you down about it  🙂

 


We roll into a discussion of Fork-Join behavior and leaky compile-trigger counters

Doug

“If C1 is dropping counts for being highly racey & under load, I’ll still hit 10,000 real quick – I’ll just lose some sense of *relative* hotness between different leaky counters. ”  I’m not so sure that these work out as well as you think.  I see surprising patterns in when various methods are compiled in FJ.  If you are interested in exploring some of the effects here, grab current jsr166y.jar

 

http://gee.cs.oswego.edu/dl/jsr166/dist/jsr166y.jar

And run either of the attached simple fork/join programs with +PrintCompilation or whatever set —

  java Integrate NCPUS  {d, f}

  java LeftSpineFib NCPUS 45 5

For Azul, probably 96 is good test. As in “java Integrate 96 f” (the “d” is for dynamic partitioning, which is usually about 2X faster, but also often shows different compilation patterns. Dunno why).

Cliff

For Integrate: the #1 hot method at 85% of all ticks is “Integrate$FQuad.recEval(DDDDD)D”.  It has no inlining, nor would any inlining decision change help.  Here’s the profile with a 40Gig heap.  Note that you’re allocating 4Gig/sec here, so GC remains an interesting cost.  All other inlining decisions do not matter – they have essentially no impact on the ‘score’.  So I claim that dropping counts here makes no difference.  You have an extremely simple & dominate profile.  I’d get the same code for the hot method if I dropped 1% of the counts or 90% of the counts.  There’s a lot of FP math in there, almost no control flow, some allocations, some fences that run in 1 clk (so I suspect they are final-field fences).

Percent Ticks Source
86.1% 225,836 Integrate$FQuad.recEval
12.8% 33,435 new_tlab (vm stub code)
0.4% 1,083 SharedRuntime__new_heap (vm code)
0.2% 603 VM_JavaCode (vm code)
0.5% 1,187 Other

 

For LeftSpineFib the story is different.  There’s 2 hot methods, LeftSpineFib.compute and .seqFib.  I got them inlined in one run and not in another.  When not inlined, there’s a lot of allocation and performance is ~1.3sec per iteration.  I only got the inlined version once, and it appeared to run faster and not allocate very much.

 

Nonetheless, there appears to be only 1 interesting inlining decision to be made here. All other decisions are “chump change” – I only need to get this one decision right.  I’m not clear what the issues are without a full-blown debugging session.  It would help if I could make LeftSpineFib run longer (say 10’s of minutes) – it’s all over in a minute or less so it’s hard to get a good profile, or look at the compiler’s inlining decisions.  It’s clear from the profiles that the compiler *ought* to do the right inlining: call site counts are in the 100K’s within a few seconds.

Doug:

Happy you could replicate at least with this one.  You can make this run longer with larger values of fib (2nd argument) Try a larger value than 45. It is 1.6X slower for each added value. So maybe 49 is good. The answers will be wrong because of int overflow past 46 though.

Cliff:

Switch to Long math – same speed on Azul & x86/64

Cliff

You are a tragic victim of profiling a recursive function.  I don’t get separate profiles for each layer of the recursive depth, and C2 hasn’t a (real) clue how many times to inline.  Recursive functions are not common In Real Life.  So some static heuristics kick in after awhile.  If you are desperate to prove that non-leaky counters will change something here (they won’t, your profile is highly regular) you can rename the recursive functions seqFcn_1, seqFcn_2, etc for as many layers as you get bored with.  Then I’ll get separate profiles for each layer and I claim C2 will inline “enough”.

Doug

“Recursive functions are not common In Real Life”.  That could change if FJ gets used a lot to parallelize loops and aggregate operations (as Clojure already does). So you might want to think about this before it happens?  It’s reasonably far down the priorities though – much more important are things like the loop dispatch problems we discussed a few times.

 


A quick diversion into a bug which turns out to be mine.

Cliff:

fyi… when I run in various hotspot debug/slow modes I can trigger a situation where the FJ framework spawns threads endlessly.  I generally kill it after I get 10,000 or more threads…

Doug

For which of the programs do you see this? Any guesses about how I can try to replicate?

Cliff:

java Integrate 24 f – and a debug 64-bit HS build

Doug

I can’t replicate using the current fastdebug build at

http://download.java.net/jdk7/binaries/

on 4X2way AMD or 2Xi7 Intel. I guess I should try to build my own debug build. 

(Aside: they seem to have changed layout of fastdebug releases without changing instructions. I wasted an hour or so before just moving fastdebug .so’s into jre/lib/amd64/server for a fresh install, which seemed to work. Sigh.)

Doug:

I just did (current openjdk sources) and still can’t replicate on machines here?  When you did this, were the #threads lines printing large values, or did you just observe a lot of threads? Any other hints?

Cliff:

Lots of threads being created, and no progress on the app front.

Doug:

I take this seriously because there are some delicate invariants here wrt thread counts, which I’ve convinced myself are right, but informal proofs are no match for reality. But I’m currently thinking this is some debug-build hotspot oddity.

Cliff:

But this is with my build which itself has some interesting lock/wait/notify designs.  Maybe I can send you some stack traces of the thread that is busy making threads.

 

Cliff

[I later find the bug in Azul’s locking; related to an unfortunately timed lock deflation caused by rapid GC cycles.  Fortunately it’s an engineering-prototype only bug.]

 


We roll into a conversation about *when* to JIT.  Or really: Doug is asking for special JIT treatment of ‘master’ threads which control the launch of other threads, and I’m claiming he’s fallen victim to an unknown performance bug.

Doug:

While on the subject of compilation triggers, here’s a reminder of another old topic: That counters aren’t too useful in detecting need for compiling code that may not be invoked all that often, but when it is, enables a bunch of other threads to run. So speeding it up by compiling can have dramatic impact on aggregate throughput. The main examples are code with notify{All}, unpark, as well as many uses of simple CASes  (to release spinning waiters). It would be swell if code with such calls got some extra compilation points.

 

This hits especially hard in FJ, where ramp-up time to wake up all the workers can take longer than the actual computation. To improve things a little, I manually inline accessors, Unsafe calls, etc., in some of that code so that it runs enough faster even when interpreted to make a noticeable difference. If you are curious, the main points are ForkJoinPool signalWork and preJoin. For some explanation, see the internal documentation at

http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/jsr166y/ForkJoinPool.java?view=log

Cliff:

This doesn’t make any sense – unless the counters are leaking badly.  After 10,000 hits, it gets compiled.  Assuming not-leaky-counters and 100 threads a mere 100 trips through and you get C2 compiled.

 

“I manually inline accessors, Unsafe calls, etc” – I saw all these methods get JIT’d fairly early on, and after JIT’ing they disappeared from the (TXU/Vega) profile.  

 

(ed: Cliff Notes – added during blogging and not in the original email exchange – I think it a colossal JIT failure that Doug Lea is hand inlining; and now I am personally determined to get to the bottom of this)

Doug

Azul hotspot is then apparently tracking counts better than Sun hotspot?

Cliff:

Perhaps.  Also, if 1000 threads are runnable on a 4-core X86 and the thread priorities are not properly set up and honored, then the JIT threads have to compete with the runnables and typically lose badly.  This was always a bad issue with VolanoMark until we finally got the thread priorities set right.

Doug:

Hopefully people don’t do that — parallelism should normally be approx #hw threads.

Cliff:

Sorry to disappoint.  I’ve seen:

– Fixed.  Always much smaller than what an Azul box needs. 

– Small constant * num_procs (small=1 to 4).  Always much larger than what the code can scale too on an Azul box. 

– More threads added as external event happens (or not).  Frequently causes endless thread-create. 

– Rarely, threads scale till either the limit of CPUs or load or locking events dictate.

 

A couple of big popular app-servers either do-it-right or can be tuned. 

Many home-grown solutions fall over here.

Cliff:

Now, suppose instead we have leaky counters, and the nature of FJ is to make the leaky-ness much worse (because the threads all bunch up in the same code at the same time).  Suppose of the 100 threads only 10% or 10 ‘counts’ stick.  Then it takes 10 times longer to reach the point where we start JIT’ing.

Doug:

Well it does still make sense regardless of leaks  🙂 

It is possible for a single slow path that releases many threads to arbitrarily slow down throughput, and your counters would not even be able to tell. And sometimes not only possible but common. 

Cliff:

No?  It should not be possible.  After 10000 *total* trips through the code, across all threads, it should JIT.  So either you don’t hit 10000 trips through the code – which means max possible slowdown is 10000*(JIT_speed-interpreter_speed) – which will be measured in milliseconds – or else you hit more than 10000 total trips and it gets JIT’d.

Doug:

But you have to apply Amdahl’s law here — the other threads don’t run at all until woken up. Which can be roughly approximated by multiplying your slowdown calculation by #worker threads, which is normally #hw threads.

Cliff:

But… the slow wake-up thread only has to do 10000 wake-up events, then it’s JIT’d.  So 10000 times threads are slow to wakeup – e.g. costing an extra few micros per – so total app time stalled by 0.01 to 1 sec.

 

If you’re still seeing slow wakeups then there’s something else going on.  I’m totally willing to believe in stampede issues or dropped count or slow/missing wakeups or etc… but not lack of JIT’ing…. unless Sun has a bug here that Azul does not.  But I claim it’s a straightforward bug and not anything that needs special treatment.

Doug:

The wakeups normally happen only a few times per pool submission (i.e., the root of a tree of recursive tasks). So waiting for 10000 is not very nice. Knowing this, I invoke some of the wakeup code in places that hardly ever actually need to wakeup threads but at least cause that code to compile sooner.

Cliff:

But again, why can this ever matter?  Total time the app is slowed down is <1sec.  The initial task is slow to get kicked off… but not that slow, and all the remaining threads run/launch at JIT’d speed.

Doug:

I suppose that conceptually you could counteract this by increasing count by #CPUS instead of by 1 when you see notify/ unpark/ CAS?

 

More generally, slow-wakeups (not just because of not being compiled, but because unpark can be slow) is sometimes the biggest problem I see in FJ. It is easy to check that this is so by just commenting out park calls and letting them spin. Which of course I can’t do in production code. There are a lot of little things I did in a big refactoring pass a few months ago that each help this a little, so it is not currently a huge problem, but everything I can do will help a little more.

Cliff:

Yeah – here we’re talking Real Meat.  Park can be slow – or not – at the whim of a lot of factors utterly unrelated to the timing of JIT’ing.  The OS can screw up, the JVM’s lock-fast-path can screw up, the JIT can fail to inline, the notify/wait/wakeup path can screwup, hypervisors can screw up (you are *not* trying this under a hypervisor, right? – No chance to get it right under VMware until it’s Right without it), etc.  Also, spinning beats Parking if you got cpus to spare.

 

Really I suspect you’re seeing something not related to the lack-of-timely-JIT’ing.  I totally can believe the hand-inlining stuff, etc is paying off… but it’s not because the basic design needs tweaking.  There’s something else busted.

Doug:

One place to wonder about is interaction with GC. FJ spews garbage, but it is for the most part “nice” garbage. Typically 99% of tasks are not stolen, so die quickly, so there is a lot of early-generation/nursery level GC.  Relying on generational GC rather than expensively manually managing memory for tasks is the main reason we can outperform cilk and other C/C++ workstealing frameworks. But in exchange we sometimes do get GC stalls at very inconvenient times, that lead to postponing setting status fields, counts, or performing wakeups, all of which could cause other threads to continue.  Which suggests finding ways of discouraging GC at some of these points. Which I guess amounts to reducing non-inlined method calls or back-branches in a few critical places. Probably the manual inlining etc I already do helps here, but I should probably try to pinpoint other cases.

Cliff:

Interesting… also there are interactions with non-STW GCs.  In a STW GC, everything halts except time (so System.CTM & CTN roll on but there’s no progress in work, so heuristics get wedged).  In a non-STW GC, threads halt more independently & randomly (but hopefully for shorter time periods).

Doug:

(Also, on thinking what I wrote about spins vs park yesterday, I’m led to try a few more adaptive-spin constructions.)

 


Now we’re look at hacking algorithms to better fit the GC (Doug’s choice) or hacking the GC to better fit the algorithm (my choice).

Doug:

While I’m at it: One reason GC is a little slower than it ought to be is that I initially size the workstealing array spaces to 8192 (which is MUCH bigger than they typically need to be — 64 is always enough for plain strict FJ programs) to avoid cardmark contention (plus plain cache contention) for writes of tasks into the slots. On x86 especially, eliminating this contention much more than makes up for increased useless GC scanning costs, but probably hurts on Azul (it is on net about even on Niagara).

Cliff:

Nah, just for You, because You’re Special, we (Azul) filter cardmarks.  🙂

 

On Azul hardware, the hardware filters between generations – so only storing a young-gen ptr into an old-gen object goes into the card-mark slow-path.  In that path we do a read/test/store.  Repeatedly marking the same card only does (cache-hitting) loads, no stores.
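
A pseudo-Java sketch of that filtered card mark (the real barrier is in the VM and hardware; card size and names are assumptions):

  // Hypothetical filtered card-mark write barrier.
  final class FilteredCardMark {
    static final int  CARD_SHIFT = 9;            // 512-byte cards (assumed)
    static final byte DIRTY = 0, CLEAN = 1;
    static byte[] cardTable;                     // one byte per card, set up by the GC

    static void writeBarrier(long fieldAddress) {
      int card = (int) (fieldAddress >>> CARD_SHIFT);
      if (cardTable[card] != DIRTY)              // repeated marks: cache-hitting load + test only
        cardTable[card] = DIRTY;                 // store only on the first mark of this card
    }
  }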

Doug:

G1 supposedly reduces cardmark contention too, but it still is much too slow to use for FJ. 

It would be nice to have some sort of #ifdef for initial sizing.  (See that P-Ray mail for a prod of other kinds of config information I’d like to have available…)

Cliff:

I *claim* (without a shred of evidence) that our GC will blow away G1 and not need any configs.  At least, we’ve never needed any GC configs on Azul gear.  Really we claim it’s a bug if the GC doesn’t just “Do The Right Thing” right out of the box.

Doug:

But this case (sizing array of refs to 8K vs 64) is already a workaround of sorts — trading slower GC scan times for less contention. On VMs that don’t have these cardmark contention problems, I’d rather not have users pay the price of slower GC and bigger footprint for no benefit.

Cliff:

Ahhh, you are changing algorithms to help with different kinds of GC.  I *claim* (again without experimental evidence) that Azul’s GC is utterly immune here and doesn’t care if you pick 8K or 64.  Certainly an extra 8K vs 64 ref array isn’t going to matter.  And we filter cardmarks.

 


We get closer to the issue with Doug’s thread-launch-rate problem

Cliff:

Also… I see the high allocation rate on your FJ programs you’ve given me, but nothing at all about the various slow wakeup issues.  After the first 2 or 3 numbers report back (in 2 or 3 seconds), then “Its All Fast”.  Everybody is working hard, etc.  This is with both a native Sun HotSpot “-d64 -server” and Azul.  Can you send me a program which shows off the bad behavior on a plain Sun HS?  What exact gear shows this, which rev of HS, etc?

Doug:

This was harder than it seemed — all those small improvements I did a few months ago have made some of clear-cut cases I had in mind less clear-cut on x86. Some remain clear-cut on Niagara though, which I’m now guessing has more to do with GC/safepoint checks than I had thought (less inlining on sparc -> more safepoint checks -> more GC where I don’t want them?)

Cliff:

Known bug/issue w/Solaris – thread starvation.  Suppose you launch 32 threads on a 32-way Niagara.  Java threads of course.  The concurrent GC spawns a happy 20 more VM threads, plus some JIT threads.  Now your 32 workers take all the CPUs, and the GC “panics” because the initial heap is small and the initial allocation rate is high – so it both bumps the heap AND spawns emergency full-speed background GC work.  The poor JIT threads starve awhile until the GC gets a bigger heap from the OS and stops panic’ing….  This happened to us on VolanoMark all the time, till we fixed both the JIT threads’ priority (they MUST make progress against the Java threads) AND the GC heuristics.  Could totally be a Sun/Solaris thread priority bug here.

Doug:

Playing around with this, I think it might play some part:  On Niagaras, I get faster compilation using a few less than #hw-threads workers in pool. Not a huge effect though.  I just saw a bigger effect with some exploratory hacks to more often avoid wait/notify mechanics in join. See the internal documentation in ForkJoinTask about why I need to use these rather than raw park —

http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/jsr166y/ForkJoinTask.java?view=log

So it may be the case that I wasn’t avoiding bias etc effects as much as I thought I was. And that I’m seeing them more on Niagaras because these machine run with a lot more hw-threads -> parallelism level than my x86s (128 for Niagara2 vs 16 for my biggest x86). This deserves more experimentation.

 

All in all these are in keeping with my sense that there is not one single big reason for sometimes-slow ramp-up, but lots of smaller effects.  Many of which I can actually do something about, which is always better than feeling like I’m at the mercy of you JVM-level folks 🙂

(Albeit with occasionally sleazy practices like the above array sizing.)

Doug:

Anyway, I’ll try exploring this a bit more before trying to conclude much about it. In the mean time, I attach a few more tests in case you feel like looking into this more.

Cliff:

On retrospect… [ed. and some days later] see if the issue is that the launching/controlling/master thread is going to sleep.  Common OS situation: you just fired up 17 threads; they all *start* on the same CPU before spreading out to their (idle CPU) neighbors.  Along the way the master thread gets kicked off for awhile by a worker and doesn’t get a chance to keep launching threads.  Azul paid close attention to this issue: we get 100’s of threads running in a handful of millis – common enough in various HPC-style barrier-compute-barrier apps – but it’s hard work to get right.

 

Doug:

Been there. Pretty much under control. Here’s a paste of main explanation:

     * 4. Managing idle workers waiting for tasks….

…The net effect is a tree-like diffusion of signals, where released threads and possibly others) help with unparks.

 


Now we’re into biased-locking effects.

Cliff:

Naturally we’ve completely re-done locks so I suspect your hacks are all for naught. 

 

I think you’re looking at performance-unstable algorithms.  i.e., performance is very sensitive to conditions that you haven’t nailed down yet.  Things like whether or not some particular lock inflates early or late (more overhead after inflation if a thin-lock would have worked), or a badly timed GC stalls some threads while others get to run on, or your work load becomes unbalanced, or sometimes a shared queue gets hot in a shared L2 and CPUs get tasks fast – vs it starts cache-missing, so all CPUs complete their jobs & need the queue, so the queue ping-pongs and remains slow, etc.

 

I can totally arrange scenarios where things like queue/shared-cache placement makes a 10x performance difference.  Sometimes the queue is where it needs to be when it’s needed, and sometimes not.  I don’t have any easy fix for such algorithm sensitivities, except via careful discovery of the “performance potholes” then designing to avoid them.

 

All HS thin-locks auto-inflate as you add waiters; why do you want a special coding hack to avoid that?

Doug:

Because there is a higher-than-we’d-like probability that the signal will come before the wait.  The main unhappy case is where null-signaller reserves the lock, so the waiter (completely uselessly!!) waits for GC or some unbias point.

Cliff:

Huh?  ‘signal’ meaning the ‘notify’?  as in: you are sending a ‘notify’ to a lock with no waiters, so the notify is dropped on the floor? I guess I’m not getting it.

 

BTW, I dropped the ‘displaced header’ mechanism, and also the ‘thin lock’ state.  Azul locks have these states:

 

– never locked

– biased to a thread

– inflated monitor

– – monitor is biased (cost is same as ‘biased+one indirection’)

– – monitor is unlocked & will count recursive locks

– – monitor is locked & counting recursive locks (cost is same as ‘thin-lock+one indirection’)

– – monitor is biased & the bias is mid-revoke (cost involves safepointing 1 thread & crawling 1 thread stack)

 

Wait’ing inflates a lock & then lines up on the wait’ers list for the monitor. 

Notify Does The Right Thing, which for a non-inflated lock is a no-op.

 

I could make Notifies to biased-locks auto-inflate under the assumption that the notify is racing with a Wait’er.

Doug:

Yes, autoinflating on any notify would help here, and strikes me as rarely being a bad idea in general.

Cliff:

Noted.

Doug:

When waiting for pending join a worker does:

 

1. Decide whether to wake up or create spare thread

2. CAS in task state bits requesting signal

3. Do some other bookkeeping and counter updates

4. get task synch lock and Object.wait for task state to change

Cliff:

So it’s a missed-notify by design? Perhaps you could flip the task state bits under the task sync lock?

Doug:

I used to do something like this. It flips bias to the soon-to-be waiting thread, which is even worse.

Cliff:

why is it worse? For Azul, the call to wait() forces inflation.  And now it sounds to me like you want wait’rs to spin a bit, to see if a quick notify is coming along.  Also, inflation is a 1-shot cost per-lock. Are you constantly locking new objects (hence the cost to inflate is added to the original allocation cost)?

Doug:

Yes, each join is waited on using different task object, and 99+% of time, only one join per task.

Cliff:

Blah.  Unlike Park’s this will mean making a Monitor object per join, and reclaiming them later.

Cliff:

Also, if you are using missed-notifies by design, do you want to do timed waits?  That way if you miss you’re stuck waiter will eventually get to try again.

Doug:

I don’t know why I hadn’t checked this lately (for many months) but comparing -XX:-UseBiasedLocking shows that I definitely am getting hurt by this in at least some applications — as much as 20% aggregate throughput differences on a couple tests depending on thresholding and spinning details in preJoin.

Doug:

On the other side (a thread completing a task being joined), it is possible and not all that uncommon to see the signal request from (2) above, and thus do notifyAll in a sync block before the waiting thread even waits. When this happens, it can cause lock bias to kick in in favor of the signaller, so the waiter cannot even get the lock to do the wait. Annoying.

Cliff

This sounds like a Sun-ism.  Azul BiasedLocks can flip owners without too much overhead and we do it on a lock-by-lock basis. For Sun, revoking the bias on an object is quite expensive.  It’s cheaper on Azul, but I dunno if it’s cheap enough. Definitely if a wait OR notify hits an object I should immediately inflate and skip the bias/revoke-bias steps.

 

Doug:

I’ve considered several times avoiding all these lock anti-optimizations by making my own monitor-cache-like mechanism (and then using plain park/unpark), but have never actually tried it. I probably ought to, to see if the lookup etc overhead can be made small enough that I can stop relying on lock mechanics that can vary so much.  Seems unlikely (among other issues, I’d need weak-ref based lookup, which amounts to trading one set of uncontrollable weirdnesses for another  🙂 ).

 

It is probably better to continue pushing on smarter adaptive spinning (as in notes a few days ago) so that unwanted lock effects become even less frequent.

 

All in all, it is probably time to stop trying to cope with interactions with VM locking heuristics and to explore some schemes avoiding sync locks. I just thought of one that also avoids weak-refs (which is what I was most worried about) at the expense of a bunch of code bulk, which might get me in trouble wrt inlining and compile thresholds etc again, but worth a shot.

 

HPC folks, even Cilk, usually don’t even bother suspending threads in these cases. They own the machines, so are content to allow spins for even many seconds. We can’t do that. FJ does though support helpJoin, which does non-suspending helping joins that process other tasks while waiting for the target to complete. This works great so long as you have “strict” computations (pure trees, no nested parallelism). That LeftSpineFib test uses this, and has much lower variance across runs of top-level tasks. Too bad that layered APIs like ParallelArray or Clojure/Scala/X10/java-with-closures frameworks can’t usually use it, because they don’t know the shape of the computations.

 


Topic change again…

Doug:

When using FJ for parallel forms of apply, map, reduce, etc on arrays and collections, we’d like to have some idea of how much computation per element there is. As in coll.apply(addOneToEachElement) would only spawn subtasks if a lot of elements (and cores), but coll.apply(findNextMoveInGame) might always do so. It seems that performance counters could provide some help here if they were exposed in some way to library code. Without this I can still do OK in many cases by using queue-sensing etc to determine dynamic thresholds. The only big losses come in the extreme cases where you don’t want to partition at all (like add one to 10 elements), which would be nice to avoid.

 

(In essence, counters are an approximation of WCETs used for similar purposes for RT stuff.)

 

Any thoughts on plausibility? If you think this might go anywhere, I’d probably want to drag in some Sun hotspot and IBM j9 folks because we’d want to at least semi-standardize it.

Cliff:

You could sample execution times via nanoTime.  You could trend the last 10 (100, 1000?) nanoTime’d execs pretty cheaply.

 

You could also waffle the chunk size up & down, profiling as you go.  Average-time-per-chunk seems low?  Then increase the chunk-size.  Seems high (and CPUs are available)?  Then make more, smaller chunks.  Etc.
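
Something like this feedback loop, as a rough sketch (not FJ code; the target time and the size bounds are invented numbers):

  final class AdaptiveChunker {
    static final long TARGET_NANOS = 50000;    // aim for ~50us of work per chunk (made up)
    private int chunkSize = 1024;              // starting guess

    int nextChunkSize() { return chunkSize; }

    void recordChunk( long startNanos ) {
      long elapsed = System.nanoTime() - startNanos;
      if( elapsed < TARGET_NANOS/2 && chunkSize < (1<<20) )
        chunkSize <<= 1;                       // chunks too cheap: make them bigger
      else if( elapsed > TARGET_NANOS*2 && chunkSize > 1 )
        chunkSize >>= 1;                       // chunks too costly (and CPUs idle?): split finer
    }
  }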

 

(this conversation is ongoing, and currently Doug is “it”.)

 

Cliff

 

 

Un-Bear-able

More news from the Internet connected-tube-thingy:

 

Cool –

From: cmck….

I’m going to reference the blog on the landing page tomorrow. I know the readership will be more than pleased that we successfully poked the bear and got some insight from Cliff.

 

Also –

 

TheServerSide.com managed to raise the ire of Azul Systems’ Cliff Click, Jr., …

 

I’m not just a bear, I’m an irate bear!  

 

Just so’s you know – it takes a lot more than a casual question about “where’s my Java Optimized Hardware” to make me irate.  In fact, that’s a very good question – because the answer is not obvious (ok, it was obvious to me 15 years ago, but I was already both a serious compiler geek and an embedded systems guy then).  But it’s not-obvious enough that quite a few million $$$ have been spent trying to make a go of it.  Let me see if I can make the answer a little more bear-able:

 

Let’s compare a directly-executes-bytecodes CPU vs a classic RISC chip with a JIT.

 

The hardware guys like stuff simple – after all they deal with really hard problems like real physics (which is really analog except where it’s quantum) and electron-migration and power-vs-heat curves and etc… so the simpler the better.  Their plates are full already.  And if it’s simple, they can make it low power or fast (or gradually both) by adding complexity and ingenuity over time (at the hardware level).  If you compare the *spec* for a JVM, including all the bytecode behaviors, threading behaviors, GC, etc vs the *spec* for a classic RISC – you’ll see that the RISC is hugely simpler.  The bytecode spec is *complex*; hundreds of pages long.  So complex that we know that the hardware guys are going to have to bail out in lots of corner cases (what happens on a ‘new’ when the heap is exhausted?  does the hardware do a GC?).  The RISC chip *spec* has been made simple in a way which is known to allow it to be implemented fast (although that requires complexity), and we know we can JIT good code for it fairly easily.

 

When you compare the speed & power of a CPU executing bytecodes, you’ll see lots of hardware complexity around the basic execution issues (I’m skipping over lots of obvious examples, but here’s one: the stack layout sucks for wide-issue because of direct stack dependencies).  When you try to get the same job done using classic JIT’d RISC instructions, the CPU is so much simpler that it can be made better in lots of ways (faster, deep pipes, wide issue, lower power, etc).  Of course, you have to JIT first – but that’s obviously do-able with a compiler that itself runs on a RISC. 

 

Now which is better (for the same silicon budget): JIT’ing+classic-RISC-executing or just plain execute-the-bytecodes?  Well… it all depends on the numbers.  For really short & small things, the JIT’ing loses so much that you’re better off just doing the bytecodes in hardware (but you can probably change source languages to something even more suited to lower power or smaller form).  But for anything cell-phone sized and up, JIT’ing bytecodes is both a power and speed win.  Yes you pay in time & power to JIT – but the resulting code runs so much faster that you get the job done sooner and can throttle the CPU down sooner – burning less overall power AND time.

 

Hence the best Java-optimized hardware is something that makes an easy JIT target.  After that Big Decision is made, you can further tweak the hardware to be closer to the language spec (which is what Azul did) or your intended target audience (large heap large thread Java apps, hence lots of 64-bit cores).  We also targeted another Java feature – GC – with read & write barrier hardware.  But we always start with an easy JIT target…

 

Cliff

 

Welcome to my new Blog Home

Welcome to my new Blog Home!

 

And for those of you who made it, this quick tidbit.

Everybody in the JVM business optimizes for long arraycopies/memcpys.

You can get peak bandwidth on machines with a decently long memcpy.

It’s easy enough to do and arraycopy and memcpy are called a *lot*.

But what about short arraycopies?  How often do we call System.arraycopy with small numbers?

 

In a run of JBB2000 (yah, the old one – I happen to have numbers handy for it), how many times a second is System.arraycopy called?  Yes, the answer obviously depends on your score.  Let’s assume your score is middlin’-high – say 1,000,000 BOPs/sec.

 

Did you guess 30 million times/sec?  That’s 30 calls to System.arraycopy per BOP, on average. 

Now, how many bytes are moved on average?   – 42.5

That’s less than an x86 cache-line.

 

Getting a good score on JBB (and on many many benchmarks) depends on getting the overhead of short arraycopies reduced as much as possible.  Yes, in the end you need to do well on the rare 1Megabyte array copy… but it’s more important to copy those first few bytes with as little overhead as possible.
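
If you want to see the effect yourself, a quick-and-dirty loop like this (not a rigorous benchmark – all the usual microbenchmark caveats apply) shows that the 42-byte copy is all fixed per-call overhead, while the 1-Megabyte copy is pure bandwidth:

  public class CopyCost {
    public static void main( String[] args ) {
      byte[] small = new byte[42],    smallDst = new byte[42];
      byte[] big   = new byte[1<<20], bigDst   = new byte[1<<20];
      time("42-byte copies", small, smallDst, 30000000);
      time("1Meg    copies", big,   bigDst,   1000);
    }
    static void time( String label, byte[] src, byte[] dst, int iters ) {
      long t0 = System.nanoTime();
      for( int i=0; i<iters; i++ )
        System.arraycopy(src, 0, dst, 0, src.length);
      long ns = System.nanoTime() - t0;
      System.out.println(label + ": " + ((double)ns/iters) + " ns/call");
    }
  }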

 

Cliff

 

 PS – We’re working on the RSS feed

 

Odds-n-Ends 2

A collection of recent questions & answers…

 

These first two comment/questions comes from this TSS thread:

Java bytecode is NOT suited to be run on real hardware. It’s stack-based, so pipelining goes out of the window. In theory, one can do on-the-fly translation from stack-based to register-based machine, but it’ll require A LOT of transistors.  So in reality, it’s ALWAYS more effective to JIT-compile Java bytecode and then run it on a common CPU.

 

Azul’s CPUs are classic 3-address RISCs, with very few special-for-Java features.  Our addressing math matches 64-bit Java math exactly (and pretty much nobody does; saves 1 op per array address math).  We allow meta-data in the pointers which the LD/ST unit strips off.  We have read & write barriers in hardware (these last are really for GC and not Java).  We have a fast-inline-cache instruction.  We have a fast-not-contended-CAS for all those not-contended Java locks.

We do NOT have special hardware to parse or execute Java bytecodes in any way.  Like the speaker pointed out, it’s plain faster and simpler to JIT.

 

Of course, hardware can still provide few features to speed up JVMs. Like hardware-assisted forwarding pointers which allow to create fast real-time compacting pauseless GC (I assume Azul hardware has this support).
 

No, our hardware assist is a read-barrier op.  The hardware detects when a GC invariant is being blown on a freshly loaded pointer – and traps to a software routine.  1 clk if nothing is wrong (by far, vastly far, the common case) and 4 clks to enter the trap routine if there’s an issue.  Software does all the fixup, including relocating objects or adjusting pointers or marking or whatever.  Generally fixup is done in <100 clks (but not always, rarely an object-copy is required). 

 

“What is Azul’s support for Stack Allocation?”

Azul’s support for stack allocation is really support for Escape Detection – fast detection of when an object is escaping a stack lifetime.  Sun has been trying to add a classic Escape Analysis (EA) to HotSpot for quite some time.  It’s not on by default now.  Research from IBM some years ago showed that it’s really really hard to make EA effective on large programs, although it works really well on lots of microbenchmarks.

 

“Do you know how Sun is implementing Tiered Compilation?”

I *think* Sun is heading for a cut-down C2 as their Tier 1 and dropping client compiler support; Tier 0 is still the interpreter and Tier 2 would be the full-blown C2.  Azul is using the C1/client JIT as our Tier 1.  We insert counters in the JIT’d C1 code and profile at “C1 speed”. 

 

“Do you do any self-modifying code?”

Self-modifying code has been the HotSpot norm for years, and appears to be common in managed runtimes in general.  There are a number of cases where we can “shape” the code but want to delay final decisions on e.g. call-targets until some thread *really* makes that call.  Having the call initially target the VM gives the VM a chance to intervene (do class loading & init, etc).  Then the machine call instruction is patched to point to the “full speed” target.  C1 does this kind of delayed patching more aggressively, and is willing to generate code for classes that are not loaded – so field offsets are not even known – filling in that detail after the class finally loads.  For all HotSpot JITs, call-targets come and go; classes unload; code gets re-profiled and re-JIT’d etc.  Any time a call target changes, the calls leading to it get patched.

 

“I always thought that a cas operation would be implemented by the memory controller. “

Every CAS that I’m aware of is implemented in the cache-coherency protocol, not in the memory controller.

Azul’s CAS is in general quite cheap: ours can ‘hit in L1 cache’ if it’s not contended.  If it hits-in-cache then it’s 3 clocks (just a read/modify/write cycle).  If it misses in cache then of course it costs whatever a cache-miss costs.

 

“Why is ref-counting hard to use in managing concurrent structures?”

The usual mistake is to put the count in the structure being managed.  The failing scenario is where one thread gets a pointer to the ref-counted structure at about the same time as the last owner is lowering the count and thus preparing to delete.  Timeline: T1 gets a ptr to the ref-counted structure. T2 has the last read-lock on the structure and is done with it.  T2 lowers the count to zero (T1 still holds a ptr to the structure but is stalled in the OS).  T2 frees the structure (T1’s pointer is now stale).  Some other thread calls malloc and gets the just-freed memory holding the structure.  T1 wakes up and increments where the ref-count used to be (but is now in some other thread’s memory).

The other bug-a-boo is that you need to either lock the counter or use atomic CAS instructions to modify it.
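
A sketch of the usual CAS discipline (mine, not NBHM code): readers may only bump the count if it is still non-zero, so a dead count never gets resurrected.  Note this only helps if the counter’s memory outlives the structure (hazard pointers, epochs, or a GC’d language):

  import java.util.concurrent.atomic.AtomicInteger;

  final class RefCounted {
    private final AtomicInteger refs = new AtomicInteger(1);  // creator holds 1 ref
    boolean tryAcquire() {                     // CAS up only from a non-zero count
      for( int c = refs.get(); c > 0; c = refs.get() )
        if( refs.compareAndSet(c, c+1) ) return true;
      return false;                            // count already hit 0: too late, redo the lookup
    }
    void release() {
      if( refs.decrementAndGet() == 0 ) {
        // last ref gone: only now is it safe to reclaim the structure
      }
    }
  }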

 

“Hi!  I just implemented (cool concurrent Java thing X) in C!  What do you think?”

A word of caution: C has no memory model.  Instead, it has ‘implementations’.  Implementations can differ and they ESPECIALLY differ around any kind of concurrent programming.  Different C compilers and different CPUs will vary wildly on what they do with racing reads & writes.  E.g. what works using gcc 4.2 on an x86 will probably fail miserably on an ARM, or even on an X86 using Intel’s reference compiler.  I personally wrote C/C++ optimizations for IBM’s Power series that would break the *Java* Memory Model, and any straightforward port of a concurrent Java program to C/C++ using those compilers would break subtly (only under high optimization levels and high work loads).

 

In short, you probably have an *implementation* of (cool concurrent thing X) in C that works for the compiler & hardware you are using – but recompiling with a different rev of the compiler or running on different hardware will require re-testing/re-verification.  Bottom line: you cannot rely on the language to give you reliable semantics. 

Yes, I am well aware of an ongoing effort to give C & C++ a useful memory model like Java’s.  However, to support backwards compatibility the proposed spec punts on a bunch of useful cases – e.g. for any program with data races “all bets are off”.  Many many Java programs work with benign races, and a port of these algorithms to C will fall into the “all bets are off” black hole.

 

“Re: Non-Blocking Hash Map – Why not regard a null value as deleted (as opposed to a special TOMBSTONE)? “

I’ve gone back and forth on this notion.  I think a ‘null’ for deleted works.  You’d need to support a wrapped/Primed null.  I’d need to run it through the model-checker to be sure but I think it all works.  Looking at the 2008 JavaOne slides, it would drop the number of states from 6 to 5.

“Re: Non-Blocking Hash Map – I don’t fully get why during resize you allow some threads to produce a new array, while you do make other threads sleep.  Why not limit to 1 producer?”

It’s one of those tricks you don’t ‘get’ until you try a NBHM under heavy load on large systems.  If you don’t allow multiple threads to produce a new array – then you aren’t NON-blocking, because the not-allowed threads are …, well, blocked.  If you allow 1000 cpus (yes, Azul Systems makes systems with nearly that many) to resize a 1Gig array (yes we have customers with NBHM this size) and they all do it at once – you get 1000 cpus making a request for a 2Gig array – or 2 Terabytes of simultaneous allocation requests.  Bringing the sizes down to something commonly available: if 16 cpus (think dual-socket Nehalem EX) request to resize a 100Meg array you just asked for 3.2Gigs of ram – which probably fails on a 32-bit process – despite the fact that in the end you only need 200Megs of array.

 

So instead I let a few threads try it – more than one, in case the 1st thread to get the ‘honors’ is pokey about it; generally 2 threads cut the odds of a bad context switch down into the noise – but if those few threads don’t Get The Job Done (e.g. allocate a new larger array) then every thread gets to try the resize itself – and thus no thread is ever really blocked.
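
In pseudo-Java the heuristic amounts to something like this (a paraphrase for illustration, not the actual NBHM source; the “let 2 threads go early” constant is the only tuning knob):

  import java.util.concurrent.atomic.AtomicInteger;
  import java.util.concurrent.atomic.AtomicReference;

  final class ResizeHelper {
    private final AtomicInteger resizers = new AtomicInteger();
    private final AtomicReference<Object[]> newTable = new AtomicReference<Object[]>();

    Object[] resize( int newSize ) {
      Object[] t = newTable.get();
      if( t != null ) return t;                // somebody already allocated it
      int r = resizers.incrementAndGet();
      if( r > 2 ) {                            // not one of the first couple threads...
        Thread.yield();                        // ...so give them a brief moment
        t = newTable.get();
        if( t != null ) return t;              // they Got The Job Done
      }
      Object[] mine = new Object[newSize];     // allocate anyway: nobody is ever blocked
      return newTable.compareAndSet(null, mine) ? mine : newTable.get();
    }
  }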

Cliff

Inline Caches and Call Site Optimization

Inline Caches solve a problem particular to Java – extremely frequent virtual calls.  C++ has virtual calls as well, but you have to ask for them (using the virtual keyword).  By default C++ calls are static.  The situation is reversed in Java: virtual calls are the default and you have to ask for static calls (with the final keyword).  However, even though most Java calls are declared as virtual, in practice very very few use the full virtual-call mechanism.  Instead, nearly all Java calls are as fast as C and C++ static calls (including being inlined when appropriate).  Here’s how it works:

At JIT’ing time, the JIT will first attempt some sort of analysis to determine the call target.  This works surprisingly well: a great many call targets can be determined with a straightforward inspection of the class hierarchy.  Note that this inspection happens at JIT-time (runtime), so it is only concerned with the classes loaded at the moment (contrast this to a normal C++ situation where all possible classes linked into the program have to be inspected).  Here’s a common situation:

——————————————————–
  abstract class Picture {
    abstract String foo();
  }
  class JPEG extends Picture {
    String foo() { …foo-ish stuff… }
  }
  abstract class PictureOnDisk extends Picture {
    abstract String foo();
  }

….
  void somecall( Picture pic ) {
    pic.foo();
  }

——————————————————–

When JIT’ing somecall(), what can we say about the call to foo()?  foo() is not declared final, so by default it’s a virtual call.  The type of pic is statically declared as the abstract class ‘Picture’ – but there are never any direct Picture objects (because the class is abstract!) so every ‘Picture’ is really some concrete subclass of ‘Picture’.  Loaded in the system right now there is only one such subclass: JPEG.  The abstract class ‘PictureOnDisk’ doesn’t count – there are no instances of PictureOnDisk and there are no declared concrete subclasses of PictureOnDisk.

So with a little Class Hierarchy Analysis (CHA), the JIT can determine that the only possible values for ‘pic’  are instances of class JPEG or null.  After a null-check, the virtual call to foo() can be optimized into a static call to JPEG.foo() and even inlined.  Note that this optimization works until another concrete subclass of Picture is loaded, e.g. when subclass JPEGOnDisk is loaded into the system, the JIT’d code for somecall() will have to be thrown away and regenerated.
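
In Java-level terms, the JIT’d somecall() now behaves roughly like this (a paraphrase of what the generated code does, reusing the classes above – not literal JIT output):

  void somecall( Picture pic ) {
    if( pic == null ) throw new NullPointerException();
    // CHA proves pic can only be a JPEG, so the JIT needs no type check here:
    // the call becomes a static call to JPEG.foo(), or JPEG.foo() is inlined outright.
    ((JPEG)pic).foo();
  }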

CHA is often successful, is cheap to run, and has a high payoff (allowing static calls or even inlining).  HotSpot (and Azul) have extended the basic algorithm several times to cover even more cases (large trees of abstract-classes with but a single concrete class in the middle will be optimized; so will many interface calls).

————————–

 

What happens when CHA “fails” (i.e. reports multiple targets)?

The most common answer is to use an “inline cache”.  An inline-cache is a classic 1-entry cache compiled directly into the code.  The Key is the expected class of the ‘this’ pointer and the Value is the method to call.  The Key test is typically done by loading the actual class out of the ‘this’ pointer (a header-word load), and then comparing it to some expected class.  For a 32-bit X86 system the test looks something like this:
  cmp4i [RDI+4],#expected_klass
  jne fail


For Azul’s TXU ops we can skip the load as the klass is encoded directly in the pointer, and we have an instruction to do the test&branch in one go:
  cmpclass R0,#expected_klass

The Value is encoded directly into the call instruction:
  call foos_JITed_code

The failure case needs the address of the Inline Cache, because the code needs to be modified (e.g. to fill in an empty cache).  The easy way to get the address is to do the test after the X86 call has pushed the return PC on the stack…. but the test needs to cache an expected class on a per-call-site basis so the expected class is inlined before the X86 call:

The X86 total sequence is:
  mov4i RAX,#expected_klass
  call foos_JITed_code
foos_JITed_code:
  cmp4i [RDI+4],RAX
  jne fail

Performance: The entire sequence is only 2 to 4 instructions long (HotSpot/X86 uses 4 ops plus alignment NO-OPs, 5 on Sparc, more for 64-bit pointers; Azul TXU uses 2 ops).  95% of (static) call sites that use an inline-cache never fail the cache check for the entire duration of the program.  The remaining 5% typically fail it repeatedly, i.e. the call site goes megamorphic.  For the 5% case we further patch the Inline Cache to call a stub that does a full virtual-call sequence.

Back to the 95% case: the IC costs a load/compare/branch.  The branch is entirely predictable.  The load has an unfortunate miss rate (it’s often the first use of an object’s header word), but an O-O-O X86 processor can issue past the miss and the predicted branch and start executing the called method.  These handful of extra ops represent the entire cost of 95% of the not-analyzable virtual-call sites.  Dynamically, nearly all calls (>99%) fall into the statically-analyzable or monomorphic-at-runtime camps.  Only a tiny handful at runtime actually take the full virtual-call path.

IC’s also work for interface calls for essentially the same reason: interface call-sites are also almost always single-klass at runtime and once you’ve correctly guessed the ‘this’ pointer you can compute the correct target of an interface call and cache it as easily as a virtual call.

————————-

What about using larger caches?  In the 5% case, does it help to have more Key/Value pairs in the cache?  Typically no: once a call-site fails to be monomorphic it’s almost always “megamorphic” – many many targets are being called.  Instead of 1 common target, there’s frequently 20 or more.

Can we do something with profiling and the JIT?  Indeed we can: if we have profiled the call site and at JIT-time discovered that one or two targets dominate, we can inline the inline-cache test with the JIT.  Inlining the test gives us more options: we can now inline on the expected-path and either slow-call on the unexpected-path or bail out to the interpreter (here I show the control-flow diamond inlined):
  cmp4i [RDX+4],#expected_klass
  jne   go_slow
  …foo-ish stuff…
post_call:
  …
go_slow:
  call full-on-virtual-call
  jmp  post_call

Bailing out to the interpreter has the further advantage that there’s no unknown-path being folded back into the main stream.  Once we know that RDX holds an instance of JPEG we know it for the rest of the compilation.

HotSpot also implements the 2-hot-targets in the JIT; the JIT inlines 2 guard tests and then does its normal optimizations for the 2 static calls (including separate inlining decisions for each call).
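
At the Java level, the profiled 2-hot-targets case amounts to something like this (PNG is a hypothetical second hot subclass invented for illustration, and the guards stand in for the inlined klass checks):

  class PNG extends Picture {
    String foo() { return "png"; }              // hypothetical 2nd hot subclass
  }

  String somecall2( Picture pic ) {
    if( pic.getClass() == JPEG.class )          // guard #1: JPEG.foo() gets inlined here
      return ((JPEG)pic).foo();
    if( pic.getClass() == PNG.class )           // guard #2: PNG.foo() gets inlined here
      return ((PNG)pic).foo();
    return pic.foo();                           // everything else: full v-call (or deopt)
  }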

————————-

Inline Caches have a ‘lifecycle’.  At Azul, we control them with a simple 5-state machine.  ICs start out life in the clean state.  Any thread executing a clean IC trampolines into the VM and the IC is patched to either the caching state or the static state.  The caching state is described above; the static state is reserved for times when the JIT did not determine a single target using CHA but the runtime can prove it.  This can happen e.g. when a class unloads during the JIT process (so multiple possible targets when the JIT begins but only a single possible target by the time the JIT finishes).  If a caching-state IC fails even once, HotSpot flips the IC into the v-call (or i-call for interfaces) state. 

Many call sites are mega-morphic during startup, but the interpreter (or C1 on Azul’s tiered VM) handles startup.  Also aggressive inlining by C2 means that many call sites are replicated and each site has a chance to cache a different expected klass.  The end result is that when a cached IC misses in the cache, that call site is nearly always going to go megamorphic and no reasonably sized cache is going to hold all the targets.  i.e., the Right Thing To Do is to patch the IC into the v-call state.

As a guard against program phase-changes where a v-call is being made where a cache would work, a background thread will periodically flip all v-call state ICs back to the clean state and give them a chance to start caching again.

Due to the requirement that ICs are always executable by racing other threads, patching ICs is tricky.  It turns out to be Not Too Hard to patch from the clean state to either static or caching, and from caching to either v-call or i-call.  But it is never safe to patch from caching, v-call or i-call back to the clean state – except when no mutator can see 1/2 of a patched IC (usually a full safepoint).  So the patch-to-clean step is done by GC threads as part of a normal GC safepoint.
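
In sketch form, the states and the “who may patch what while threads are racing” rules read like this (my summary in Java form of the rules above, not actual VM code):

  enum ICState { CLEAN, STATIC, CACHING, VCALL, ICALL }

  final class ICPatching {
    // Patches that are safe while other threads race through the IC:
    static boolean canPatchRacing( ICState from, ICState to ) {
      switch( from ) {
      case CLEAN:   return to == ICState.STATIC || to == ICState.CACHING;
      case CACHING: return to == ICState.VCALL  || to == ICState.ICALL;
      default:      return false;    // never patch back to CLEAN while mutators can see the IC
      }
    }
    // Patching back to CLEAN is done by GC threads at a full safepoint:
    static boolean canPatchAtSafepoint( ICState to ) { return to == ICState.CLEAN; }
  }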

 

And this blog exceeds my Daily Dose of geekiness, so I better go code for awhile…

Cliff

 

 

JIT’d code calling conventions, or “answering the plea for geekyness”

A blog reader writes:

> And here my heart palpitated a little when I saw there was a new Cliff Click blog entry.  Only to find it wasn’t full of obscure technical geekery, but a list of conferences.

 

With a plea like that, how can I refuse?    🙂

 

Here’s another random tidbit of Azul/HotSpot implementation details.

 

Performance of Java programs depends in some part on the JIT’d code (and some on GC and some on the JVM runtime, etc).  Performance of JIT’d code depends in part on calling conventions: how does JIT’d code call JIT’d code?  HotSpot’s philosophy has always been that JIT’d code calls other code in much the same way as C/C++ code calls other C/C++ code.  There is a calling convention (most arguments are passed in registers) and the actual ‘call‘ instruction directly calls from JIT’d code to JIT’d code.

 

Let’s briefly contrast this to some other implementation options: some VM’s arrange the JIT’d code to pass arguments in a canonical stack layout – where the stack layout matches what a purely interpreted system would do.  This allows JIT’d code to directly intercall with non-JIT’d (i.e. interpreted) code.  This makes the implementation much easier because you don’t have to JIT ALL the code (very slow and bulky; there’s a lot of run-once code in Java where even light-weight JIT’ing it is a waste of time).  However, passing all the arguments on the stack makes the hot/common case of JIT’d code calling JIT’d code pay a speed penalty.  Compiled C/C++ code doesn’t pay this price and neither do we.

 

How do we get the best of both worlds – hot-code calls hot-code with arguments in registers, but warm-code can call cold-code and have the cold-code run in the interpreter… and the interpreter is going to pass all arguments in a canonical stack layout (matching the Java Virtual Machine Spec in almost every detail, surprise, surprise)?  We do this with ‘frame adapters’ – short snippets of code which re-pack arguments to and from the stack and registers, then trampoline off to the correct handler (the JIT’d code or the interpreter).  Time for a hypothetical X86 example…

Suppose we have some hot Java code:
  static void foo(this,somePtr,4);

 

And the JIT happens to have the ‘this’ pointer in register RAX and the ‘somePtr’ value in register RBX.  Standard 64-bit X86 calling conventions require the first 3 arguments in registers RDI, RSI, and RDX.  The JIT produces this code:

  mov8  RDI,RAX  // move ‘this’ to RDI
  mov8  RSI,RBX  // move ‘somePtr’ to RSI
  mov8i RDX,#4   // move literal #4 to RDX
  call  foo.code // call the JIT’d code for ‘foo’

Alas method ‘foo’ is fairly cold (we must have come here from some low-frequency code path) and ‘foo’ is not JIT’d.  Instead, the interpreter is going to handle this call.  So where does the interpreter expect to find call arguments?  The interpreter has to run all possible calls with all possible calling signatures and arguments – so it wants an extremely generic solution.  All arguments will be passed on the JVM’s “Java Execution Stack” – see the JVM bytecode spec – but basically it’s a plain stack kept in memory somewhere.  For standard Sun HotSpot this stack is usually interleaved with the normal C-style control stack; for Azul Systems we hold the interpreter stack off to one side.  For implementation geeks: it’s a split-stack layout; both stacks grow towards each other from opposite directions, but the interpreter-side stack only grows when a new interpreted frame is called.  ASCII-gram stack layout:

+———+——————————————-+
| Thread  | Interpreter                    Normal “C” |
| Local   | Stack                          Stack      |
| Storage |   Grows–>                 <–Grows       |
| 32K     |                                           |
+———+——————————————-+

 

Another tidbit: the interpreter’s state (e.g. its stack-pointer or top-of-stack value) is kept in the Thread Local Storage area when the interpreter isn’t actively running; i.e. we do not reserve a register for the interpreter’s stack except when the interpreter is actively running.  Also, all our stacks are power-of-2 sized and aligned; we can get the base of Thread Local Storage by masking off the low bits of the normal “C/C++” stack pointer – on X86 we mask the RSP register.
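
The masking trick in Java terms (a tiny sketch; the 1-Megabyte region size is illustrative, chosen to match the mask in the adapter code below):

  final class TlsMask {
    static final long REGION_SIZE = 1L<<20;   // power-of-2 sized & aligned stack region (illustrative)
    // Clear the low bits of the stack pointer to land on the region's base:
    static long tlsBase( long rsp ) { return rsp & ~(REGION_SIZE-1); }
  }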

 

The interpreter expects all its incoming arguments on the interpreter-side stack, and will push a small fixed-size control frame on the normal “C” side stack.  But right now, before we actually start running the interpreter, the arguments are in registers – NOT the interpreter’s stack.  How do we get them there?  We make a ‘frame adapter’ to shuffle the arguments, and the ‘frame adapter’ will call into the interpreter.  And here’s the code:

  // frame adapter for signature (ptr,ptr,int)
  // First load up the interpreter’s top-of-stack
  //  from Thread Local Storage

  mov8  rax,rsp           // Copy RSP into RAX
  and8i rax,#~0xFFFFF     // Mask off the low bits: RAX = base of TLS
  ld8   rbx,[rax+#jexstk] // load Java Execution Stack
  // Now move args from RDI,RSI & RDX into JEX stack
  st8   [rbx+ 0],rdi
  st8   [rbx+ 8],rsi
  st8   [rbx+16],rdx
  add8i rbx,24  // Bump Java Execution stack pointer
  // Jump to the common interpreter entry point
  // RAX – base of thread-local storage
  // RBX – Java Execution Stack base
  // All args passed on the JEX stack
  jmp   #interpreter

Note that the structure of a ‘frame adapter’ only depends on the method’s calling signature.  We do indeed share ‘frame adapters’ based solely on signatures.  When running a very large Java app we typically see something on the order of 1000 unique signatures, and the adapter for each signature is generally a dozen instructions.  I.e., we’re talking maybe 50K of adapter code to run the largest Java programs; these programs will typically JIT 1000x more code (50Megs of JIT’d code).

 

We need one more bit of cleverness: the interpreter needs to know *which* method is being called.  JIT’d code “knows” which method is currently executing – because the program counter is unique per JIT’d method.  If we have a PC we can reverse it (via a simple table lookup) to the Java method that the code implements.  Not so for the interpreter; the interpreter runs all methods – and so the ‘method pointer’ is variable and kept in a register – and has to be passed to the interpreter when calling it.  Our ‘frame adapter’ above doesn’t include this information.  Where do we get it from?  We use the same trick that JIT’d code uses: a unique PC that ‘knows’ which method is being called.  We need 1 unique PC for each method that can be called from JIT’d code and will run interpreted (i.e. lots of them) so what we do per-PC is really small: we load the method pointer and jump to the right ‘frame adapter’:

  mov8i RCX,#method_pointer
  jmp   frame_adapter_for_(ptr,ptr,int)

And now we put it all together.  What instructions run when warm-code calls the cold-code for method ‘foo’?  First we’re running inside the JIT’d code, but the call instruction is patched to call our tiny  stub above:

// running inside JITd code about to call foo()
  mov8  RDI,RAX  // move ‘this’ to RDI

  mov8  RSI,RBX  // move ‘somePtr’ to RSI
  mov8i RDX,#4   // move literal #4 to RDX
  call  method_stub_for_foo
// now we run the tiny stub:
  mov8i RCX,#method_pointer
  jmp   frame_adapter_for_(ptr,ptr,int)
// now we run the frame adapter
  mov8  rax,rsp         
  and8i rax,#~0xFFFFF   
  ld8   rbx,[rax+#jexstk]
  st8   [rbx+ 0],rdi
  st8   [rbx+ 8],rsi
  st8   [rbx+16],rdx
  add8i rbx,24          
  // Jump to the common interpreter entry point
  // RAX – base of thread-local storage
  // RBX – Java Execution Stack base
  // RCX – method pointer
  // All args passed on the JEX stack
  jmp   #interpreter

Voila’!  In less than a dozen instructions any JIT’d call site can call into the interpreter with arguments where the interpreter expects them…. OR, crucially, call hot JIT’d code with arguments in registers where the JIT’d code expects them. 

And this is how Java’s actually-implemented calling convention matches compiled C code in speed, but allows for the flexibility of calling (cold, slow) non-JIT’d code.

Cliff