Un-Bear-able

More news from the Internet connected-tube-thingy:

 

Cool –

From: cmck….

I’m going to reference the blog on the landing page tomorrow. I know the readership will be more than pleased that we successfully poked the bear and got some insight from Cliff.

 

Also –

 

TheServerSide.com managed to raise the ire of Azul Systems’ Cliff Click, Jr., …

 

I’m not just a bear, I’m an irate bear!  

 

Just so’s you know – it takes a lot more than a casual question about “where’s my Java Optimized Hardware” to make me irate.  In fact, that’s a very good question – because the answer is not obvious (ok, it was obvious to me 15 years ago, but I was already both a serious compiler geek and an embedded systems guy then).  But it’s not-obvious enough that quite a few million $$$ have been spent trying to make a go of it.  Let me see if I can make the answer a little more bear-able:

 

Let’s compare a directly-executes-bytecodes CPU vs a classic RISC chip with a JIT.

 

The hardware guys like stuff simple – after all they deal with really hard problems like real physics (which is really analog except where it’s quantum) and electron-migration and power-vs-heat curves and etc… so the simpler the better.  Their plates are full already.  And if it’s simple, they can make it low power or fast (or gradually both) by adding complexity and ingenuity over time (at the hardware level).  If you compare the *spec* for a JVM, including all the bytecode behaviors, threading behaviors, GC, etc vs the *spec* for a classic RISC – you’ll see that the RISC is hugely simpler.  The bytecode spec is *complex*; hundreds of pages long.  So complex that we know that the hardware guys are going to have to bail out in lots of corner cases (what happens on a ‘new’ when the heap is exhausted?  does the hardware do a GC?).  The RISC chip *spec* has been made simple in a way which is known to allow it to be implemented fast (although that requires complexity), and we know we can JIT good code for it fairly easily.

 

When you compare the speed & power of a CPU executing bytecodes, you’ll see lots of hardware complexity around the basic execution issues (I’m skipping over lots of obvious examples, but here’s one: the stack layout sucks for wide-issue because of direct stack dependencies).  When you try to get the same job done using classic JIT’d RISC instructions the CPU is so much simpler – that it can be made better in lots of ways (faster, deep pipes, wide issue, lower power, etc).  Of course, you have to JIT first – but that’s obviously do-able with a compiler that itself runs on a RISC.

 

Now which is better (for the same silicon budget): JIT’ing+classic-RISC-executing or just plain execute-the-bytecodes?  Well… it all depends on the numbers.  For really short & small things, the JIT’ing loses so much that you’re better off just doing the bytecodes in hardware (but you can probably change source languages to something even more suited to lower power or smaller form).  But for anything cell-phone sized and up, JIT’ing bytecodes is both a power and speed win.  Yes you pay in time & power to JIT – but the resulting code runs so much faster that you get the job done sooner and can throttle the CPU down sooner – burning less overall power AND time.
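(Purely illustrative arithmetic, not measured data: if JIT’ing a hot method costs a millisecond of CPU time but the JIT’d code then runs several times faster than hardware-executed bytecodes, any method that accounts for more than a few milliseconds of total work pays the JIT back almost immediately – and every millisecond saved after that is a millisecond the CPU can spend throttled down.)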

 

Hence the best Java-optimized hardware is something that makes an easy JIT target.  After that Big Decision is made, you can further tweak the hardware to be closer to the language spec (which is what Azul did) or your intended target audience (large heap large thread Java apps, hence lots of 64-bit cores).  We also targeted another Java feature – GC – with read & write barrier hardware.  But we always start with an easy JIT target…

 

Cliff

 

Welcome to my new Blog Home

Welcome to my new Blog Home!

 

And for those of you who made it, this quick tidbit.

Everybody in the JVM business optimizes for long arraycopys/memcpy.

You can get peak bandwidth on machines with a decently long memcpy.

It’s easy enough to do and arraycopy and memcpy are called a *lot*.

But what about short arraycopies?  How often do we call System.arraycopy with small numbers?

 

In a run of JBB2000 (yah, the old one – I happen to have numbers handy for it), how many times a second is System.arraycopy called?  Yes, the answer obviously depends on your score.  Let’s assume your score is middlin’-high – say 1,000,000 BOPs/sec.

 

Did you guess 30 million times/sec?  That’s 30 calls to System.arraycopy per BOP on average.

Now, how many bytes are moved on average?   – 42.5

That’s less than an x86 cache-line.
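Run the arithmetic: 30 million copies/sec × 42.5 bytes/copy is only about 1.3 GB/sec of actual data movement – the raw bandwidth is trivial; it’s the 30 million per-call overheads (argument checks, dispatch, alignment fix-up) that you are really paying for.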

 

Getting a good score on JBB (and on many many benchmarks) depends on getting the overhead of short arraycopies reduced as much as possible.  Yes, in the end you need to do well on the rare 1Megabyte array copy… but it’s more important to copy those first few bytes with as little overhead as possible.

 

Cliff

 

 PS – We’re working on the RSS feed

 

Odds-n-Ends 2

A collection of recent questions & answers…

 

These first two comments/questions come from this TSS thread:

Java bytecode is NOT suited to be run on real hardware. It’s stack-based, so pipelining goes out of the window. In theory, one can do on-the-fly translation from stack-based to register-based machine, but it’ll require A LOT of transistors.  So in reality, it’s ALWAYS more effective to JIT-compile Java bytecode and then run it on a common CPU.

 

Azul’s CPUs are classic 3-address RISCs, with very few special-for-Java features.  Our addressing math matches 64-bit Java math exactly (pretty much nobody else’s does; this saves 1 op per array address computation).  We allow meta-data in the pointers which the LD/ST unit strips off.  We have read & write barriers in hardware (these last are really for GC and not Java).  We have a fast-inline-cache instruction.  We have a fast-not-contended-CAS for all those not-contended Java locks.

We do NOT have special hardware to parse or execute Java bytecodes in any way.  Like the speaker pointed out, it’s plain faster and simpler to JIT.

 

Of course, hardware can still provide few features to speed up JVMs. Like hardware-assisted forwarding pointers which allow to create fast real-time compacting pauseless GC (I assume Azul hardware has this support).
 

No, our hardware assist is a read-barrier op.  The hardware detects when a GC invariant is being blown on a freshly loaded pointer – and traps to a software routine.  1 clk if nothing is wrong (by far, vastly far, the common case) and 4 clks to enter the trap routine if there’s an issue.  Software does all the fixup, including relocating objects or adjusting pointers or marking or whatever.  Generally fixup is done in <100 clks (but not always, rarely an object-copy is required). 
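For folks who think in code rather than clocks, here’s a minimal software-level sketch of what that barrier does on every loaded reference – the helper names are made up for illustration; on the real hardware the check is a single op and only the fix-up path is software:

  // Hedged sketch only - illustrative names, not Azul's actual implementation.
  final class ReadBarrierSketch {
    static Object readBarrier(Object ref) {             // conceptually runs on every ref load
      if (ref != null && gcInvariantBlown(ref))         // rare: trap to software (~4 clks to enter)
        ref = fixup(ref);                               // relocate/remap/mark; usually <100 clks
      return ref;                                       // common case: 1 clk, fall straight through
    }
    static boolean gcInvariantBlown(Object ref) { return false; } // stand-in for the hardware check
    static Object  fixup(Object ref)            { return ref;   } // stand-in for the trap handler
  }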

 

“What is Azul’s support for Stack Allocation?”

Azul’s support for stack allocation is really support for Escape Detection – fast detection of when an object is escaping a stack lifetime.  Sun has been trying to add a classic Escape Analysis (EA) to HotSpot for quite some time.  It’s not on by default now.  Research from IBM some years ago showed that it’s really, really hard to make EA effective on large programs, although it works really well on lots of microbenchmarks.
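For the curious, a tiny (hypothetical) example of the decision Escape Detection/Analysis is making:

  final class EscapeExample {
    static double noEscape(double x, double y) {
      double[] p = { x, y };              // never leaves this frame: a stack-allocation candidate
      return Math.sqrt(p[0]*p[0] + p[1]*p[1]);
    }
    static double[] escapes(double x, double y) {
      double[] p = { x, y };
      return p;                           // escapes via the return value: must be heap-allocated
    }
  }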

 

“Do you know how Sun is implementing Tiered Compilation?”

I *think* Sun is heading for a cut-down C2 as their Tier 1 and dropping client compiler support; Tier 0 is still the interpreter and Tier 2 would be the full-blown C2.  Azul is using the C1/client JIT as our Tier 1.  We insert counters in the JIT’d C1 code and profile at “C1 speed”.
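The “counters in the JIT’d C1 code” idea boils down to something like this sketch – the names and the threshold are illustrative, not HotSpot’s or Azul’s actual code:

  import java.util.concurrent.atomic.AtomicInteger;

  final class TieredCounterSketch {
    static final int PROMOTE_THRESHOLD = 10_000;        // illustrative: recompile once this hot
    private final AtomicInteger invocations = new AtomicInteger();
    private final Runnable queueForTier2Compile;        // stand-in for the compile-task queue

    TieredCounterSketch(Runnable queueForTier2Compile) {
      this.queueForTier2Compile = queueForTier2Compile;
    }

    void onMethodEntry() {                              // Tier-1 code bumps this counter on entry
      if (invocations.incrementAndGet() == PROMOTE_THRESHOLD)
        queueForTier2Compile.run();                     // hand the hot method to the Tier-2 JIT
    }
  }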

 

“Do you do any self-modifying code?”

Self-modifying code has been the HotSpot norm for years, and appears to be common in managed runtimes in general.  There are a number of cases where we can “shape” the code but want to delay final decisions on e.g. call-targets until some thread *really* makes that call.  Having the call initially target the VM gives the VM a chance to intervene (do class loading & init, etc).  Then the machine call instruction is patched to point to the “full speed” target.  C1 does this kind of delayed patching more aggressively, and is willing to generate code for classes that are not yet loaded – so field offsets are not even known, filling in that detail after the class finally loads.  For all HotSpot JITs, call-targets come and go; classes unload; code gets re-profiled and re-JIT’d, etc.  Any time a call target changes, the calls leading to it get patched.

 

“I always thought that a cas operation would be implemented by the memory controller. “

Every CAS that I’m aware of is implemented in the cache-coherency protocol, not in the memory controller.

Azul’s CAS is in general quite cheap: ours can ‘hit in L1 cache’ if it’s not contended.  If it hits-in-cache then it’s 3 clocks (just a read/modify/write cycle).  If it misses in cache then of course it costs whatever a cache-miss costs.
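From the Java side, the uncontended case is just the ordinary CAS retry loop – a minimal example, nothing Azul-specific:

  import java.util.concurrent.atomic.AtomicLong;

  final class CasExample {
    final AtomicLong counter = new AtomicLong();
    long increment() {                                  // classic read/modify/write via CAS
      long v;
      do { v = counter.get(); } while (!counter.compareAndSet(v, v + 1));
      return v + 1;                                     // cheap when the line already sits in L1
    }
  }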

 

“Why is ref-counting hard to use in managing concurrent structures?”

The usual mistake is to put the count in the structure being managed.  The failing scenario is where one thread gets a pointer to the ref-counted structure at about the same time as the last owner is lowering the count and thus preparing to delete.  Timeline: T1 gets a ptr to the ref-counted structure.  T2 has the last read-lock on the structure and is done with it.  T2 lowers the count to zero (T1 still holds a ptr to the structure but is stalled in the OS).  T2 frees the structure (T1’s pointer is now stale).  Some other thread calls malloc and gets the just-freed memory holding the structure.  T1 wakes up and increments where the ref-count used to be (but that memory now belongs to some other thread).

The other bug-a-boo is that you need to either lock the counter or use atomic CAS instructions to modify it.

 

“Hi!  I just implemented (cool concurrent Java thing X) in C!  What do you think?”

A word of caution: C has no memory model.  Instead, it has ‘implementations’.  Implementations can differ, and they ESPECIALLY differ around any kind of concurrent programming.  Different C compilers and different CPUs will vary wildly in what they do with racing reads & writes.  e.g. what works using gcc 4.2 on an x86 will probably fail miserably on an ARM, or even on an X86 using Intel’s reference compiler.  I personally wrote C/C++ optimizations for IBM’s Power series that would break the *Java* Memory Model, and any straightforward port of a concurrent Java program to C/C++ using those compilers would break subtly (only under high optimization levels and high work loads).

 

In short, you probably have an *implementation* of (cool concurrent thing X) in C that works for the compiler & hardware you are using – but recompiling with a different rev of the compiler or running on different hardware will require re-testing/re-verification.  The bottom line: you cannot rely on the language to give you reliable semantics.

Yes, I am well aware of an ongoing effort to give C & C++ a useful memory model like Java’s.  However, to support backwards compatibility the proposed spec is broken in a bunch of ways – e.g. for any program with data races “all bets are off”.  Many, many Java programs work with benign races, and a port of those algorithms to C will fall into the “all bets are off” black hole.
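A concrete example of such a benign race – this is essentially how java.lang.String caches its hash code; it is correct under the Java Memory Model but a formal data race in C/C++ (a sketch, not the JDK source):

  final class BenignRace {
    private final byte[] data;
    private int hash;                                   // 0 means "not computed yet"
    BenignRace(byte[] data) { this.data = data; }
    @Override public int hashCode() {
      int h = hash;
      if (h == 0) {                                     // racing threads may all get here...
        for (byte b : data) h = 31 * h + b;             // ...but they all compute the same value
        hash = h;                                       // unsynchronized publish: benign in Java
      }
      return h;
    }
  }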

 

“Re: Non-Blocking Hash Map – Why not regard a null value as deleted (as opposed to a special TOMBSTONE)? “

I’ve gone back and forth on this notion.  I think a ‘null’ for deleted works.  You’d need to support a wrapped/Primed null.  I’d need to run it through the model-checker to be sure but I think it all works.  Looking at the 2008 JavaOne slides, it would drop the number of states from 6 to 5.

“Re: Non-Blocking Hash Map – I don’t fully get why during resize you allow some threads to produce a new array, while you do make other threads sleep.  Why not limit to 1 producer?”

It’s one of those tricks you don’t ‘get’ until you try a NBHM under heavy load on large systems.  If you don’t allow multiple threads to produce a new array – then you aren’t NON-blocking, because the not-allowed threads are …, well, blocked.  If you allow 1000 cpus (yes, Azul Systems makes systems with nearly that many) to resize a 1Gig array (yes we have customers with NBHM this size) and they all do it at once – you get 1000 cpus making a request for a 2Gig array – or 2 Terabytes of simultaneous allocation requests.  Bringing the sizes down to something commonly available: if 16 cpus (think dual-socket Nehalem EX) request to resize a 100Meg array you just asked for 3.2Gigs of ram – which probably fails on a 32-bit process – despite the fact that in the end you only need 200Megs of array.

 

So instead I let a few threads try it – more than one, in case the 1st thread to get the ‘honors’ is pokey about it; generally 2 threads cut the odds of a bad context switch down into the noise – but if those few threads don’t Get The Job Done (e.g. allocate a new larger array) then every thread can try the resize itself – and thus no thread is ever really blocked.
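Here’s a minimal sketch of that heuristic (NOT the real NonBlockingHashMap code – the names, the producer limit, and the spin bound are all illustrative):

  import java.util.concurrent.atomic.AtomicInteger;
  import java.util.concurrent.atomic.AtomicReference;

  final class ResizeSketch {
    private final AtomicReference<Object[]> next = new AtomicReference<>();
    private final AtomicInteger producers = new AtomicInteger();

    Object[] newArray(int newLen) {
      Object[] nx = next.get();
      if (nx != null) return nx;                        // someone already allocated it
      int me = producers.getAndIncrement();             // count would-be producers
      if (me >= 2) {                                    // not one of the first few: wait politely...
        for (int spin = 0; spin < 1000 && (nx = next.get()) == null; spin++)
          Thread.yield();
        if (nx != null) return nx;                      // ...but only for a bounded time
      }
      nx = new Object[newLen];                          // possibly a redundant allocation
      return next.compareAndSet(null, nx) ? nx : next.get();
    }
  }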

Cliff

Inline Caches and Call Site Optimization

Inline Caches solve a problem particular to Java – extremely frequent virtual calls.  C++ has virtual calls as well, but you have to ask for them (using the virtual keyword).  By default C++ calls are static.  The situation is reversed in Java: virtual calls are the default and you have to ask for static calls (with the final keyword).  However, even though most Java calls are declared as virtual, in practice very very few use the full virtual-call mechanism.  Instead, nearly all Java calls are as fast as C and C++ static calls (including being inlined when appropriate).  Here’s how it works:

At JIT’ing time, the JIT will first attempt some sort of analysis to determine the call target.  This works surprisingly well: a great many call targets can be determined with a straightforward inspection of the class hierarchy.  Note that this inspection happens at JIT-time (runtime), so it is only concerned with the classes loaded at the moment (contrast this to a normal C++ situation where all possible classes linked into the program have to be inspected).  Here’s a common situation:

——————————————————–
  abstract class Picture {
    abstract String foo();
  }
  class JPEG extends Picture {
    String foo() { …foo-ish stuff… }
  }
  abstract class PictureOnDisk extends Picture {
    abstract String foo();
  }

….
  void somecall( Picture pic ) {
    pic.foo();
  }

——————————————————–

When JIT’ing somecall(), what can we say about the call to foo()?  foo() is not declared final, so by default it’s a virtual call.  The type of pic is statically declared as the abstract class ‘Picture’ – but there are never any direct Picture objects (because the class is abstract!) so every ‘Picture’ is really some concrete subclass of ‘Picture’.  Loaded in the system right now there is only one such subclass: JPEG.  The abstract class ‘PictureOnDisk’ doesn’t count – there are no instances of PictureOnDisk and there are no declared concrete subclasses of PictureOnDisk.

So with a little Class Hierarchy Analysis (CHA), the JIT can determine that the only possible values for ‘pic’  are instances of class JPEG or null.  After a null-check, the virtual call to foo() can be optimized into a static call to JPEG.foo() and even inlined.  Note that this optimization works until another concrete subclass of Picture is loaded, e.g. when subclass JPEGOnDisk is loaded into the system, the JIT’d code for somecall() will have to be thrown away and regenerated.
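For concreteness, the kind of class load that invalidates the CHA result above would look something like this (the class name comes from the text; the body is hypothetical, given the hierarchy in the earlier example):

  class JPEGOnDisk extends PictureOnDisk {
    String foo() { return "jpeg, but on disk"; }   // a second concrete implementor of foo()
  }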

CHA is often successful, is cheap to run, and has a high payoff (allowing static calls or even inlining).  HotSpot (and Azul) have extended the basic algorithm several times to cover even more cases (large trees of abstract-classes with but a single concrete class in the middle will be optimized; so will many interface calls).

————————–

 

What happens when CHA “fails” (i.e. reports multiple targets)?

The most common answer is to use an “inline cache”.  An inline-cache is a classic 1-entry cache compiled directly into the code.  The Key is the expected class of the ‘this’ pointer and the Value is the method to call.  The Key test is typically done by loading the actual class out of the ‘this’ pointer (a header-word load), and then comparing it to some expected class.  For a 32-bit X86 system the test looks something like this:
  cmp4i [RDI+4],#expected_klass
  jne fail


For Azul’s TXU ops we can skip the load as the klass is encoded directly in the pointer, and we have an instruction to do the test&branch in one go:
  cmpclass R0,#expected_klass

The Value is encoded directly into the call instruction:
  call foos_JITed_code

The failure case needs the address of the Inline Cache, because the code needs to be modified (e.g. to fill in an empty cache).  The easy way to get the address is to do the test after the X86 call has pushed the return PC on the stack…. but the test needs to cache an expected class on a per-call-site basis so the expected class is inlined before the X86 call:

The X86 total sequence is:
  mov4i RAX,#expected_klass
  call foos_JITed_code
foos_JITed_code:
  cmp4i [RDI+4],RAX
  jne fail

Performance: The entire sequence is only 2 to 4 instructions long (HotSpot/X86 uses 4 ops plus alignment NO-OPs, 5 on Sparc, more for 64-bit pointers; Azul TXU uses 2 ops).  95% of (static) call sites that use an inline-cache never fail the cache check for the entire duration of the program.  The remaining 5% typically fail it repeatedly, i.e. the call site goes megamorphic.  For the 5% case we further patch the Inline Cache to call a stub to do a full virtual-call sequence.

Back to the 95% case: the IC costs a load/compare/branch.  The branch is entirely predictable.  The load has an unfortunate miss rate (it’s often the first use of an object’s header word), but an O-O-O X86 processor can issue past the miss and the predicted branch and start executing the called method.  This handful of extra ops represents the entire cost of 95% of the not-analyzable virtual call sites.  Dynamically, nearly all calls (>99%) fall into the statically-analyzable or monomorphic-at-runtime camps.  Only a tiny handful at runtime actually take the full virtual-call path.

IC’s also work for interface calls for essentially the same reason: interface call-sites are also almost always single-klass at runtime and once you’ve correctly guessed the ‘this’ pointer you can compute the correct target of an interface call and cache it as easily as a virtual call.

————————-

What about using larger caches?  In the 5% case, does it help to have more Key/Value pairs in the cache?  Typically no: once a call-site fails to be monomorphic it’s almost always “megamorphic” – many many targets are being called.  Instead of 1 common target, there’s frequently 20 or more.

Can we do something with profiling and the JIT?  Indeed we can: if we have profiled the call site and at JIT-time discovered that one or two targets dominate, we can inline the inline-cache test with the JIT.  Inlining the test gives us more options: we can now inline on the expected-path and either slow-call on the unexpected-path or bail out to the interpreter (here I show the control-flow diamond inlined):
  cmp4i [RDX+4],#expected_klass
  jne   go_slow
  …foo-ish stuff…
post_call:
  …
go_slow:
  call full-on-virtual-call
  jmp  post_call

Bailing out to the interpreter has the further advantage that there’s no unknown-path being folded back into the main stream.  Once we know that RDX holds an instance of JPEG we know it for the rest of the compilation.

HotSpot also implements the 2-hot-targets in the JIT; the JIT inlines 2 guard tests and then does its normal optimizations for the 2 static calls (including separate inlining decisions for each call).

————————-

Inline Caches have a ‘lifecycle’.  At Azul, we control them with a simple 5-state machine.  ICs start out life in the clean state.  Any thread executing a clean IC trampolines into the VM and the IC is patched to either the caching state or the static state.  The caching state is described above; the static state is reserved for times when the JIT did not determine a single target using CHA but the runtime can prove it.  This can happen e.g. when a class unloads during the JIT process (so there are multiple possible targets when the JIT begins but only a single possible target by the time the JIT finishes).  If the cache check fails even once, HotSpot flips the IC into the v-call (or i-call for interfaces) state.

Many call sites are mega-morphic during startup, but the interpreter (or C1 on Azul’s tiered VM) handles startup.  Also aggressive inlining by C2 means that many call sites are replicated and each site has a chance to cache a different expected klass.  The end result is that when a cached IC misses in the cache, that call site is nearly always going to go megamorphic and no reasonably sized cache is going to hold all the targets.  i.e., the Right Thing To Do is to patch the IC into the v-call state.

As a guard against program phase-changes where a v-call is being made where a cache would work, a background thread will periodically flip all v-call state ICs back to the clean state and give them a chance to start caching again.

Due to the requirement that ICs are always executable by other racing threads, patching ICs is tricky.  It turns out to be Not Too Hard to patch from the clean state to either static or caching, and from caching to either v-call or i-call.  But it is never safe to patch from caching, v-call or i-call back to the clean state – except when no mutator can see a half-patched IC (usually a full safepoint).  So the patch-to-clean step is done by GC threads as part of a normal GC safepoint.
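Summarizing the lifecycle as a (hedged, schematic) state machine – the state names follow the prose above, and the comments list the transitions the text allows:

  enum ICState { CLEAN, STATIC, CACHING, VCALL, ICALL }
  // CLEAN               -> STATIC or CACHING   (first execution trampolines into the VM and patches)
  // CACHING             -> VCALL or ICALL      (a single cache miss: the site is treated as megamorphic)
  // CACHING/VCALL/ICALL -> CLEAN               (only by GC threads at a safepoint, for phase changes)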

 

And this blog exceeds my Daily Dose of geekiness, so I’d better go code for awhile…

Cliff

 

 

JIT’d code calling conventions, or “answering the plea for geekyness”

A blog reader writes:

> And here my heart palpitated a little when I saw there was a new Cliff Click blog entry.  Only to find it wasn’t full of obscure technical geekery, but a list of conferences.

 

With a plea like that, how can I refuse?    🙂

 

Here’s another random tidbit of Azul/HotSpot implementation details.

 

Performance of Java programs depends in some part on the JIT’d code (and some on GC and some on the JVM runtime, etc).  Performance of JIT’d code depends in part on calling conventions: how does JIT’d code call JIT’d code?  HotSpot’s philosophy has always been that JIT’d code calls other code in much the same way as C/C++ code calls other C/C++ code.  There is a calling convention (most arguments are passed in registers) and the actual ‘call‘ instruction directly calls from JIT’d code to JIT’d code.

 

Let’s briefly contrast this to some other implementation options: some VMs arrange for JIT’d code to pass arguments in a canonical stack layout – where the stack layout matches what a purely interpreted system would do.  This allows JIT’d code to directly intercall with non-JIT’d (i.e. interpreted) code.  This makes the implementation much easier because you don’t have to JIT ALL the code (very slow and bulky; there’s a lot of run-once code in Java where even light-weight JIT’ing is a waste of time).  However, passing all the arguments on the stack makes the hot/common case of JIT’d code calling JIT’d code pay a speed penalty.  Compiled C/C++ code doesn’t pay this price and neither do we.

 

How do we get the best of both worlds – hot-code calls hot-code with arguments in registers, but warm-code can call cold-code and have the cold-code run in the interpreter… and the interpreter is going to pass all arguments in a canonical stack layout (matching the Java Virtual Machine Spec in almost every detail, surprise, surprise)?  We do this with ‘frame adapters’ – short snippets of code which re-pack arguments to and from the stack and registers, then trampoline off to the correct handler (the JIT’d code or the interpreter).  Time for a hypothetical X86 example…

Suppose we have some hot Java code:
  foo( this, somePtr, 4 );   // a call to ‘static void foo(Object,Object,int)’

 

And the JIT happens to have the ‘this’ pointer in register RAX and the ‘somePtr’ value in register RBX.  Standard 64-bit X86 calling conventions require the first 3 arguments in registers RDI, RSI, and RDX.  The JIT produces this code:

  mov8  RDI,RAX  // move ‘this’ to RDI
  mov8  RSI,RBX  // move ‘somePtr’ to RSI
  mov8i RDX,#4   // move literal #4 to RDX
  call  foo.code // call the JIT’d code for ‘foo’

Alas method ‘foo’ is fairly cold (we must have come here from some low-frequency code path) and ‘foo’ is not JIT’d.  Instead, the interpreter is going to handle this call.  So where does the interpreter expect to find call arguments?  The interpreter has to run all possible calls with all possible calling signatures and arguments – so it wants an extremely generic solution.  All arguments will be passed on the JVM’s “Java Execution Stack” – see the JVM bytecode spec – but basically it’s a plain stack kept in memory somewhere.  For standard Sun HotSpot this stack is usually interleaved with the normal C-style control stack; for Azul Systems we hold the interpreter stack off to one side.  For implementation geeks: it’s a split-stack layout; both stacks grow towards each other from opposite directions, but the interpreter-side stack only grows when a new interpreted frame is called.  ASCII-gram stack layout:

+---------+--------------------------------------------+
| Thread  | Interpreter                     Normal "C" |
| Local   | Stack                           Stack      |
| Storage |   Grows-->                      <--Grows   |
| 32K     |                                            |
+---------+--------------------------------------------+

 

Another tidbit: the interpreter’s state (e.g. its stack-pointer or top-of-stack value) is kept in the Thread Local Storage area when the interpreter isn’t actively running; i.e. we do not reserve a register for the interpreter’s stack except when the interpreter is actively running.  Also, all our stacks are power-of-2 sized and aligned; we can get the base of Thread Local Storage by masking off the low bits of the normal “C/C++” stack pointer – on X86 we mask the RSP register.

 

The interpreter expects all its incoming arguments on the interpreter-side stack, and will push a small fixed-size control frame on the normal “C” side stack.  But right now, before we actually start running the interpreter, the arguments are in registers – NOT on the interpreter’s stack.  How do we get them there?  We make a ‘frame adapter’ to shuffle the arguments, and the ‘frame adapter’ will call into the interpreter.  And here’s the code:

  // frame adapter for signature (ptr,ptr,int)
  // First load up the interpreter’s top-of-stack
  //  from Thread Local Storage

  mov8  rax,rsp           // Copy RSP into RAX
  and8i rax,#~0xFFFFF     // Mask off low bits: RAX = base of TLS
  ld8   rbx,[rax+#jexstk] // load Java Execution Stack
  // Now move args from RDI,RSI & RDX into JEX stack
  st8   [rbx+ 0],rdi
  st8   [rbx+ 8],rsi
  st8   [rbx+16],rdx
  add8i rbx,24  // Bump Java Execution stack pointer
  // Jump to the common interpreter entry point
  // RAX – base of thread-local storage
  // RBX – Java Execution Stack base
  // All args passed on the JEX stack
  jmp   #interpreter

Note that the structure of a ‘frame adapter’ only depends on the method’s calling signature.  We do indeed share ‘frame adapters’ based solely on signatures.  When running a very large Java app we typically see something on the order of 1000 unique signatures, and the adapter for each signature is generally a dozen instructions.  I.e., we’re talking maybe 50K of frame-adapter code to run the largest Java programs; these programs will typically JIT 1000x more code (50Megs of JIT’d code).

 

We need one more bit of cleverness: the interpreter needs to know *which* method is being called.  JIT’d code “knows” which method is currently executing – because the program counter is unique per JIT’d method.  If we have a PC we can reverse it (via a simple table lookup) to the Java method that the code implements.  Not so for the interpreter; the interpreter runs all methods – and so the ‘method pointer’ is variable and kept in a register – and has to be passed to the interpreter when calling it.  Our ‘frame adapter’ above doesn’t include this information.  Where do we get it from?  We use the same trick that JIT’d code uses: a unique PC that ‘knows’ which method is being called.  We need 1 unique PC for each method that can be called from JIT’d code and will run interpreted (i.e. lots of them) so what we do per-PC is really small: we load the method pointer and jump to the right ‘frame adapter’:

  mov8i RCX,#method_pointer
  jmp   frame_adapter_for_(ptr,ptr,int)

And now we put it all together.  What instructions run when warm-code calls the cold-code for method ‘foo’?  First we’re running inside the JIT’d code, but the call instruction is patched to call our tiny  stub above:

// running inside JITd code about to call foo()
  mov8  RDI,RAX  // move ‘this’ to RDI

  mov8  RSI,RBX  // move ‘somePtr’ to RSI
  mov8i RDX,#4   // move literal #4 to RDX
  call  method_stub_for_foo
// now we run the tiny stub:
  mov8i RCX,#method_pointer
  jmp   frame_adapter_for_(ptr,ptr,int)
// now we run the frame adapter
  mov8  rax,rsp         
  and8i rax,#~0xFFFFF
  ld8   rbx,[rax+#jexstk]
  st8   [rbx+ 0],rdi
  st8   [rbx+ 8],rsi
  st8   [rbx+16],rdx
  add8i rbx,24          
  // Jump to the common interpreter entry point
  // RAX – base of thread-local storage
  // RBX – Java Execution Stack base
  // RCX – method pointer
  // All args passed on the JEX stack
  jmp   #interpreter

Voilà!  In less than a dozen instructions any JIT’d call site can call into the interpreter with arguments where the interpreter expects them… OR, crucially, call hot JIT’d code with arguments in registers where the JIT’d code expects them.

And this is how Java’s actually-implemented calling convention matches compiled C code in speed, but allows for the flexibility of calling (cold, slow) non-JIT’d code.

Cliff

 

Conference Season!

Ugh; I’ve got too many conferences I’ve been invited to – including several new ones this year.  Here’s the quick rundown (so far! I’ve got a few more pending, including OOPSLA and SPLASH).

Transact 2010
April 13th – http://www-ali.cs.umass.edu/~moss//transact-2010/
On the PC only, so my responsibilities are over for this one!
No time for the trip to Paris.   🙁

Transactional Memory Workshop 2010
April 30th – http://www.cs.purdue.edu/tmw2010/Welcome.html
Slides – Coming as soon as I can arrange

ISMM
June 5-6 – http://www.cs.purdue.edu/ISMM10/
Basically a really awesome GC conference.  On the PC only, but planning on attending.  Co-located with PLDI.

PLDI
June 6-10 – http://www.cs.stanford.edu/pldi10/
Premier conference on “Programming Languages, Design and Implementation”, i.e. how to make languages like Java work.  On the PC again, so there are some really good papers in there.   🙂

Uber Conf
June 14-17 – http://uberconf.com/conference/denver/2010/06/home
An industry conference instead of an academic one.  I’m giving a slew of talks.

JavaOne 2010
September 19-23 – http://www.oracle.com/us/javaonedevelop/index.html
Ok, I’ve submitted talks but it’s too soon to see if I’m a speaker.  I am curious to see how Oracle handles JavaOne.  Could be good, could be great, could be … not so good.  One thing I don’t miss about the old JavaOne is paying $2000 for a plain ham sandwich box lunch in the cafe.  Oracle could simply upgrade the food option (and keep all else the same).

JAOO 2010
October 3-8 – http://jaoo.dk/aarhus-2010/

An All-Expense-Paid trip to Denmark!  Which exact talk I give is in flux, but likely I’ll be able to finally talk about Azul Systems’ newest product!

 

(You were perhaps looking for something technical in a Cliff Click blog?  Next time, I promise!  Right now I’m swamped working on next-gen product… random Star Wars quote: “stay on target…”)

 

Cliff