Touching Base

It’s been a while since I blogged, so I thought I’d touch base with people to let them know what’s been going on. Azul Systems has been hard at work improving our JVM. This is a bigger statement than it sounds – there are not many groups with a large enough ‘quorum’ of JVM engineers to do large-scale changes to the HotSpot JVM. Azul has nearly a dozen engineers doing core HotSpot work (not counting JDK work or QA folks – counting only core JVM engineers)! We’ve been doing large-scale changes to HotSpot for nearly 8 years now. Our HotSpot improves on Sun’s standard HotSpot and the OpenJDK in a large number of ways, some more visible and some less so.

Some of the more obvious stuff we’ve got working:

  • A new complete replacement GC: Generational Pauseless GC (and the older Pauseless GC paper is here). This is one of our core strengths. GPGC handles heaps from 60 Megabytes to 600 Gigabytes and allocation rates from 4 Megabytes/sec to 40 Gigabytes/sec, with MAX pause-times consistently down in the 10-20 msec range. GPGC requires read barriers, and this means instrumenting every read from the garbage-collected heap. Instrumenting the JIT’d reads is easy: we altered the JITs long ago to emit the needed instructions. Instrumenting the VM itself is a bigger job; every time we integrate a new source drop from Sun we have to find all the new heap-reads Sun has inserted into their new C++ code (HotSpot itself is a large, complex C++ program) and add read-barriers to them.
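
    To make this concrete, here is a minimal C++ sketch of what a read barrier boils down to – an invented illustration, not our actual implementation. The TRAP_BIT flag and the gc_fixup() helper are both hypothetical:

    ```cpp
    #include <cstdint>

    // Hypothetical metadata bit the collector sets on references that
    // need attention before they may be used.
    constexpr uintptr_t TRAP_BIT = 1ULL << 63;

    // Invented slow path: "heals" the reference and the memory word it
    // was loaded from; a real collector would forward the reference to
    // the object's new location.
    void* gc_fixup(void** addr, void* ref) {
      void* healed = (void*)((uintptr_t)ref & ~TRAP_BIT);
      *addr = healed;    // repair the heap word so later reads go fast
      return healed;
    }

    // Every heap read funnels through a check like this.  The JITs emit
    // the equivalent instructions inline; the VM's own C++ code needs
    // the check hand-inserted at every heap read.
    inline void* read_barrier(void** addr) {
      void* ref = *addr;                    // the actual heap load
      if ((uintptr_t)ref & TRAP_BIT)        // rare: ref needs GC attention
        ref = gc_fixup(addr, ref);          // slow path: forward & heal
      return ref;                           // common path: 1 load + 1 test
    }
    ```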
     
  • Real Time Performance Monitoring – RTPM. This is our high-resolution always-on no-overhead integrated profiling tool and is our 2nd major selling point. Because it’s no-overhead (literally less than 1%; it’s very hard to measure the overhead) we leave it always on. This means you can look at a JVM that’s been up in production for a week or a month and introspect it. It’s *common* for a 1-hour session with RTPM to answer performance questions that have plagued production systems for years, or to have people walk away with 10-line fixes worth 30% speedups. It’s as if you’ve been blind to what your JVM has been doing and suddenly your eyes are opened. Live stack traces, heap contents, leaks, hot-locks with contending stack traces, profiled JIT’d assembly, I/O bottlenecks, GC issues, etc, etc. See the link for a demo.
     
  • Virtualized JVM – We can take pretty much any old server, install a new JDK, change JAVA_HOME to the new JDK and re-launch the application… and it now runs on Azul’s JVM backed by an Azul appliance. No hardware change and no OS change. This is a great solution for in-place speedups of older gear. More recently, of course, we’ve been hard at work porting our JVM to our new hardware platform. This work is going well; look for more discussion here as we have things to announce!

Here’s some of the LESS obvious stuff we have working:

  • Tiered Compilation. Although Sun has shipped “-client” and “-server” configurations for years, they never integrated these two JITs into a single system. Most other JVMs have had a tiered compilation configuration for years, and Azul brought it to HotSpot a few years ago. We consistently see a roughly 15% speed improvement over a plain “-server” configuration. We use the “-client” JIT (also known internally as C1) to do fast high-resolution profiling; this high-quality profile information allows the “-server” JIT (C2) to do a much better job of inlining and compiling.
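
    For a feel of the shape of tiering, here is a toy C++ sketch of a promotion policy. The single counter, the thresholds, and all the names are invented for illustration; the real policy is considerably more nuanced:

    ```cpp
    #include <cstdint>

    enum class Tier { Interpreted, C1_Profiled, C2_Optimized };

    struct Method {
      Tier     tier    = Tier::Interpreted;
      uint32_t invokes = 0;    // bumped on entry by interpreter/C1 code
    };

    constexpr uint32_t C1_THRESHOLD = 200;    // warm: compile fast, profile
    constexpr uint32_t C2_THRESHOLD = 10000;  // hot: profile is mature

    // Called on method entry; decides when a method changes tiers.
    void maybe_promote(Method& m) {
      ++m.invokes;
      if (m.tier == Tier::Interpreted && m.invokes >= C1_THRESHOLD) {
        m.tier = Tier::C1_Profiled;   // cheap C1 compile, instrumented to
                                      // collect types, branch counts, etc.
      } else if (m.tier == Tier::C1_Profiled && m.invokes >= C2_THRESHOLD) {
        m.tier = Tier::C2_Optimized;  // C2 recompile; inlining is driven
                                      // by the high-resolution C1 profile
      }
    }
    ```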
     
  • A complete replacement for the existing HotSpot CodeCache: the holder of all JIT’d code in the system. While *adding* code has always been easy, *removing* code has always been tricky (well, tricky to do without blowing all code away at once and without requiring all calls to indirect through a ‘handle’). Most large server apps slowly churn new code, so if you leak code you eventually run out of memory. The new CodeCache uses GC to control code lifetimes, and this results in a vastly simpler and less buggy structure all around. We also use GC to manage all the auxiliary data structures surrounding code, e.g. the list of “class dependencies” for a piece of JIT’d code is a standard heap object now. (A “class dependency” lists the set of classes & methods that a piece of JIT’d code assumes are NOT overridden; if a new class and/or method overrides one of these, then some inlining decision made by the JIT is now illegal and the JIT’d code needs to be deoptimized, removed & recompiled.)

    Besides being a common management point for all code, the CodeCache is pinned in the low 4-Gig. This means all hardware Program Counters can be limited to 32 bits (in our otherwise 64-bit system), and this is a tidy cost savings (shorter instruction sequences for calls, less I-cache space consumed, etc).
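
    As a rough illustration of the dependency half of this, here is a hypothetical C++ sketch of the check that runs at class-load time. The types and names are invented, not HotSpot’s:

    ```cpp
    #include <string>
    #include <unordered_set>
    #include <vector>

    struct CompiledCode {
      // Methods this JIT'd code assumes are NOT overridden anywhere.
      std::unordered_set<std::string> assumed_not_overridden;
      bool valid = true;
    };

    // At class-load time: a newly introduced override invalidates every
    // piece of JIT'd code whose inlining assumed it could not exist.
    void check_dependencies(const std::vector<std::string>& new_overrides,
                            std::vector<CompiledCode*>& all_code) {
      for (CompiledCode* code : all_code) {
        if (!code->valid) continue;
        for (const std::string& m : new_overrides) {
          if (code->assumed_not_overridden.count(m)) {
            code->valid = false;  // deoptimize; threads fall back to the
                                  // interpreter, and GC reclaims the dead
                                  // code once no frame references it
            break;
          }
        }
      }
    }
    ```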
     
  • Tons of internal JVM scaling work. We run on systems with hundreds of CPUs, and so we’ve found (and fixed!) any number of internal JVM scaling limitations. GPGC can run with hundreds of worker CPUs if needed. The JITs compile in parallel with dozens of CPUs (50 is common during a large application startup). Many internal VM structures have been made lock-free or have had their lock hold-times reduced by 10x or more. Self-tuning auto-sizing JIT/compiler thread pool. Concurrent stub/native-wrapper generation. Concurrent code-dependency insertion (during compilation) and checking (during class loading). Self-tuning finalizer work queues. Etc, etc, etc…
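
    For flavor, here is the classic lock-free push (a Treiber stack) in C++ – shown only to illustrate the *kind* of change “made lock-free” implies, not any specific Azul data structure:

    ```cpp
    #include <atomic>

    struct Node {
      int   value;
      Node* next;
    };

    std::atomic<Node*> head{nullptr};

    // Retry the CAS until it wins; no mutex, so many CPUs can push
    // concurrently without serializing on a lock.
    void push(Node* n) {
      n->next = head.load(std::memory_order_relaxed);
      while (!head.compare_exchange_weak(n->next, n,
                                         std::memory_order_release,
                                         std::memory_order_relaxed)) {
        // on failure, compare_exchange_weak reloads n->next for us
      }
    }
    ```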
     
  • Cooperative Safepointing allows thousands of *running* threads (not just alive-but-blocked-on-IO) to come to a Safepoint in under a millisecond. Merely safepointing hundreds of threads is down in the microseconds. Note that a full-on Safepoint does not happen until the last thread checks in, but the stall time starts when the first thread stops: the time-to-safepoint pause is measured from when the first running thread stops until the last thread checks in.
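
    Here is an invented C++ sketch of the cooperative idea – an explicit poll plus a check-in counter. (Real JVMs typically use a protected polling page rather than an explicit branch; this is illustration only.)

    ```cpp
    #include <atomic>

    std::atomic<bool> safepoint_requested{false};
    std::atomic<int>  threads_checked_in{0};
    int               total_threads = 0;  // set by the VM before requesting

    // The JITs compile a poll like this into loop back-edges and method
    // returns, so a *running* thread notices the request quickly.
    inline void safepoint_poll() {
      if (safepoint_requested.load(std::memory_order_acquire)) {
        threads_checked_in.fetch_add(1, std::memory_order_acq_rel);
        while (safepoint_requested.load(std::memory_order_acquire))
          ;  // parked until the VM operation completes
        threads_checked_in.fetch_sub(1, std::memory_order_acq_rel);
      }
    }

    // VM side: time-to-safepoint runs from the first thread stopping
    // until the last thread checks in.
    void begin_safepoint() {
      safepoint_requested.store(true, std::memory_order_release);
      while (threads_checked_in.load(std::memory_order_acquire) < total_threads)
        ;  // waiting on stragglers
    }

    void end_safepoint() {
      safepoint_requested.store(false, std::memory_order_release);
    }
    ```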
     
  • The ability to asynchronously stop & signal individual threads, to have them do various self-service tasks cheaper than a remote thread can do them. This includes, e.g., stack crawls for GC or profiling (a thread’s stack is hot in its own L1 cache and can be crawled vastly faster than by a remote thread), acknowledging GC phase shifts, or allowing code to be deoptimized (jargon for what happens to code that is no longer valid due to class loading). We can also efficiently do “ragged safepoints” – this is like a full Safepoint except we don’t need to stop all threads simultaneously. Instead we merely need to know when all threads have acknowledged, e.g., a GC phase shift. The threads “check in” as they individually acknowledge the Safepoint and keep on running. When the last thread has checked in, the “ragged safepoint” (and GC phase shift) is complete.
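
    A hypothetical C++ sketch of the ragged idea, using per-thread epoch counters (all names invented):

    ```cpp
    #include <atomic>
    #include <cstdint>
    #include <vector>

    std::atomic<uint64_t> global_epoch{0};

    struct Thread {
      std::atomic<uint64_t> local_epoch{0};
    };
    std::vector<Thread*> all_threads;

    // Each thread runs this at its normal poll points, then keeps going;
    // nothing stops globally.
    inline void ragged_poll(Thread* self) {
      uint64_t g = global_epoch.load(std::memory_order_acquire);
      if (self->local_epoch.load(std::memory_order_relaxed) != g) {
        // ...do the per-thread phase-shift work here...
        self->local_epoch.store(g, std::memory_order_release);  // check in
      }
    }

    // GC side: the phase shift is complete once every thread checks in.
    void wait_for_ragged_safepoint() {
      uint64_t g = global_epoch.fetch_add(1, std::memory_order_acq_rel) + 1;
      for (Thread* t : all_threads)
        while (t->local_epoch.load(std::memory_order_acquire) < g)
          ;  // that thread has not acknowledged yet
    }
    ```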
     
  • No more “perm-gen” space to run out or require a separate tuning flag. No more old-gen or young-gen tuning either. No GC-thread-count knobs, space/ratio tuning knobs, GC-age settings, or SurvivorXXX flags. GPGC takes no flags (except max total resources allowed), and runs well. There Is Only One Heap Space, and GPGC Rules It All.
     
  • A new thread & stack layout that lets us use the stack-pointer also as a thread-local storage pointer (the HotSpot “JavaThread*”) AND as a small dense integer thread-id (it takes 1 or 2 integer ops to flip between these forms). This frees up a CPU register for general use, while still allowing 1-cycle access to performance-critical thread-local structures.
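
    A sketch of how such a layout could work, assuming – purely for illustration – 2MB stacks aligned on 2MB boundaries and carved from one contiguous reservation, with the thread structure at each slab’s base:

    ```cpp
    #include <cstdint>

    constexpr uintptr_t STACK_SIZE_LOG2   = 21;               // 2MB slabs
    constexpr uintptr_t STACK_MASK        = (1ULL << STACK_SIZE_LOG2) - 1;
    constexpr uintptr_t STACK_REGION_BASE = 0x100000000ULL;   // invented

    struct JavaThread { /* per-thread VM state lives here */ };

    // Stack pointer -> thread-local storage base: a single AND.
    inline JavaThread* current_thread(uintptr_t sp) {
      return (JavaThread*)(sp & ~STACK_MASK);
    }

    // Stack pointer -> small dense integer thread-id: SUB then SHIFT.
    inline uint32_t thread_id(uintptr_t sp) {
      return (uint32_t)((sp - STACK_REGION_BASE) >> STACK_SIZE_LOG2);
    }
    ```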
     
  • A complete replacement for the existing HotSpot locking mechanisms. Our new locks are ‘biased’ (here’s the original paper idea), similar in theory to Sun’s +BiasedLocking but based on entirely new code. No more “displaced header” madness (this comment is probably only relevant to hard-core HotSpot engineers). Biased locks do not require ANY atomic operation or memory barrier during locking & unlocking, unless the lock needs to “change hands”. Since we can stop individual threads asynchronously, we have a fairly cheap way to hand biased locks off between threads. Once an individual lock demonstrates that it needs to “change hands”, we inflate that one lock (not the whole class of locks) and it becomes a “thin lock” as long as contention stays low, switching over to a “thick lock” only when there are threads waiting to acquire it.
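
    To show why the fast path needs no atomics, here is an invented C++ sketch of a biased-lock fast path (none of these names or details are our real implementation):

    ```cpp
    #include <cstdint>

    struct LockWord {
      uint32_t bias_owner;  // dense thread-id the lock is biased toward
      uint32_t recursion;   // nested-lock depth
    };

    uint32_t self_id() { return 1; }  // stand-in; really derived from SP
    void slow_path(LockWord*) { /* revoke bias, hand off, or inflate to a
                                   thin lock (CAS) or thick lock (queue) */ }

    // As long as the same thread keeps locking, there is NO atomic op
    // and NO memory barrier – just plain loads and stores.
    inline void lock(LockWord* lw) {
      if (lw->bias_owner == self_id()) { ++lw->recursion; return; }  // fast
      slow_path(lw);  // lock is changing hands
    }

    inline void unlock(LockWord* lw) {
      if (lw->bias_owner == self_id()) { --lw->recursion; return; }  // fast
      slow_path(lw);
    }
    ```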

    The issues here are fairly complex and subtle and deserve an entire ’nother blog!

That’s enough for this Blog. More later…