Another round of Micro-benchmark Advice

I ran across this article,
and since Heinz is a friend I thought I’d try to figure out what’s going on.   Here’s what I came up with:


There are 3 or 4 conflicting effects and which one dominates at any point in time “depends”.  All of the effects can be removed with some care.

  • OSR: all code is in a loop in main.  The -server compiler makes good code for hot looping methods; the next time that method is called the good code runs.  Alas, ‘main’ is never called again.  So after a time (slowly) interpreting the code, HotSpot makes mediocre code for “the middle of the method” and does an On-Stack-Replacement of the interpreter frame with the compiled frame.  The -client compiler is invoked for loop-containing methods immediately, but makes less optimized code.  Fix: drive all timing from modest-count outer loops, which then call methods which themselves have a long trip-count loop:
    • for( int i=0; i<100; i++ ) test_one();
    • void test_one() { for( int i=0; i<1000000; i++ ) do_stuff(); }
  • Profiling ends compilation: after compiling the hot loop the -server compiler notices that it’s reaching code that’s (1) never been executed and (2) full of classes that have never been loaded.  It stops compiling, and issues an “uncommon-trap” – HotSpot jargon for flipping from compiled code back to the interpreter.  The -client compiler usually compiles all the code in a method no matter how hot or cold.  Fix: run *all* test code during the warmup period, which will force all classes to be loaded.  Call all work methods from some top-level dispatch function which itself will be profiled, hot and compiled.
  • Inline Caches: HotSpot uses an inline-cache for calls where the compiler cannot prove only a single target can be called.  An inline-cache turns a virtual (or interface) call into a static call plus a few cycles of work.  It is a 1-entry cache inlined in the code; the Key is the expected class of the ‘this’ pointer, the Value is the static target method matching the Key, directly encoded as a call instruction.  As soon as you need 2+ targets for the same call site, you revert to the much more expensive dynamic lookup (load/load/load/jump-register).  Both compilers use the same runtime infrastructure, but the server compiler is more aggressive about proving a single target.  Fix: either expect the calls to be single-target and fast, OR force all calls to be multi-target and slow.  The multi-target solution is easier for this kind of test.
  • Bi-morphic (NOT poly-morphic) call site optimization: Where the -server compiler can prove only TWO classes reach a call site it will insert a type-check and then statically call both targets (which may then further inline, etc).  The -client compiler doesn’t do this optimization.  Fix: either Do or Do Not allow 2 targets for the result of calls.  Usually it’s easy to arrange for 1 target (the norm, and inlined case) OR many more than 2 targets.
  • X86 BTB: Some X86 chips include a branch-target-buffer prediction mechanism, which can sometimes predict the target of indirect branches.  Fix: this one’s harder to control, but a light-weight pseudo-random selection of targets will often defeat the hardware, i.e., make an array of Foo objects populated with various random selections of Foo subclasses, and make virtual calls against those.
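
Here’s a rough sketch of how those fixes might fit together in one harness.  It is only an illustration under my own assumptions: the class name MicroBenchSketch, the Foo interface and its four subclasses, the seed, and the loop counts are all made up for the example, and the actual timings will depend on your JVM and hardware.

```java
import java.util.Random;

public class MicroBenchSketch {
    // Hypothetical work interface with four implementations, so the call site
    // in testOne() sees more than two receiver classes (neither single-target
    // nor bi-morphic).
    interface Foo { int work(int x); }
    static final class FooA implements Foo { public int work(int x) { return x + 1; } }
    static final class FooB implements Foo { public int work(int x) { return x + 2; } }
    static final class FooC implements Foo { public int work(int x) { return x + 3; } }
    static final class FooD implements Foo { public int work(int x) { return x + 4; } }

    // Fill an array with a pseudo-random mix of Foo subclasses.
    // The seed is fixed so that runs are repeatable.
    static Foo[] makeTargets(int n, long seed) {
        Random r = new Random(seed);
        Foo[] foos = new Foo[n];
        for (int i = 0; i < n; i++) {
            switch (r.nextInt(4)) {
                case 0:  foos[i] = new FooA(); break;
                case 1:  foos[i] = new FooB(); break;
                case 2:  foos[i] = new FooC(); break;
                default: foos[i] = new FooD(); break;
            }
        }
        return foos;
    }

    // Long trip-count loop in its own method.  Because this method is called
    // repeatedly, later calls can enter normally compiled code instead of
    // relying only on OSR.  Assumes foos.length is a power of two.
    static long testOne(Foo[] foos, int trips) {
        long sum = 0;
        int mask = foos.length - 1;
        for (int i = 0; i < trips; i++) {
            sum += foos[i & mask].work(i);   // virtual call with many targets
        }
        return sum;
    }

    // Top-level dispatch: a modest-count outer loop that calls the work method.
    static long runAll(Foo[] foos, int outer, int trips) {
        long sum = 0;
        for (int i = 0; i < outer; i++) {
            sum += testOne(foos, trips);
        }
        return sum;
    }

    public static void main(String[] args) {
        final int OUTER = 100, TRIPS = 1_000_000;
        Foo[] foos = makeTargets(1024, 42L);

        // Warmup: runs all of the test code, so every class is loaded and
        // every path has been executed before timing starts.
        runAll(foos, OUTER, TRIPS);

        long t0 = System.nanoTime();
        long sum = runAll(foos, OUTER, TRIPS);
        long t1 = System.nanoTime();

        // Print the sum so the work cannot be treated as dead code.
        double nsPerCall = (t1 - t0) / ((double) OUTER * TRIPS);
        System.out.println("sum=" + sum + " ns/call=" + nsPerCall);
    }
}
```

The same array of mixed subclasses is used for both the warmup and the timed run, so the timed call site has already seen every target it will ever see.  To measure the single-target case instead, populate the array with only one subclass.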

Good luck with those micro-benchmarks,