DaCapo is a small (but growing) research group focused on GC & Memory Management issues. They meet about every 9 months, and the meeting bounces around depending on who is hosting it. People mostly present work-in-progress, and the audience is up front with their critiques… often interrupting the speakers or getting sidetracked into loud off-topic conversations, i.e., it’s a fun boisterous crowd.
DaCapo is in the Boston area this year, held at Tufts University’s Medford campus (everyone’s heard of Harvard and MIT, but how about the Berklee College of Architecture, Fisher College, Boston University, Suffolk University and dozens and dozens more? – Boston has more colleges and universities per square mile than any other place I know). As usual for me traveling to Boston, I skipped the car and just used public transportation – the T works marvelously well (as compared to e.g. VTA; crossing the entire town from Logan Airport stuck way out in the water to Tufts some 8 miles away took 1 subway transfer and 1/2hr, while crossing San Jose on the VTA takes something like 2 hrs).
Anyways the weather was supposed to be highs of 65 and rainy with lows in the 40’s – so I packed my parka. Turned out the weather report was a little bit off, with highs in the upper 80’s and sunny. Man did I roast. But I got out and enjoyed the sun – we (the entire DaCapo group of 40 or 50 people) walked about a mile to lunch each day, plus I walked to and from my hotel & the T stop (also about a mile). Enough ruminating, on to the talks!
As usual, I pay attention to some talks and chat w/neighbors through others. I brutally summarize. I’m opinionated & narrow focused… so Caveat Emptor.
What changed? Who did it? When? Points out the flaws of using ‘diff’.
Logical-Structure diff. Auto-magically spotted systematic change patterns.
It’s a better code-review diff.
Can report diff’s as reports like “for all XXX, replace XXX with YYY except for ZZZ_XXX_ZZZ”, etc. Vastly better error reporting!!! Vastly better diff reporting!!!
Cliff Notes: We must get this gizmo and check it out!!!!
Paper report: http://users.ece.utexas.edu/~miryung/Publications/icse09-LSdiff.KimNotkin.pdf
Sample output: http://users.ece.utexas.edu/~miryung/LSDiff/carol429-430.htm
Inferred Call Path Profiling
Very low cost path profiling. Hardware counters have very little cost, but provide little context. Software is reversed: high context & high cost.
Cliff Notes: Compare this to call/ret RPC hashing.
Cliff Notes: Bond & McKinely does something similar.
Instead, look at number of bytes from main() to leaf call on stack. Basically PC & SP (offset from stack base) provides all the context we need. Dynamically detect unique PC & SP (they did it in a training run) pairs, and reverse them to get a call-stack.
Lots of duplicated stack heights, how to tell them apart? On SpecInt unique about 2/3rds of time (2/3rds of PC/SP pairs map to a single call stack). But huge spread; really sucks bad on some benchmarks.
If have a list of entire call-graph, then can remove ambiguity by growing some frames. But requires some global analysis to know which frames to grow in order to be unique. “Solved” problem with a simple greedy search. Compute all call-stacks over whole program run, pick a conflict, grow some frame, re-compute & keep if more PC/SP pairs are distinct. Repeat until it’s rare enough that can live with dups.
Cliff Notes: The whole-program ahead-of-time thing isn’t feasible for us. How to do this on-the-fly during tick-profiling? Hash PC/SP, find a tick record, 1-time-in-1024 check for dups (crawl stack the hard way and confirm call-stack), otherwise assume unique & increment tick record. If MISS in hash table, then crawl stack the hard way. If record is labeled as having dups, then crawl the call-stack & find true record?) Actually, probably expect the call stacks to be common, but the PC’s to be fairly unique. So really asking for unique RPC+SP (not PC+SP).
His experiments: collect PC/SP pairs about 100 times/sec. 0.2% slowdown (must have had careful measurements to detect that!)
Breadcrumbs – Recovering Efficiently Computed Dynamic Calling Contexts
Also builds on probabilistic calling contexts.
Given a (nearly) unique ID (say, computed at each fcn entry via a simple hash on the return PC). IN the past, reversing these unique ids: use a dictionary (very large & high cost), use a calling-context-tree (smaller, but requires you to track where you are in the CC-tree at runtime). Forward search, but undecidable and limited options for pruning.
Attempt to reverse the hash fcn (but only when we need to take the semi-unique id) and convert it to a call-stack. Straight-up reverse is no good, but is easy to guide search. Callers of leaf call themselves have structure – e.g. expect to see a call-op from the call-site, etc. Still need a list of all call-sites (generally available anyways because of inline-cache structures). His function is “p’ = (3p+c)”, where “c” is RPC? Requires a handful of instructions on function entry.
Cliff Notes: If this reverse search is successful most of the time, it says that during a profile tick we can record just the PC/SP (or RPC/SP/PC?) and then reverse them to the entire call-stack later – on demand when RTPM requests. So a ‘tick’ is no more than record a few bits (and handle buffer overflow), but no stack crawl.
Cliff Notes: combined nicely with ‘null-assignment-stack’ info, ie. collect the entire stack and ‘color’ all NULLs with a unique stack id. Then if get a NPC can report both the NPC at the crash site, but also the call-stack that assigned the NULL in the first place.
Transparent Clustering – Phil McGachey, Eliot Moss, Tony Hosking
Attempt to auto-distribute multi-threaded programs. Pure Java, no JVM hacks. No programmer involvement. Dynamic class re-writing. JVMs can vary. Hardware can vary.
All machines on a network run RuggedJ nodes. RuggedJ includes a rewriting class loader, some runtime libs, and some application-specific partioning strategy.
Rewriter: generate lots of classes & interfaces; especially hook methods & fields. Hook into the runtime. Abstract over object location; remote & local objects.
Ex: Standard class “X”. Rewrite into a X_Local and X_Stub classes. X objects on Node A are of type X_Local. On Node B, X objects are really X_Stub objects where all the calls are remote calls. Both X_Local and X_Stub implement the “X” interface. Some other Y_Local can refer to X_Local on Node A, but refer to X_Stub on Node B. If want to migrate from Node A to Node C, want to indirect the X_Locals; so for mobile objects there’s an indirection layer called “X_Proxy”, that then be switched from pointing to an X_Local vs an X_Stub (if the object migrates). Non-mobile objects skip the X_Proxy indirection layer.
Inheritance & interfaces are maintained in the rewritten classes. Static classes conform to the RuggedJ model (I assume replication is allowed?). Arrays are also wrapped to support indirection.
Constraints: Native code breaks everything. Try to not rewrite stuff referred to by native code (maybe can intercept all native calls and rewrite args?). System code is also an issue: loaded by the system loader, is not rewritten. (Can they use -Xbootclasspath?). But most code run is actually system code! (eg. String libs are very common!)
So must deal with system code without changing the JVM.
Can wrap system class objects. Must unwrap when passing to system or native code; must re-box on return. Re-wrapping is expensive: need the exact same wrapper, so need some kind of hash-lookup to do re-wrapping. High overhead, so avoided where possible.
Extend: some classes can be extended and preserve the original class & signatures. Extended class follows the RuggedJ model but also the original system model. No need to unwrap/re-wrap. Does not work e.g. when system code generates the object (since sys code makes instances of the original and not of the extended superclass).
Some system classes are only ever referenced from user code; can just dup the code out of rt.jar and make a user-class (which now gets rewritten by the normal RuggedJ model).
Immutable classes. Just replicate.
Apparently able to cluster some complex programs.
No idea about performance as compared to explicitly clustered apps.
Jikes RVM Native Threads
Was Green threads, now Native threads
(everybody else is dropping green threads too….)
Motherhood & Apple Pie.
Computer Science as an Experimental Science
Reproducibility? Experimental bias?
Show how these problems are solved in very-high-energy astrophysics.
(1) No publish without multiple colleague confirmation.
(3) MUST also now publish reproductions or non-reproductions’s of other peoples’ experiments (or lose credibility as a research entity). But bar to publication is lower, and expectations are lower. i.e., people expect to publish in-progress work
Biggest comp-sci problem now: lack of infrastructures.
Examples of infrastructure bias: green-vs-native threads, some barrier styles are hard to use (.Net embedded objects), stack-scan (OVM makes fast stack scanning at the cost of everything else being slower). HotSpot (high-value compiler, good GC, good most everything) vs JIKES.
Cliff Notes: This was actually a really fun talk (and lots of discussion) despite my paucity of notes.
Web 2.0 Profiling
Do web 2.0 workloads differ from legacy, e.g. SpecJVM98, DaCapo suite.
Benchmarks: IBM social networking software; Java PetStore 2.0, AJAX-based RIA using JEE5, uses Lucene search.
Apply workload: viewBlog (from 9 blogs), createBookmark (total 14), selectTag (PetStore 9), etc.
Workload mix can be varied, but based on commonly observed usage pattersn from internal deployment of Lotus Connections).
Wanted to remove DB from the workload, put DB on ramdisk. Same for LDAP. Used Grinder to generate load. Main server running J9/WebSphere/Lotus and also Sun Glassfish/Petstore both on 2-way power5 1.66GHz. Fairly reasonable setup for a small web/app load.
See lots more chattiness; very short fast requests, high interrupt rates, smaller requests but many more of them. Looked at low-level profiling. See fewer I-cache misses as compared to a large legacy apps. Better I-cache behavior. But data-cache miss rate is still very significant.
Also different transactions have different scaling limitations. For PetStore the Java persistence layer started falling off after 4 cpus.
“Spade” – Cluster stream processing at IBM. Continuous high-level streams. Stream arrives for weeks & months; want to process and then emit stream data continously. Want to express the problem with a language.
Priorities – fast streaming on a cluster; then Generality; finally Usability. Beyond StreamIt. Like StreamIt, works with streams that can be split & joined. Language is typed; stream contents are strongly typed.
Demystifying Magic – High-Level Low-Level Programming
Seen this talk before, but this time given by Steve Blackburn. Nice historical perspective on writing system software in C vs asm – and the parallels to writing sys software in Java vs asm.
Key Principle: Containment. Write as much as possible (99%!) in pure Java, and dive into the low level “magic” bits as little as possible.
Extensibility: Requires change quickly, languages change slowly.
Encapsulation: contain low-level semantics
Fine-grained lower of semantics: minimize impedance, seperation of concerns
Extend Java semantics; intrinsics
Scoped semantics changes
Extend types; un/boxing, ref-vs-value, architecture sizes
Pragmatic: retain syntax where possible to keep front-end tools working
Introduce raw pointer type: “Address” looks like a Java type.
E.g.: Address a = …; … = a.loadByte(offset);
Additive: “its ok here to have unchecked casts”
Subtractive: “no allocation via new allowed”, similar to HotSpot VerifyNoGC.
Allow “unboxed” types – no v-table, so it’s a bare primitive type. But this isn’t the default (but can be asked for).
Abstraction results are good: can implement a JVM on bare hardware and with a simple change essentially virtualize the same JVM inside a GC debugging harness running inside another JVM (where the virtualized JVM’s heap is just a big array in the outer implementing JVM), etc.
Grace, Safe Multithreaded Programming for C/C++ – Emery Berger
Seen the paper before, but this is the talk by Emery.
Fake sequential semantics on stock multi-threaded hardware using fork/join and process protection.
Grace is a pthreads replacement: “g++ -lpthreads” becomes “g++ -lgrace”.
Speedups: in the absence of common benign races, Grace run programs run about 10% slower than raw pthreads – but have full sequential semantics. It’s “as-if” all the parallel programs are run sequentially immediately at the thread spawn point.
CAN’T run applications where threads run forever, i.e. reactive or server apps.
Works well with fork/join, Cilk, TBB, etc.
So thread spawn becomes a unix fork with COW, use mmap for allow memory sharing. At join points, smash in join’d threads memory updates via mmap. Also need scalable thread-local malloc heap, plus aligned globals (to avoid false-sharing at the page granularity), plus some improved I/O.
Some simple measures remove nearly all false sharing. Big one: everybody mallocs into their own space. 2nd big one: spread the globals out 1 per page.
Thread-spawn is as cheap as a fork on linux (has experiments to show it). Due to thread-cpu affinity, if you spawn a bunch of threads they share a single CPU. At the low end of thread-grain-length, fork is faster than spawn because the scheduler immediately spreads the processes across CPUs, whereas threads share for awhile. So fork is actually faster than thread-spawn for awhile.
Real performance limiter for Grace is conflicts & rollbacks, and not thread-spawn overheads.
Performance is much better than e.g. STM, because after the 1st touch on a new page (and the SEGV & accounting), all accesses on that page run at full speed.
GC Assertions: Using the GC to check heap properties
Global knowledge is required to understand seemingly local properties.
order = ….;
? are all ‘orders’ dead here?
But actually ‘order’ is held by the last-customer-transaction (as an optimization, or convienvence for customer?). Leads to a leak of ‘orders’. Really: programmer does not understand program, too complex.
Add assertions to the code, and use the GC to check the property.
Cliff Notes: Azul is already gathering global heap properties during GC – but not asserts. Gathering liveness & points-to info. If points-to included a handful of sample real instances (and all the sames linked when possible), would be a very powerful way to instantly check some heap properties.
‘assert_dead(p)’ – expect to reclaim ‘p’ at the next GC cycle
Or region-collection: start_region(); … assert_all_dead();
Or shape property: assert_unshared(p);
Or ownership properties (members of a collection are ‘owned’ by the collection)
If an assert fails, then do a slow crawl and provide the full path through the heap showing the failure. Asserts only checked at GC points.
Proving Proving Correctness of Abstract Concurrency Control and Recovery – Trek Palmer & Eliot Moss
Transactions – closed nesting has issues
Open nesting & Boosting need from programmer: conflict info & roll-back info.
These are hard to provide correctly.
This work: a description language of abstract data types or structures.
Can describe the conflict predicates & inverses.
Can prove correctness of the conflict & inverse expressions.
Working with the abstract description of the data structure, and NOT e.g. with the real Java implemention. E.g., no loops, no mutations, no recursion. So the language isn’t a ‘real’ language, but can describe many kinds of ADTs in code that looks sorta like Java.
Output isn’t a functional program, instead the output is the result of using a SAT solver to prove correctness.
XTN-semantics allows conflict detection to be proven correct pairwise, instead of having to do full interleaving. Formally, a conflict
predicate is correct iff it is true when the operations do no commute.
The conflict predicate tells when 2 XTNs conflict, and the inverse allows an optimistic XTN to rollback.
Obvious use-case: use this tool to write a transactional-version of NBHM or other JDK concurrency utilities.
Hard Real-Time GC Scheduling
– Periodic Scheduling
– GC runs at highest priority, but periodically yields to mutator
– Slack-based Scheduling
– GC runs at lowest priority
– Can be preempted at any time
– Makinac (Sun RTS)
– Work based Scheduling
– GC runs at allocation time
– Problem with allocation-rate jitter
HRT – Deadlines must never be missed AND must be verifiable analytically that no deadlines are missed. Systems tend to therefore be very simple, so that they can be proven.
OVM – like Metronome. Dynamic defrag; arraylets; Brooks style barrier; replicating barrier/; incremental, concurrent, supporting slack-based style scheduling.
Understanding Performance of Interactive Applications
Typical profilers give the wrong numbers: they report total time spent, not time spent between mouse-click and page updated.
AWT/Swing & SWT – similar: single-threaded, gui thread in “event loop”, relies heavily on Listener model. So profile using gui call-back Listeners between user events and the eventual display change.
Really simple idea: profiling & event visualizer tool between click & view times.
Trouble w/refactorings: must preserve program behavior with each step. But if we want multiple refactorings then at each step along the way fixup code (needed to preserve program behavior) pollutes the code. Want to do multiple refactoring steps at once – while leaving the program possibly broken in the in-between steps.
So compute a program-semantics (names & def-use chains), and let the user make partial refactoring steps and then declare “I think I’m done” – and the system compares the new program-semantics with the old, and reports if they are not equal.
So actually comes up with primitive not-quite refactoring steps, which he will compose to build larger “real” refactorings, or compose lots to build a multi-stage refactoring.
HotSpot C2 uses a BURS framework. Very hard to optimize & debug. So these guys have a semantics which can span both the ideal IR and real machine instructions. Describe both using “CISL” and the tool does the mapping.
Once the tool has a mapping, the user provides an adapter that converts the mapping into the code needed by the compiler back-end. This would replace e.g. the ADLC. Gives examples of machine encodings for PPC and Java bytecodes.
CISL is given an IR tree and produces a target tree with equivalent semantics.
Not sure it’s entirely better than BURS (maybe no worse), but I’m still sold on rewriting in e.g. C++/Java hand-written greedy match rules. These code-gen-gen tools are pain for long-term maintenance.
Do Design & Performance Relate?
Is fast code required to be ugly? Is beautiful required to be slow?
Pick 200 code metrics. Also pick some performance metrics (cycles, single threaded, objects created, etc). Could not follow talk… or at least he wasn’t very clear with the objectives & results (if any). Interesting idea though.
Do you have any pointers to the paper — if any — that went with the Rhodes Brown talk? Googling did not yield anything useful 😉
Posted by: Andy Georges | Jun 4, 2009 12:43:54 AM
Try: “Statistically rigorous java performance evaluation” and “Wake up and smell the coffee: evaluation methodology for the 21st century” as a start.
Also this paper: “Relative factors in performance analysis” – late in section 5 shows that a minor offset in the VM’s text segment can vary performance by 6%; compiling in Intel HPM code (and not using it) can vary performance even more.
But I can’t find a reference to the result that “changing your environment variables changes your stack/text placement changes performance by 10%” – but I’ve seen the effect & worked on fixing it before (this & a related effect triggered Motorola to do a i-cache code placement optimization in ~1995).