On How I Went to 2008 PLDI and All You Got Was These Lousy Notes


ISMM and PLDI are in Tucson this year, in June (and CGO was in Boston in February – next year I hope they swap!!!).  Outside temps are expected to hit more than 105 in the shade.  My Dad lives in Tucson, so I went a day early to visit.  He’s recovering from a disastrous surgical accident, slowly but surely.

 

The conference center is in a new resort, located in the desert. Indeed the days were very hot and dry – but mostly spent indoors.  I didn’t have very high expectations, but actually the place was very nice.  The landscaping was just superb; they had winding shaded paths running around the resort with labelling on the desert plants.  The staff was nice and competent and the bathtub was ridiculously oversized.

 

Monday I hiked the Ventana Canyon trail.  I started at 3:30 in the afternoon – hot part of the day, I know, but I wanted to be back before dinner.  I packed a bottle of water, a broad sun hat and layered on the sunscreen.  The trail itself winds through private land for a mile or so before entering national parklands.  It’s a rocky and winding trail, crossing a dry creek bed repeatedly as it slowly climbs into the canyon.  The trail is lined with saguaro cactus, other cactus of every type, the occasional acacia tree and a surprising amount of wildlife.  I saw a dozen jackrabbits, a 3-foot snake (not a rattler), and loads of lizards and birds.  A bright red cardinal landed 3 ft from my head and stared at me; the air was loud with owl hoots and the buzzing of a kind of large heavy hornet.  Some kind of tree was popping seed pods; every few moments there’d be a small pop and the rattling of seeds bouncing around the stones.  My feet crunched on the gravel in the hot sun except where I rock-hopped across the dry creek bed.  At times the trail came right up to the rock canyon wall, other times I was wandering back and forth across a narrow valley.  The trail climbed slowly in the initial part, but as the canyon narrows it starts to climb steeply up a ridgeline.  This part of the hike was slow going for me; it was still hot, with nearly 1000ft of elevation gain.  I metered my water carefully – there’d be no refills till I got back – and I took lots of rest breaks.  Finally the trail levels out by my destination – the promised Maiden Pools.  At this time of year, they are reduced to small scummy puddles covered with bugs – definitely a disappointment!  After a short break, I turned around.  The ridgeline had lifted above the canyon and I was treated with gorgeous views of Tucson.  The trip down was a lot quicker than the trip up, and mostly in shade from the narrow canyon walls.  I got down about 7, nearly 3.5 hours after starting on a 6.4 mile hike (which is definitely slow going for me!).

 

I saw my Dad again on the last day; he’s finally come home from the hospital.  He’s been in the hospital for nearly 2 months, so this is a really welcome change!  He’s in good spirits and is on the mend. 

 

The trip home started out fine: a quick flight to Phoenix from Tucson, then a short layover and a flight home… except we got a slow start, and then a slower start, and then they canceled the flight while we were all sitting on the plane.  Apparently we were too slow, and San Jose’s curfew kicked in.  US Airways put us up in a Clarion hotel about 20 minutes away.  I skipped dinner thinking I could eat at home, and as of this writing it’s 5am; I’m back in the airport before any food joints open, waiting for the day’s first flight to San Jose.


As usual with these trip reports, I summarize brutally (and only those talks I actually attended) and sort by my personal agenda.  No offense intended, and if you think I didn’t do your talk justice please correct me here!

Some Cool Stuff

 

Building The Billion Dollar GC
Bounding Leaky Programs
Limits of Heap Compaction
Register Allocation via Puzzle Solving
Register Allocation/Coalescing paper
Foundations of the C++ Concurrency Memory Model
Race Directed Random Testing of Concurrent Programs
SharC – Checking C code for Locking Invariants
Automatic Volume Management for Programmable Microfluidics
Al Aho – Quantum Computer Compilers

 

Some Memory Management Papers

 

Immix – A Mark-Region Collector
Concurrent Real-Time Copying Garbage Collection
Limits of Parallel Marking
The Closer: Automating Resource Management in Java
Path Specialization: Reducing Phased Execution Overheads
Region-based Memory Management
Efficient Dynamic Heap Allocation of Scratch-Pad Memory
Virtual-Memory-Compaction….
Supporting Superpage Allocation
MPADs
Parallel GC with a Block-Structured Heap
Memory Management for Self-Adjusting Computation
Combine Ref Counting with Functional Program
A Study of Java Object Demographics

 

Some Concurrent Programming Papers

 

Data-Race Detection via Linear Programming
“sketch” – A Tool for Writing Concurrent Data Structures
Velodrome
Inferring Locks for Atomic Sections
Model Checking Concurrent Programs

 

Other Stuff

 

Expressive and Safe Static Reflection with MorphJ
Liquid Types
Type-Preserving Compilation for Large-Scale Optimizing Object-Oriented Compilers
Execution Indexing Problem
XMEM – Type-Safe Shared Memory for Cross-JVM Communication

TuningFork (IBM’s trace visualization tool and open trace format) is now open-sourced on SourceForge.


Building the Billion Dollar GC

 

David Bacon, Keynote

 

Metronome – worth several hundred million $$$ in revenue to IBM.

 

(David opened with an amusing story about the failure of technology – his GPS and backup GPS both failed as his sailboat approached the reefs surrounding Bermuda; the weather was too poor to use a sextant reliably).

 

1999-2001, Recycler.  Ref-counting based GC.  Max pause: 2.6ms, MMU 72% @10ms.  But has pathologies – no bound on time with large numbers of cycles, or on large cycles.  No compaction.  Required a GC processor on the side; no way to define how beefy the GC processor had to be.

 

Next project: worked really really hard to get rid of pathologies.  Wanted robust performance.  Decided on Snapshot-at-Beginning because at the time you declare the snapshot, you bound the amount of work to complete the cycle.  Was worried about the incremental-barrier approach (which is exactly the problem Azul solves with our read-barrier).

 

Real-Time also means Real-Space.  So some space obsessions.  Chose a segregated free-list.  Lose a little memory to fragmentation.  Compute ‘rho’ – the fragmentation lossage ratio.  Picking a ‘rho’ defines how much memory (max) can be lost to fragmentation.

 

Array-Lets – chosen to handle compaction of large arrays well (most allocators have a large-object space which doesn’t compact).

 

Would be very interesting to run Fragger on Metronome.

 

2001-2003: 2nd go-round.  Max pause 12ms (up from 2.6ms).  MMU 45% @20ms (vs 72%@10ms).  Much worse than last time.  BUT!!!  Very robust against poor heap shape and bad apps; immune to pathologies.

 

Now comes the 3rd attempt – built into J9 (IBM’s production VM).  (Sidebar – RTSJ doesn’t “work” at large scale; the scoped memory doesn’t “compose”.  Modules from different places, when run together, have unpredictable scope failures.  Raytheon was choking trying to use RTSJ on the next-gen Destroyer project; saw Metronome and said “we’ll buy it now”.)

 

2004: went crazy trying to get it ready for Raytheon.  1st demo for Raytheon: max pause 872us, MMU 65% @10ms, 79% @20ms.  Finally IBM Software Group agreed to pick up the project (for $150M in sales!)

 

2005-2006: Lots more features added: AoT compilers, timer modules, priority inheritance, etc.  Plus must make “real-time friendly” all the other cruft: JVMPI/TI, debug and RAS tools, weird refs (soft, weak, final, JNI), heap format/dump, class libs, all the RTSJ libs, ArrayLet support, read-barrier support in class libs, documentation, etc…..

 

Plus testing 24×7, including performance testing, pause-time testing, etc.

 

Today – a new multi-cpu version is being worked on.  Still only available on blades – max 8 cpus.  Not tried a 100G heap; no idea what the limit of sustainable allocation rate is.


Bounding Leaky Programs

 

Non-mutating Mutator – Take a snapshot from a real mutator thread; run it with massive profiling (e.g. all memory refs snapshotted).  The NonMutator cannot update the heap; it’s just profiling.

 

Bounding Leaky Programs.  After a leaky program eats all of memory, most of memory is reachable – but never used again.  When it throws an OutOfMemory – just capture the OOM.  Then ‘cut’ a random pointer edge, and get to reclaim most leaky memory.  ‘Pick a random’ – since most memory is leaked and dead, you’re likely to be cutting off dead stuff.  But can also use most-common heap-type and staleness data (pages never touched in a long long time).  “Poison” the cut edge, so that if that pointer is actually used, you then throw the OOM you would have thrown originally.  This can postpone OOMs from leaky programs anywhere from 10x to 100x (“forever”).

 

A measure of heap fragmentation – sum-of-(freeblocksize^2) / (sum-of-freeblocksize)^2

 

Gives a number from 0 to 1 –

  •   1==maximally_compacted (one big free block),
  •   0==maximally_fragmented (many tiny blocks).

 

Can convert this number into a characteristic free-block size.  I.e., if we took the total free space and split it into equal-sized free blocks – which block size reproduces this metric?

 

Can I use this to decide “how likely a future allocation request will fail due to fragmentation”?  Probably…
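To make sure I had the metric right, here’s how I’d code it (my sketch, not the speaker’s code):

    // Sketch (mine) of the fragmentation metric above.  A single coalesced
    // free block scores 1.0; many tiny blocks score ~0.
    static double fragMetric( long[] freeBlockSizes ) {
      double sum = 0, sumSq = 0;
      for( long b : freeBlockSizes ) { sum += b; sumSq += (double)b*b; }
      return sumSq / (sum*sum);
    }
    // The characteristic free-block size: the uniform block size that would
    // reproduce the same metric.  Algebraically it is just sumSq/sum,
    // i.e. metric * total-free-space.
    static double characteristicBlockSize( long[] freeBlockSizes ) {
      double sum = 0, sumSq = 0;
      for( long b : freeBlockSizes ) { sum += b; sumSq += (double)b*b; }
      return sumSq / sum;
    }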


Limits of Heap Compaction

 

Compare 20-zillion different heap-compression techniques across a wide variety of heaps, and against e.g. gzip/LZW style compression.

 

For some applications (e.g. Xalan) see heap compressions upwards of 50% or more.  Some of the techniques are cheap and easy to use as well.  Trimming trailing zeros from arrays is a big one.

 

Biggest winner is zero-based object-compression.  Back each field with a bit; store only the non-zero fields.  Zero fields have a bit set in the backing bitfield.  Nearly 45% heap-size reduction across all heaps.
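A sketch of how I understand the trick (my code, using the bit polarity from the notes above: a set bit marks a zero field):

    // Compress one object's int fields (assumes <= 32 fields): a backing
    // bitmap with a bit set per zero field; only non-zero fields are stored.
    static int[] compress( int[] fields ) {
      int zeroBits = 0, nonZero = 0;
      for( int i = 0; i < fields.length; i++ )
        if( fields[i] == 0 ) zeroBits |= 1 << i;
        else                 nonZero++;
      int[] out = new int[1 + nonZero];
      out[0] = zeroBits;
      for( int i = 0, idx = 1; i < fields.length; i++ )
        if( fields[i] != 0 ) out[idx++] = fields[i];
      return out;
    }
    static int readField( int[] c, int i ) {
      if( (c[0] & (1 << i)) != 0 ) return 0;  // bit set => field is zero, not stored
      // count the non-zero (stored) fields below i to find the slot
      int below = i - Integer.bitCount( c[0] & ((1 << i) - 1) );
      return c[1 + below];
    }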

 

Used the Dacapo benchmarks + pseudojbb, 20-25 heap snapshots per benchmark.  Savings of each technique are very stable over time (all snapshots show the same savings per benchmark).


Register Allocation via Puzzle Solving

 

Solving for register-pairs.  Split all live ranges maximally; split before and after EACH instruction, using a complete parallel-rename. 

 

Very graphical; ‘puzzles’ represent the short live range.  Puzzles have a top-area and a bottom-area for the incoming and outgoing registers.  Some puzzle pieces only go on the top, or bottom, or (if it’s a 2-adr live range) it will cover both.  Never have a piece that can go either on the top OR on the bottom. 

 

Tried 4 different reg allocs on X86, using LLVM 1.9: puzzle solving, linear scan (default in LLVM), iterated register coalescing, and PBQP (partitioned boolean quadratic programming).  Had to spill and retry 5% of the time.  Puzzle-solving is fast; equiv to the existing well-tuned linear scan.  Needs to insert about 7% more spill ops than no allocation (compare to C2’s graph coloring: about 2% of all executed ops are C2 spill ops).

 

Cliff Notes: A nice way to handle mixing register pairs ala X86 in a linear-scan allocator.  Alas, as with all linear-scan allocators the question is what happens at merge points.  L-S allocators always do well with large basic blocks, but the typical Java basic block is 5 instructions long (HotSpot “-server”, so plenty of inlining and optimization).


Register Allocation/Coalescing paper

 

Conservative coalescing vs. aggressive coalescing.  But coalescing breaks the chordal-graph property of SSA form (and chordal graphs can be colored in linear time!!!).

 

So do not modify the graph; instead modify the coloring to get a good coalescing.  Optimistically take the existing coloring (easy, from the chordal-graph SSA coloring), then flip a node’s color to coalesce an edge – but this introduces a conflict – so resolve the color clash by flipping a neighbor’s color, repeating until no more clashes.  Sometimes you have to give up on a re-coloring.  And which nodes should be re-colored first?

 

Conduct re-coloring as a transaction.  Mark nodes, change colors, etc.  If unresolvable, then roll back.  Also, do not recolor already-recolored nodes (prevents coloring cycles).  Also, when picking colors, try to pick colors not in any neighbor, etc.
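Roughly what I imagine the transaction looks like (my reconstruction, not the paper’s code; pickFreeColor is a hypothetical helper):

    import java.util.*;

    // Flip node n to color c; any neighbor now clashing gets recursively
    // re-colored; if we get stuck, the caller rolls back the undo log.
    int[][] adj;      // interference graph, adjacency lists
    int[]   color;    // current valid coloring
    boolean recolor( int n, int c, List<int[]> undo, Set<Integer> touched ) {
      if( !touched.add(n) ) return color[n] == c;  // never recolor a node twice
      undo.add( new int[]{ n, color[n] } );
      color[n] = c;
      for( int nbr : adj[n] ) {
        if( color[nbr] != c ) continue;            // no clash with this neighbor
        int c2 = pickFreeColor( nbr, touched );    // hypothetical: a color unused by nbr's neighbors
        if( c2 < 0 || !recolor( nbr, c2, undo, touched ) ) return false;
      }
      return true;
    }
    void tryCoalesce( int a, int b ) {             // try to give copy-related a,b one color
      List<int[]> undo = new ArrayList<>();
      if( !recolor( a, color[b], undo, new HashSet<>() ) )
        for( int[] u : undo ) color[u[0]] = u[1];  // transaction failed: roll back
    }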

 

Takes all nodes in a single affinity group of the affinity graph (nodes joined by copies, split by self-interferences) and attempts to recolor all those nodes to the same color at once.  The split agrees up front which copies will not even be attempted to coalesce.

Not terribly fast, but not too bad: about N^1.2, and 1000msec for 1000 node IFG.  After coalescing, 8.4% of copy-cost is left (vs 6.7% left for PERFECT coalescing).  Overall execution speed is about 85-90% of no-coalescing.


Foundations of the C++ Concurrency Memory Model

 

Similar to JMM; if you avoid data races, then you get SC.  Explicitly: most compilers are now busted when updating bitfields. 

 

Provide atomic<T> types that allow concurrent access without introducing data races.  Atomics allow concurrent access and compiler optimizations on everything else.  ‘atomic’ ends up being similar to Java ‘volatile’ – but they couldn’t make the existing C++ volatile keyword do anything meaningful.

 

Looks like they are going to end up doing a good job; most of my earlier complaints appear to be answered.


Race Directed Random Testing of Concurrent Programs

 

Take a previously known tool for finding potential data races.  Then force scheduling to expose the races “for real”, and test to see if the race is harmful.  Works on large Java programs, finds real races with a low false-positive rate.  Reproducible: given the same random starting seed, execution is deterministic.

 

Stress Testing: repeated execution.  Fails: end up testing the same schedule over and over again.  Fails: on a crash, cannot repro.  Fails: crashes in different environments cannot be repro’d in-house.

 

So instead, force the schedule.  Pick a random thread, step 1 instruction.  Turns out to get much better randomized schedules (more schedules tried than normal stress testing).  But still does not get much schedule coverage (exponentially many schedules) – and so does not cover buggy schedules.
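Something like this toy stepper (my sketch, not their tool) captures the replay-from-a-seed idea:

    import java.util.*;
    import java.util.concurrent.*;

    // Instrumented threads call yieldPoint() before every shared access; a
    // driver with a seeded PRNG grants one step at a time, so the same seed
    // replays the same schedule.  Register all threads before drive().
    class RandomStepper {
      private final Random rnd;                    // fixed seed => deterministic replay
      private final List<SynchronousQueue<Boolean>> gates = new ArrayList<>();
      private final BitSet done = new BitSet();
      RandomStepper( long seed ) { rnd = new Random(seed); }
      synchronized int register() { gates.add(new SynchronousQueue<>()); return gates.size()-1; }
      void yieldPoint( int id ) throws InterruptedException { gates.get(id).take(); }
      synchronized void finished( int id ) { done.set(id); }
      private synchronized boolean allDone()      { return done.cardinality() == gates.size(); }
      private synchronized boolean isDone(int id) { return done.get(id); }
      void drive() throws InterruptedException {
        while( !allDone() ) {
          int id = rnd.nextInt( gates.size() );    // pick a random thread...
          if( !isDone(id) )                        // ...grant one step; time out if it
            gates.get(id).offer( Boolean.TRUE, 1, TimeUnit.MILLISECONDS ); // isn't at a yield point
        }
      }
    }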

 

So prioritize the search to expose bugs quickly.  Start with some existing technique to get a list of potential race conditions (the “false positives” are fine now – they just get tested and discarded).

 

Good news: very modest slow-downs; gives a reproducible race/crash with very high probability (reproducible, but perhaps the indicated original potential race isn’t a real race).  Found a bunch of new races in the standard JDK container libs, and in 300KLOC programs.


SharC – Checking C code for Locking Invariants

 

Programmer defines a high-level sharing strategy.  Example: this data is protected by that lock, etc.  SharC checks it soundly via static and dynamic checking.  Violations come with real data-race traces.

  • private
  • readonly (read by all, no writes except init)
  • locked(l) – only accessed under lock(l)
  • dynamic – either private or readonly
  • racy – benign race, no checking

 

Use these as type annotators: 

&nbsp; &nbsp; "int readonly *private X;"
  • private: only accssed by owning thread
  • readonly; target of X can only be read

 

No non-private ptrs to private objects.

 

SharC infers most annotations.

 

Some common strategies:

  • locked – counters, queues, caches
  • private->readonly (init privately, goes to read-only)
  • private-lock-private – an object moving from thread to thread

 

Key Insight: if you have the sole ptr to some object, you can change the sharing strategy for that object.  SharC checks this, and allows a ‘sharing cast’ to change the sharing strategy.

 

During runtime, SharC makes all the right guarantees with combos of ref-counting and runtime checks.  IF there is a race between 2 threads, SharC reports an error regardless of thread schedule.  Needs expensive checks for ‘dynamic’-typed objects.  Uses WPO to infer dynamic and private sharing modes.

 

Tried it on 6 small/med concurrent programs.  Added annotations, investigate errors; add more annotations, repeat until no more errors. Programs up to 350KLOC; needed 40 annotations/mods on 350KLOC. Runtime overhead varies from 2% to 14%.  Space overhead sometimes gets large (30%+).

 

Cliff Notes: Actually seems plausible that HotSpot could be annotated this way!


Automatic Volume Management for Programmable Microfluidics

 

Think “Lab on a Chip”.  Chip with micro-pumps, micro-channels, mixers, valves, sensors, heaters, reaction loops, etc.  Want to mix and match fluids for various kinds of sensors (e.g., a blood glucose meter, a hormone tester, or drug discovery).

 

‘Assays’ are life-science experiments.  They use these fluids, but are NOT “computing” with fluids – they’re doing Real Science/Chemistry.  Fluids are not computer variables; fluids can be destroyed; they have a fixed volume, and when a fluid is used it’s used up.

 

“Program” specifies relative volumes; but “execution” needs real volumes.  Sometimes a mix is needed for several things, so need to decide how much of each fluid is being used.  Have max capacity issues; have minimum-size issues (micro-pumps have a lower bound on how much they can pump).

 

Very cool problem; using compiler-like technology to solve.  Problem looks something like network-flow, but it’s actually harder (NP-hard). Able to throw various linear-programming solvers at the problem.


2008 PLDI Keynote

 

Al Aho – Quantum Computer Compilers

 

Lots of strange algorithmic implications; some very hard problems become asymptotically simpler (e.g. factorization goes from exponential to cubic). Right now QC’s are giant physics experiments, but progress is being made on many fronts.


Immix – A Mark-Region Collector

 

Blackburn, McKinley

 

Really nice presentation.

 

GC Fundamentals – the time/space tradeoff; Immix – improve this tradeoff curve.

  • Allocation (e.g. free-list, bump-ptr)
  • garbage identification (e.g. tracing, ref-counting)
  • reclamation (e.g. sweep-to-free, compact, evacuate)
  • 1960 – Mark-Sweep: free-list alloc + trace + sweep-to-free
  • 1967 – Mark-Compact: bump-alloc + trace + compact
  • 1970 – Semi-Space: bump-alloc + trace + evacuate

 

Look at these 3 algorithms for: mutator performance, GC time, min-heap.

 

They all have spots in the curve where they are weaker than one another.

 

Add a 4th reclamation strategy, sweep-to-region.  Mark-Region: bump + trace + sweep-to-region.

 

Split the heap into regions.  Objects cannot span regions.  After marking, any unmarked regions can be freed in bulk; see the sketch below.
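A toy of that core reclaim idea (mine – far simpler than real Immix blocks and lines; the region size is my pick):

    import java.util.*;

    static final int LOG_REGION = 15;               // 32KB regions (size is an assumption)
    BitSet regionLive = new BitSet();
    void mark( long objAddr ) {
      // ...mark the object itself, scan its fields, etc...
      regionLive.set( (int)(objAddr >>> LOG_REGION) );  // region holding a live object
    }
    void sweepRegions( long heapBase, long heapEnd, Deque<Long> freeRegions ) {
      for( long r = heapBase >>> LOG_REGION; r < (heapEnd >>> LOG_REGION); r++ )
        if( !regionLive.get((int)r) )
          freeRegions.add( r << LOG_REGION );       // whole region freed in bulk, no per-object sweep
      regionLive.clear();
    }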

 

Large regions: suffer from fragmentation, but cheap bump-ptr

 

Small regions: suffer meta-data overhead, and small-fragmentation

 

Carefully pick block sizes to match TLB boundaries and cache boundaries.  Fragmentation doesn’t hurt locality so much if you don’t span these boundaries.  Recycle partially freed blocks (from the last cycle) first before working out of new free blocks.  Recycle in address order (they explored other orders extensively).  This mimics the mutator’s normal allocation order – i.e., generally contiguous in address order.

 

Will also occasionally evacuate to get rid of really bad fragmentation (but this is a whole-heap remap?).  And it’s only done rarely.

 

Extensive testing vs 9 other GCs (on Dacapo, JBB2000, JVM98).  Meets or exceeds all other collectors (in single-gen and 2-gen settings).


Concurrent Real-Time Copying Garbage Collection

  • hard RT GC is gaining support
  • multi-core support is hard

 

This work: compacts; concurrent; lock-freedom; efficiency (actually, Azul does all these – lock-freedom is very specific: where the mutator interacts with the GC it does not need to lock; important for hard RT.)

 

For Azul – what happens on a reloc trap?  Somebody locks, so that the copy only happens once.

 

So this is simpler than Stopless – two schemes: “Chicken” – fast, but may not copy all objects, so allows some fragmentation; “Clover” – probabilistic, but guarantees objects get copied.

 

Chicken – abort the copy if a mutator writes during the copy.  Uses a Brooks-style forwarding ptr.  Tag the low bit during the copy.  The mutator will clear this bit atomically if it wants to write.  (Ugh, that’s a write-barrier!!!)

 

So write-barrier is wait-free, and has a fast-path/slow-path. In practice, only 1% of copies get aborted.
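My reconstruction of the protocol on the forwarding word (a sketch, not Bartok’s code):

    import java.util.concurrent.atomic.AtomicLong;

    class ChickenForward {
      static final long COPYING = 1L;             // tag in the low bit
      final AtomicLong fwd;                       // self-pointer, or new address once copied
      ChickenForward( long self ) { fwd = new AtomicLong(self); }
      // GC: tag the word to announce "copy in progress"
      boolean gcBeginCopy( long self ) { return fwd.compareAndSet( self, self|COPYING ); }
      // GC: install the new address -- fails if a mutator aborted the copy
      boolean gcEndCopy( long self, long newAddr ) { return fwd.compareAndSet( self|COPYING, newAddr ); }
      // Mutator write-barrier slow path: clear the tag, aborting the copy.
      // One CAS either way, so the mutator never waits on the GC.
      void writeBarrierSlowPath( long self ) { fwd.compareAndSet( self|COPYING, self ); }
    }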

 

Clover – probabilistic indication of a per-field copy.  Steal a bit-pattern, a random one.  CAS this bit-pattern down to indicate that the field is copied.  Chance of collision with a REAL bit pattern is 1/2^128 (128-bit CAS on Intel) – infinitesimal.  Then can make the algorithm correct-but-probabilistically-lock-free by detecting if the user is using your random R.

 

Both schemes – 20% throughput overhead (barrier costs).  Clover has more slowdown during the copy phase (3x slower); Chicken has no other slowdown.

 

Alas, only implemented in Bartok; no comparison to Metronome.

 

Cliff Notes – I like abort-copy-on-write as a way to avoid stalling the writing threads.  It’d be nice if we could be even cheaper on the mutator – something like: TLB-write-protect the whole page.  If a TLB write trap happens, record that a write happened somewhere on the page, unprotect the page and let the mutator proceed.  At the end of the copy, the GC thread can see if any writes happened.  Since we only copy objects from pages that are mostly empty, we’d only rarely have false aborts.


Limits of Parallel Marking

 

Did a study on the longest ref chain seen in SpecJVM98 heaps.  Very short for half the specs (max: 40).  Some chains around length 1200 for raytrace/mtrt and javac.  Measured max heap depth every 10,000 stores into the heap – so pretty darned fine-grained measurements.  No SpecJAppServer, no Dacapo.  Bummer.

 

Also compute “relative depth” – the ratio of the number of objects in the whole heap to the longest single chain.  Then can compute the worst-case speedup possible if the whole heap were filled with max-length chains.  This reports a guaranteed level of parallelism; real runs should do better because most chains are short.
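To make the bound concrete (my arithmetic, with made-up numbers):

    max guaranteed parallelism  =  (#objects in heap) / (longest chain)

so a heap of 10,000,000 objects whose worst chain is 1,200 deep still admits roughly 8,300-way marking parallelism even in the worst case; real heaps do better since most chains are short.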

 

Cliff Notes: Doesn’t know about Azul’s spec-marking patent and the idea of being able to mark any shape of heap in time proportional to heap-size-over-num-cpus.  Several researchers jumped up and pointed out that the speculative approach could theoretically bring in the whole heap (including all dead stuff), and I countered with the random-starting-point approach (that you’d get an interesting amount of dead stuff a vanishingly small percentage of the time) and we all admitted the analysis was over our collective heads.  There’s a PhD thesis in there somewhere.


The Closer: Automating Resource Management in Java

 

Resource Management in Java (outside of GC)

 

Examples: socket open/close.  Or AddListener and RemoveListener.  Really, track any time you need matching create/delete calls.

 

Usual approach is manual; has usual problems (double-free, dangling).

 

Also finalizers – but you might run out of, e.g., file handles long before running out of heap (happens to us all the time!).  The error happens on a fresh open; instead we’d like to run a full GC cycle to reclaim file handles BEFORE the ‘open’ call fails.

 

Would like to dispose after it’s left ‘relevant’ use.  Listeners are called when an action happens – but if the listener itself won’t ever call something else (there’s no Observer on the Listener, but the Listener is attached to the Observed), then the listener is dead.  So define ‘interest reachability’.  Non-interest links keep you GC-alive; interest links keep your resource alive; call the dispose method when only non-interest-reachable.

 

Also want to rapidly call ‘dispose’ as soon as something is only non-interest-reachable.  This greatly reduces drag-time on large resources.  Sometimes can do it statically, but not always.

 

User provides: primitive resources and non-interest links.  CLOSER provides: higher-level resources built from auto-generated dispose calls.

 

Pretty standard flow analysis approach.  See if they can track a single ‘owner’ for a resource.  If so, then either statically dispose – or single-threaded-check dispose if the single owner ever created the resource.  If multiple owners then switch to ref-counting.

 

Worked great on a large Swing app (removed all existing manual resource management): had to annotate 5 resources; the other 62 resources were ‘auto-discovered’, and the auto-dispose code was almost completely equal to the manual version, except for fixing 1 bug.

 

Cliff Notes: I like this as an idea for simplifying higher-level resource mgt.  I’d rather hook into ‘GC’ somehow – but GC on ‘interest-links’ somehow.  Similar to how I’d like to see using compiler/data-flow opts on normal programs.  Maybe this work could be done with weak-refs and finalizers and standard GC?  Interest-links are strong links; other links are weak links, with finalizers for cleanup.
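A toy of that weak-ref idea (mine; it’s basically the java.lang.ref reference-queue/Cleaner pattern, where interest-links are plain strong refs and the framework keeps only a phantom ref plus a dispose action):

    import java.lang.ref.*;
    import java.util.*;

    class Closer {
      private final ReferenceQueue<Object> q = new ReferenceQueue<>();
      private final Map<Reference<?>,Runnable> actions = new IdentityHashMap<>();
      // dispose must capture the raw handle, NOT the resource object itself,
      // or the resource stays strongly reachable forever.
      synchronized void register( Object resource, Runnable dispose ) {
        actions.put( new PhantomReference<>(resource, q), dispose );
      }
      // Call from a background thread; GC enqueues a resource once nothing
      // "interested" (strongly) references it.
      synchronized void drain() {
        Reference<?> r;
        while( (r = q.poll()) != null ) {
          Runnable d = actions.remove(r);
          if( d != null ) d.run();   // only non-interest-reachable: dispose it
        }
      }
    }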


Path Specialization: Reducing Phased Execution Overheads

 

It’s “efficient” code cloning, to reduce GC barrier costs.

 

Have 1 set of code when GC is idea, and no barrier is needed.  Still need polling points, so GC can change phases.

 

Have another set for each different GC phase, with the different barriers optimized inline.  With a little bit of cleverness, can fold these versions mostly together and not clone quite so much code.

 

Also claim read barrier costs are typically much much higher than has been reported in the past (common paper is “Barriers: Friend or Foe” which reports 10-20% cost for simple read barriers).  However, e.g. Stopless has barriers costing closer to 60% during some short phases, but cost is back down during most of the GC cycle. 


Region-based Memory Management

 

Check for unsafe usage of Regions (ala HotSpot) on C programs of 100KLOC.  Found a dozen errors, and another dozen false positives.  Actually, quite nice – I wonder if it works on HS.


Efficient Dynamic Heap Allocation of Scratch-Pad Memory

 

Allocator targets memories less than 1MB in size, for embedded systems.  Scratch pads are common in small machines, but also on e.g. Cell.  Scratch is big enuf to be useful, small enuf to be fast.

 

But too much stuff can want to go in it; you get fragmentation and control problems.  Need to manage carefully.

  • manual: nightmare for programmers to manage
  • compiler/static: doesn’t make good use, no time sense
  • dynamic: still hard to use (e.g. have a fast heap and a slow heap)

 

Example: DL-malloc will use 512 bytes of state to manage a 4K scratch-pad.  1/8th of fast memory is gone already!

 

SMA – 40 bytes on a 4K scratch.  Also a lot less code, and faster.  Uses a fast, simple fat-block bitmap approach.  Can break big blocks up to avoid fragmentation.  Split blocks form a power-of-2 buddy system and are stored on fixed-size free-lists.

 

No boundary tags.  Instead, coalesce tags: the whole block region has a bitmap showing free chunks of size 2, and another bitmap for chunks of size 4, 8, etc.  Makes the coalesce operation very efficient (constant time per de-alloc, or maybe log-n per de-alloc to coalesce) – see the sketch below.
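My guess at the shape of the trick (a sketch; the real SMA details surely differ):

    // One bitmap per power-of-2 size class; bit i of class k covers the
    // chunk at word offset i<<k.  Freeing checks the buddy's bit and merges
    // upward, so coalescing is a few bit ops per level.
    class BuddyBitmap {
      final long[][] bits;
      BuddyBitmap( int words, int classes ) {
        bits = new long[classes][];
        for( int k = 0; k < classes; k++ )
          bits[k] = new long[((words >>> k) + 63) >>> 6];
      }
      void free( int wordOff, int k ) {
        while( k < bits.length-1 ) {
          int idx = wordOff >>> k;
          if( !test(bits[k], idx^1) ) { set(bits[k], idx); return; } // buddy busy: just mark free
          clear( bits[k], idx^1 );              // buddy also free: merge up one class
          wordOff &= ~((1 << (k+1)) - 1);       // align to the merged chunk
          k++;
        }
        set( bits[k], wordOff >>> k );
      }
      static boolean test ( long[] bm, int i ) { return (bm[i>>>6] & (1L<<(i&63))) != 0; }
      static void    set  ( long[] bm, int i ) { bm[i>>>6] |=  (1L<<(i&63)); }
      static void    clear( long[] bm, int i ) { bm[i>>>6] &= ~(1L<<(i&63)); }
    }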

 

Can defer coalescing and stick with the free-list for high-volume sizes.

 

Can update the bitmap with CAS for allocations that do not span a word of bitmap.  If spanning a word, will need several CAS’s.  If CAS fails, then roll-back earlier allocations and retry.

 

Much lower overhead than DL-malloc on small allocations.  Won’t scale well at all (part of the small-space-fast design trade-off).


Virtual-Memory-Compaction….

 

Double-map physical pages as a way to move objects physically without moving them virtually.  It’s memory compaction for C++!  If the page-offsets of 2 objects do not overlap, then you can copy one object from one phys-page to the other, and double-map both virtual addresses onto the same phys-page (and thus reclaim the copied-from phys-page).


Supporting Superpage Allocation

 

(avoiding fragmentation of physical memory)

 

There’s lots of prior work to avoid frag; none seems to work well. Define characteristics of pages:

  • movable memory,
  • reclaimable,
  • unmovable, or
  • temporary

 

Group pages together based on mobility.

 

Movable – e.g. an app page.  Copy the data to a new page, and change the app’s Virt-2-Phys mapping.  File-system cache pages are temporary.

 

Free-lists for each kind of memory.  Pre-split memory for each kind of memory.  Lots of details; he talks very very fast.  I can’t follow it or type that fast.

 

Turns out it’s working well enough, and will probably be in a future version of Red Hat.  No slow-down in page allocation (and often a speedup), and also able to allocate super-pages about 50% of the time.


MPADs

 

Auto-convert array-of-structs into struct-of-arrays for better cache locality when searching across the whole dataset.

 

Pointers-to-structs turn into ptrs-to-1st-field.  All 1st fields are now in an array, and the array is power-of-2 aligned.  The 1st field must also be a power-of-2 in size (and math is cheaper on the 1st field, so pick a hot field).  Then you can convert the ptr-to-1st-field address into the array index by AND-masking to the array start and doing a sub/shr.  Other fields can be found by adding different array bases and using the just-computed index.
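The pointer math, as I understand it (the alignment and field sizes below are my assumptions, not the paper’s):

    static final int LOG_ALIGN = 20;  // first-field array is 1MB-aligned (assumption)
    static final int LOG_F0    = 3;   // hot first field is 8 bytes (assumption)
    static long arrayBase( long p ) { return p & ~((1L<<LOG_ALIGN)-1); }      // AND-mask to array start
    static long index    ( long p ) { return (p - arrayBase(p)) >>> LOG_F0; } // the sub/shr
    // Any other field: same index into a parallel array with its own base and stride.
    static long fieldAddr( long fieldBase, long p, int log2FieldSize ) {
      return fieldBase + (index(p) << log2FieldSize);
    }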

 

Worked great on microbenchmarks; didn’t help SpecCpu … accidentally turned off hardware stream prefetching on Power chips.


Parallel GC with a Block-Structured Heap

 

Simon Marlow, Simon-Peyton-Jones

 

It’s a parallel GC for Haskell.  GC talks to the OS via ‘malloc’, not ‘mmap’.  Doing this for flexibility and portability.  NO address-space layout requirements.  Grabbing OS pages (4K chunks) at a shot. Similar to HotSpot TLAB – on 4k pages. 

 

Actually grabbing ‘megablocks’ – 2^M in size and 2^M-aligned.  The first 2^D bytes contain “block descriptors”; the rest of the 2^M holds 2^12-sized blocks.


Memory Management for Self-Adjusting Computation

 

Large data-set which changes slowly over time, e.g. robots interacting with a physical environment.

 

e.g. run the same program over and over again, where the input and output from run to run vary slowly over time.  Use change-propagation to make a meta-program which acts “as if” it ran the whole program from scratch – but it runs much faster: it assumes some large starting state and produces only the delta to the output state.

 

Benefits: much smaller in both time and space than running the same program over and over again.  Downside: more complex (but SMJ can auto-produce such a program from the original whole-run-at-a-shot program).  Downside: lousy GC behavior.

 

So: the result is independent of the input trace, except for performance.  Given a good input trace and a changed input, can compute a new whole output efficiently.  Given a lousy input trace, will slowly compute the new output as-if from scratch.

Has the classic bad allocation pattern: it’s basically a giant FIFO queue for the live dataset.  New allocations are long-lived; things that die are old things.


Combine Ref Counting with Functional Program

(Ref counting is not quite dead yet) – ref counting admits lots of interesting optimizations.  Cycles – 2 general approaches.  Brownbridge – maintain the invariant that any cycle has at least one ‘weak’ edge, and don’t walk weak edges.  Very complex, very hard to get right.  Or LMW – trial deletion to see if a cycle is busted.  Downside: very slow if there’s lots of cyclic live data.  Also you often have large chunks of data you know aren’t cyclic (shape of code) or aren’t dead (e.g. roots).

 

So pick these invariants:

  • no cycle is entirely strong
  • weak-in + strong-out ==> there-exists strong-in

 

Once the strong-count drops to zero, the whole cycle can be nuked.

 

Maintain invar#1

  • mutator makes refs in only 3 ways:
  • root->strong; constructor-arg->strong; ditto->weak

 

invar#2:

  • delete strong ref – check for weak-in, no strong-in, and strong-out

    • might delete the last strong-in
    • correct by making an outgoing strong edge into a weak one (and recursing)
    • No limit on the amount of work for weakening an edge

 

This approach is limited to pure functional programming – you can build cycles in constructors but NOT by set!’ing.

Independence Theory – edge coloring is independent of other optimizations and heuristics.  “Push-out” should be possible.  Typically for every node reclaimed, fewer than 2 are scanned.


A Study of Java Object Demographics

 

Current collectors tend to have a very simple classification scheme – short lived, long lived, immortal.  And then act accordingly: young-gen, promotion, old-gen decisions are all driven by the prior classification.

 

Real programs have a richer structure.  Objects from the libs, the app, and the JVM itself have different lifetimes.  Clusters of allocation sites are stable across program inputs; a small number of clusters dominate all allocations. 

 

Now look at individual allocation sites, and look at the lifetime of objects from those sites.  Use some cheap statistical test to see if 2 sites have similar average lifetimes.  Group all allocation sites with a greedy-algorithm based on closeness of lifetime patterns.  The top 10 of these groups, or clusters, will cover 90% of all allocation.

 

A small amount of inlining made each “hot” allocation site very deterministic.


Data-Race Detection via Linear Programming

 

Assume a ‘capability’ for each memory location.  You need the whole capability to write, or a tiny fraction of it to read.  Thus many readers are allowed, but only 1 writer (and no readers if any writer).  From this can define typing rules for the lambda calculus, and from that define some linear inequalities.  Can solve the system of linear inequalities with, e.g., GLPK or LPSOLVE.
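Roughly (my rendering of the idea, not the paper’s formalism), the constraints for a location x accessed by threads T1..Tn look like:

    0 <= cap(Ti,x) <= 1  for each Ti;    cap(T1,x) + ... + cap(Tn,x) <= 1
    Ti writes x  ==>  cap(Ti,x) = 1      (sole owner: no concurrent readers or writers)
    Ti reads  x  ==>  cap(Ti,x) >= eps   for some eps > 0

A feasible solution witnesses race-freedom; infeasibility flags a potential race.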

 

Now try to add locks to the lambda calculus; locks “store” capabilities.  Creating a lock also has to include some capabilities.  Can still get a real lambda calc, then linear inequalities.  Able to run this on some small pthreads/C programs and find some races.  Has the usual false-positive issues (lots reported; needs some expert extensions to his lambda-calculus formulation to remove the false positives).


“sketch” – A Tool for Writing Concurrent Data Structures

 

insight → implementation → validation; on a failed validation, fix either the insight or the implementation.

 

This tool fixes the implementation/validation cycle.  The user provides the insight, but the tool iterates the implementation-and-validation cycle.

 

You write a partial program and a spec; ‘sketcher’ fills in the rest of the program (exhaustive search) until it gets a program which meets the spec.  Your partial program includes ‘holes’ to be filled in, and asserts to be honored.  Plus you fill in a list of sample statements to be filled into the hole (but unordered statements, plus various reg-exp expansions – so pretty flexible about what can go in the ‘hole’).

 

Their gizmo needs a counter-example input on failure.  It eats the failing example to help drive a smarter next go-around.  Failing traces from the SPIN model-checker for one program can be used to exclude whole classes of other programs which would exhibit the same bug.

 

They use a constraint solver to produce new program attempts that avoid all prior buggy traces and fit the ‘holes’ given by the sketcher.


Velodrome

 

Freedom from race conditions is not sufficient; you must have good atomicity properties as well.  Can show some interleaved schedules are equivalent to some serialized execution.  Atomic code blocks are serializable.  Atomicity violations often reveal bugs.

 

Race-Freedom – as-if sequentially consistent

 

Atomicity – as-if each atomic block is executed serially

 

Want both properties.

 

Race-Freedom/Race-Detectors – e.g. Eraser (incomplete, false alarms) vs complete tools.  But atomicity checkers only come in the “false alarm” flavor.  There are a bunch of these kinds of atomicity checkers.

 

Velodrome – complete checker for atomicity – no false alarms.

 

Demos a non-standard sync idiom (Dekker’s algorithm).  Shows happens-before layered over program order, plus H-B on volatiles and shared variables.  Also, anything not in an XTN is treated as operating in its own 1-instruction XTN.  The execution is serializable IFF this H-B ordering has no cycles; Velodrome detects such cycles.

 

Now: make this scale to billions of XTNs.  Cannot keep the entire graph; too big.  Can use GC to remove XTNs – XTNs with no incoming H-B edges will never be part of a cycle.  Can’t get rid of XTNs that are in-progress.  Ref-counting works well (since no cycles!).

 

Must optimize unary XTNs (single-instruction XTNs).  If a unary XTN has no in-edges, then it will never have in-edges, so it can be GC’d immediately – so never even allocate a node for it in the graph.  If a unary XTN has exactly 1 in-edge, it can be merged into the prior XTN as well – where “1 in-edge” accounts for all the nodes smushed into the same larger XTN.  Must do this to keep costs reasonable.

 

Velodrome: part of the RoadRunner framework for Java; uses BCEL.  Can also stack other analyses in a pipeline on top of Velodrome (e.g. Atomizer and Eraser).  Velodrome causes a 13x slowdown.  H-B XTN graphs tend to be of size ~20 nodes!

 

Compared Velodrome to Atomizer on 14 multi-threaded Java programs.  Atomizer has a 1/3 false-warning rate (238 total warnings).  Velodrome gives 133 errors, no false alarms.  84 more reports are Atomizer false alarms, and 21 errors are only detected by Atomizer.

 

So – take Atomizer errors, and run with Velodrome but pause the executing thread that Atomizer complains about.  Now Velodrome usually catches a concrete witness of the broken thread scheduling. 


Inferring Locks for Atomic Sections

 

This talk: compiler support for atomic sections via pessimistic concurrency; i.e., implement the ‘atomic’ keyword with a plain lock.  Useful for XTNs with non-reversible ops (i.e., I/O or wait/notify).

 

Actually grabs fine-grained locks at the start of each atomic section, corresponding to all the memory locations touched inside the atomic region.  (How is this different from every other STM manager out there?)  The compiler framework is parameterized: can use any heap-shape analysis to guard chunks of heap with coarse-grained locks.  Also the coarse-grained locks are kept in a hierarchy (tree order); locks closer to the root cover everything under them.

 

Performance is better relative to TL2/STM on an 80%-puts R/B tree, and comparable at 80% reads.  Still need to figure out read/write locks, but then they think performance will always tie or beat a good STM.

 

Testing methodology is flawed, so hard to tell how well it really works.


Model Checking Concurrent Programs

 

Stress testing rarely finds all the bugs.  Poor selection of thread interleavings.

 

“CHESS” – Systematically enumerate thread interleavings.  Provide some qualified coverage guarantees.  Replace OS scheduler with a “demonic” scheduler.  Want to enumerate all paths in a state-space diagram, but can’t capture program states (too big).  Very effective for acyclic state spaces, but not for cyclic spaces.

 

Problem is to get interesting schedule coverage.  E.g. spin loops are very boring to spin at; and if you spin 100 vs 101 times – do you then explore the states following each separately?  Likely the states following the spin loop are the same for looping 100 vs 101 times.  Also can’t really detect live-locks very well – they just look like bigger, badder spin loops.

 

(Good) programs indicate their lack of progress – a thread scheduled infinitely often will yield infinitely often (via sleep, yield, thread-end, asm(“rep nop”), etc.).  A violation of this Good Samaritan assumption is a performance error.

 

Found lots of bugs (not presented in paper!)

 

Ran on Singularity, etc. – programs from 6K to 175K LOC.  Can boot and shut down Singularity under the model-checker – found 2 bugs.

 

Now in production use at Microsoft!


Expressive and Safe Static Reflection with MorphJ

 

So far – looks like a technique to avoid replicating synchronized wrapper classes (i.e., instead of SynchronizedList, SynchronizedSet, and SynchronizedMap classes, you can add sync to any prior class).

 

Adds syntax to allow iteration over reflective calls:

    for( R m(A) : T.methods ) ...

where T is an interface type and T.methods is the list of methods in T.  So for example you can declare:

    for( R m (A) : T.methods )
      public R m (A args) { sync(mutex) { me.m(A); } }

It’s an interesting syntactic extension for defining new classes of code subsumption and commoning… but probably not one that’s truly useful.  Much easier just to clone the code by hand.  Certainly there are enuf obvious extensions that it seems more broadly applicable.  Looks like a good language-hackers’ tool.  E.g., can replace reflection hacking + BCEL fairly easily; used MorphJ to replace Herlihy’s auto-STM tool.

// here is "for all fields with a getter..."
&nbsp; for( T f : X.fields ; some get#f : X.methods )

Liquid Types

 

Better static typing?  Can put restrictions on actual int values as ‘types’.  Such as ‘i is never 0, so division is ok’, or a type such as ‘i such that i < array.length’.  Problem with dependent types: they need huge type annotations.  Can’t we use inference?

 

Limited to logical qualifiers: “0<=v” or “v<len a” or “v<100”, plus conjunctions of them.  Type qualifiers limited to the usual, plus these logical quals.

 

Provable array safety; 1% annotation burden in some large code, 10% for a container library.  Seems to do a decent job of inferring proper types.  ML type inference, plus ‘dsolve’ – a linear constraint solver.  Caught an off-by-one in the bitvector library; the safety check for bitblit:

    if( c < 0 || 
        offset1 < 0 || offset2 < 0 || 
        offset1+c > length1 || offset2+c > length2 ) ....

The error: it should be “offset1+c >= length1”.


Type-Preserving Compilation for Large-Scale Optimizing Object-Oriented Compilers

 

Bartok

 

Start with TAL – type-safe asm code

 

The compiler is 200KLOC; does about 40 optimizations.  Generates code that is 2% slower than without TAL; the compiler is 40% slower.  But the generated code is typed.

 

Carrying type-proofs through all of the optimization steps.  Have a type-system for pass-by-ref values.  Supports most optimizations; does not support inlined allocation, or ABCD array-bounds-check elimination (ABCD is not safe in the presence of overflow).  The proof is generated by Bartok and carried along with the TAL.  A proof-checker can check the proof in linear time, long after the optimizer is gone.  Verification takes about 2.5% of compilation time.

 

Also checks that the GC info is correct, and that return addresses are immutable.

 

Not handling correctness proofs for concurrent programs.

 

Object sizes have grown a lot to hold type info – 2x more data overhead than plain optimized code (ala plain C).  Execution time is only a little slower (no inlining of allocation code); I assume allocation is considered trusted code (part of the TCB).


Execution Indexing Problem

 

Same program run twice with slightly different inputs, OR 2 versions of the same program, OR the same multi-threaded program with different schedules.

 

How do you represent an execution point across different executions?  How do you set breakpoints at the same execution point in 2 different executions?

 

Describe a grammar for the execution trace (remember Manish’s trace compression?).  Same game: it’s a grammar to describe execution.  Can compute these online, and use them to describe execution points.  Can ask for their ‘indices’ the same as asking for a call-stack.  Instrument at branches and post-dominators; counters for loops.


XMEM – Type-Safe Shared Memory for Cross-JVM Communication

 

Uses HotSpot!  Type-safe shared memory: a pool of shared memory. No shared => private-JVM ptrs allowed.  GC safe.  Direct access. Use it, e.g., for HTTP or RMI or CORBA communication between different JVMs physically located on the same box. 

 

Ala Azul’s DirectPath (but we don’t break OS protection).

 

Shared objects are not distinguishable from private objects.  Mapped to the same virtual address; shared objects have the same pointers in each JVM.  XMem uses a write-barrier to enforce no private ptrs into the shared heap.  Otherwise shared objects are exactly the same as private ones.
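The barrier is presumably something like this sketch (mine, not XMem’s code; inSharedHeap is a hypothetical VM-internal address-range check):

    // On every ref store: refuse to make a shared-heap object point at a
    // private-heap object.
    static void refStoreBarrier( Object base, Object val ) {
      if( inSharedHeap(base) && val != null && !inSharedHeap(val) )
        throw new IllegalStateException("private ptr stored into shared heap");
      // ...perform the actual store...
    }
    static boolean inSharedHeap( Object o ) { /* hypothetical address-range check */ return false; }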

 

Need a Local Class Table and a Global Class Table – so shared objects can have globally shared classes as well.  Shared objects then have 2 flavors of classes – the global version, and via an indirection, a private “real” local version.

 

Some special native/JNI calls to allocate objects into the shared memory; plus some convenience modes for making objects shared by default.  Can ‘clone’ objects into shared space.  Can also open sockets and read/write to get objects into shared space.

 

GC: extends card-marks to tell which private heaps point into the shared heap.  When the shared heap is collected, trigger a minor collection in each JVM to find roots into the shared heap.  Any tracing algorithm will now work; they did a parallel STW copy collector.  I assume one JVM did all the GC for the collection of the shared memory.

 

Support locks in the shared heap; must always inflate such locks – as they cross JVM boundaries. 

 

Overhead vs XMem-oblivious programs: 2% slowdown inside HotSpot for the extra work, lock inflation, etc.

 

But communication microbenchmarks show 20x to 200x speedups.

 

Also looked at TomCat and Hsqldb – using shared memory instead of serialization (XML) over sockets – often 1000x faster.

 

Cliff Notes: They lose OS safety protections.  A broken VM can corrupt the shared memory, thus also crashing an unrelated VM using the same shared memory.  Note the difference between 2 VMs talking via serialization over sockets: a corrupt VM will cause a Java level exception in the other VM, which might be able to correctly handle the crashing VM.  Not so here.


Yet Another sampling talk, where they show that a 1% sampling rate is good enough to track all sorts of memory-trace analyses.


RADAR – dataflow analysis for datarace detection

(solves a problem Java doesn’t have)


Postlude

I’m off onto a couple of weeks vacation.  No email!  I’ll be going through Internet Addiction Withdrawal.  (No cell phone either, but I’m a visual person and hate the phone anyways).

Cliff