FCRC 2007

I visited FCRC 2007 and all you got was this lousy blog….


The ‘F’ in ‘FCRC’ stands for ‘Federated’, as in ‘loads and loads of Computing Research Conferences’.  2000 Computer Scientists under one roof!  The very idea is… Nerdy.  You’re allowed to surf between them (the conferences, not the scientists), so it’s a chance for me to visit some conferences I’d otherwise never visit.  The actual trip was very un-inspiring (San Diego is so…. mild) – good for mental focus, I suppose.


Here are some short writeups on some of the talks I visited (I walked away with many many pages of hand-written notes).  These are basically very short (brutal) summaries of the talks I attended, with a heavy Cliff slant.  Don’t expect good English, or full sentences, or even full ideas.  YMMV, Caveat Emptor, you get what you pay for, etc….


After FCRC I’m off to Barcelona for a touristic voyeurism weekend-made-for-2, while the kids have a blast at Grandpa’s house.  Oh, and I might drop in on TSS JSE and give a talk or two.  :0)





Yakking with Kathryn McKinley about TRIPS.  TRIPS is a super-duper dataflow/VLIW thingy aimed at speeding up single-thread performance with 16 semi-tightly-coupled data-flow CPUs.  She claims they’re getting close to their performance goals (or maybe: she sees the way to get there, but there’s more compiler work to go).  They have functional hardware and apparently it’s working well.



Race Detection

— Testing for Concurrent Bugs


Example: throw 100 threads at a concurrent queue, run for days, pepper code with sleeps(), etc.


Run the code repeatedly on a specialized scheduler.
Use a model-checker to avoid redundancy.
Check for assertions & deadlocks on every run.
Exhaustively try all thread interleavings.


Bound the search space by trying minimal preemptions (it’s OK if a thread voluntarily yields).  i.e., explore the space with ‘c’ preemptions till exhausted, then try ‘c+1’ preemptions.  The allowed state space becomes polynomial in ‘c’ instead of exponential.  After choosing preemption points, the scheduler gets large blocks to choose all possible interleavings = O((n*n*k)^c * n!) – much better than before.
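A toy sketch of the bounded-preemption idea (my own illustration, not the authors’ tool): enumerate thread-step interleavings, charging a “preemption” only when the scheduler switches away from a thread that could still run, and iteratively deepen the bound:

```python
def schedules(counts, c, cur=None, used=0):
    """Yield thread-id sequences covering `counts` remaining steps per thread,
    using at most `c` preemptions (a switch away from a still-runnable thread)."""
    if sum(counts) == 0:
        yield []
        return
    for t in range(len(counts)):
        if counts[t] == 0:
            continue
        # Switching away from a thread that could still run costs a preemption;
        # picking up after a voluntary yield (thread finished) is free.
        cost = 1 if (cur is not None and t != cur and counts[cur] > 0) else 0
        if used + cost > c:
            continue
        rest = list(counts)
        rest[t] -= 1
        for tail in schedules(rest, c, t, used + cost):
            yield [t] + tail

# Iterative deepening: exhaust the c-preemption space, then try c+1.
for c in range(3):
    print(c, sum(1 for _ in schedules([2, 2], c)))   # 0 2 / 1 4 / 2 6
```

With two threads of two steps each, c=0 allows only the two serial orders; raising c admits progressively more interleavings until all 6 are covered.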


Able to find errors with small number of preemptions (<= 3 commonly). Some programs need lots more preemptions, but most get lots of bugs in 2-3 preemptions.  Also means coverage time is vastly improved (most bugs found almost immediately, but complete coverage still takes a long time).


— CheckFence: Checking Consistency of Concurrent Data Types on Relaxed Memory Models
Milo Martin, R Alur, S Burckhardt


Multi-threaded software; lock-free sync (intentional races).
Shared-Memory Multi with relaxed memory model.
Concurrent Execution, but not SC!


Lock-free datatypes: good for users (no locks so no deadlocks, fast, easy).
Hard to design: need fences in different places for different hardware.


Formal community: client program may be MLOC’s, but the ADT is generally tiny.


So this sits at the intersection of hardware architecture, computer-aided verification, and theorem proving.


Ex: enqueue, dequeue: no locks.  Shows the M&S non-blocking Q.


Correctness condition: there exists a Witness Interleaving for the observed result.  Bounded Model Checker: user-provided Memory-Model-Axioms, symbolic test.  Checker reports: pass (all executions have a serial interleaving), fail (gives a counter-example), or time-out (maybe the user provided a Turing Machine as input, so nothing can be proved).
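A minimal sketch of the witness-interleaving check (my own toy, using a plain FIFO queue as the ADT): brute-force the serial interleavings of the per-thread op sequences and ask whether any of them reproduces the observed dequeue values:

```python
from collections import deque

def has_witness(ops_by_thread, observed):
    """True if some interleaving of the per-thread op sequences, run serially
    on a FIFO queue, yields the observed dequeue results."""
    def interleavings(seqs):
        if all(not s for s in seqs):
            yield []
            return
        for i, s in enumerate(seqs):
            if s:
                rest = [list(x) for x in seqs]
                op = rest[i].pop(0)
                for tail in interleavings(rest):
                    yield [op] + tail
    for order in interleavings([list(s) for s in ops_by_thread]):
        q, got = deque(), []
        for kind, val in order:
            if kind == "enq":
                q.append(val)
            else:  # deq
                got.append(q.popleft() if q else None)
        if got == observed:
            return True   # found a witness interleaving
    return False

t1 = [("enq", 1), ("enq", 2)]
t2 = [("deq", None)]
print(has_witness([t1, t2], [1]))   # True: deq can run after the first enq
print(has_witness([t1, t2], [2]))   # False: no serial order dequeues 2 first
```

A real checker does this symbolically under the user’s memory-model axioms rather than by enumeration, but the pass/fail criterion is the same.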


FAIL output: complete program output with full thread interleaving and memory access values (put & get values).  Seems fairly nice for small programs to understand the failure.


Cliff: Might be able to prove correctness of the NonBlockingHashMap?


— Not All DataRaces Are Harmful


61/68 true dataraces we found in IE were benign.


Record program exec; replay & detect dataraces.  Classify every datarace using replay analysis.  Replay tries alternative orders for races.  Recording: log loaded values (claims 1% hardware overhead); using software tool for 15x slowdown. 


Once the trace is done, can do analysis offline.  So found 1000 bugs in Vista using offline analysis.  Can replay the execution, including multi-threaded.  Record 6x slower (online).  All rest are offline: Replay is 10x slower.  Datarace detection is 45x slower. Classification (benign/dangerous) is 280x slower.


Replay tries alternative orders; if the live-out values are different classify the result as a “dangerous” datarace, else classify it as “benign”.  Did not try ALL alternative orders (exponential).
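The classification step can be sketched like this (a toy reconstruction, not the actual tool): re-execute the two racing accesses in both orders from the same state and compare live-out values:

```python
def classify_race(init, access_a, access_b):
    """Replay-style classification: execute the two racing accesses in both
    orders from the same initial state; differing live-out state => dangerous."""
    s1 = dict(init); access_a(s1); access_b(s1)
    s2 = dict(init); access_b(s2); access_a(s2)
    return "benign" if s1 == s2 else "dangerous"

# A racy statistics counter: both orders from the same start give the same
# final count, so the order-swap replay calls it benign (like the perf-counter
# races mentioned above).
bump = lambda s: s.__setitem__("hits", s["hits"] + 1)
print(classify_race({"hits": 0}, bump, bump))          # benign

# A read/write race where the order changes what the reader observes.
read_into = lambda s: s.__setitem__("seen", s["flag"])
set_flag  = lambda s: s.__setitem__("flag", True)
print(classify_race({"flag": False, "seen": None}, read_into, set_flag))  # dangerous
```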


Looked at 68 dataraces (16642 instances over 18 runs).  7 real datarace bugs found in heavily tested production code.  32 races correctly signaled as benign.  7/68 false positives.  Rest are questionably benign: e.g. tracking statistics; some lossage of perf counters is OK, but how much lossage is OK?


— Goldilocks: A Race and Transaction-Aware Java Runtime

Using Kaffe.
Detect DataRace’s – throw Java-level DataRaceException at runtime.
Can be caught (for optimistic execution), or otherwise program dies & programmer gets stack-traces.


Requires sound, precise, fast checking.
Current algorithms: lockset, vector-clock.


lockset-based race detection; exactly captures JMM.
Sound: detects all actual races
Precise: no false alarms
Uniformly handles all sync disciplines? (j.u.c.Locks? – yes, also wait/notify, etc)


Keep both thread-ids and locks in the lockset-per-variable.  A thread in the lockset can access the variable without a race; so can the holder of any lock in the lockset.
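A much-simplified sketch of that lockset rule (the real Goldilocks algorithm is lazy and event-list based; the `LocksetRace` class and its transfer rules here are my own boiled-down version):

```python
class LocksetRace:
    """Simplified Goldilocks-flavor locksets holding both thread ids and locks."""
    def __init__(self):
        self.ls = {}          # var -> set of thread-ids and lock names

    def access(self, t, var):
        owners = self.ls.get(var)
        race = owners is not None and t not in owners
        self.ls[var] = {t}    # after the access, only t may touch var race-free
        return race

    def release(self, t, lock):
        for owners in self.ls.values():
            if t in owners:
                owners.add(lock)   # later acquirers of `lock` are ordered after t

    def acquire(self, t, lock):
        for owners in self.ls.values():
            if lock in owners:
                owners.add(t)      # t is ordered after the last release of `lock`

d = LocksetRace()
d.access("T1", "x")                      # first access: no race
d.release("T1", "m")
d.acquire("T2", "m")
print(d.access("T2", "x"))               # False: ordered via lock m, no race
d2 = LocksetRace()
d2.access("T1", "x")
print(d2.access("T2", "x"))              # True: unsynchronized access is a race
```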


Means each variable needs a lock-set.  Ugh.  Need some kind of “default” lock-set notion, so that vars accessed by only a single thread cost nothing by default.  (Perhaps using thread-level TLB protections?)  But I can see why it handles all the sync disciplines.


Impl: global event list instead of per-var locksets.  Lazy eval of lock-sets.  Uses fast-path checks to avoid the lockset 30%-90% of the time.  Global event list: means all threads log their locking operations in-order into a big list (a scaling limit, but not too bad).


Also using some static analysis to avoid checks on some accesses.


Note: will detect races that could have happened on this execution (but hardware/OS may not have interleaved).  So stronger than detecting an actual race at runtime.


Goldilocks slowdown: 2x to 5x – on Kaffe (pure interpreter, so 10x slower already).  Sufficient for debugging(!?!?)


Can detect races in transactions (access both inside and outside transaction).


Transactional Memory

— Private discussion with Craig Zilles’s student (name?)

He’s looking at using HTM to accelerate straight-line code by allowing checks & mem-ops to happen in any order.  Exec a SPECULATE command, then start mem-ops early (better schedule), do correctness checks in any order; abort on any failure.  Commit at end.  I invited him to apply for an Azul academic account.


— SigTM

Looking at STM/HTM hybrids.  The cost of logging reads & writes varies in software; reads are more common and need just as much logging.  Commit is also expensive in software: loop to lock all; loop to verify all reads; loop to write-back data; loop to unlock.


Seeing slowdowns of single-threaded STM implementations of 50% to 7x. Throw hardware at STMRead & STMCommit.


Using read-set signatures to detect conflicts.  Hash each address read or written to some small set of bits, and set a bit in the signature. Conflicts are detected via bit-wise compare.  Because of aliasing of hash functions, you get some false conflicts – only a performance problem not a correctness problem: may do expensive full conflict detection sometimes; or simply blow up & restart transaction.
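The signature scheme is essentially a Bloom filter; here is a tiny sketch (the hash function and bit widths are made up for illustration):

```python
def signature(addrs, bits=64):
    """Bloom-filter-style signature: hash each address to one bit position."""
    sig = 0
    for a in addrs:
        sig |= 1 << (hash(a) % bits)
    return sig

def conflicts(read_sig, write_sig):
    # Any shared bit is a (possibly false) conflict; act conservatively.
    return (read_sig & write_sig) != 0

r = signature([0x1000, 0x1008, 0x2000])
w_hit  = signature([0x1008])     # overlaps the read-set: real conflict
w_miss = signature([0x9999])     # disjoint addresses
print(conflicts(r, w_hit))       # True: real conflicts are always detected
print(conflicts(r, w_miss))      # False here; an aliased bit would be a
                                 # false positive (hurts perf, not correctness)
```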


Can use hardware to do continuous read-set building & conflict detection.  Can use hardware to do outside-TM-read-detection which fixes the problem of non-atomic update (atomic between TM’s but not-atomic between a TM and a non-TM access).


Then do: load-exclusive all read lines, turn off external L1 cache access, (still validated in hardware); write-back all writes, turn on external L1 cache access.


SigTM needs 1K read-set signature bits, 128 bits for write-set.


— Performance Pathologies in HTM

Hard to compare different HTM systems; many different design decisions.  So built HTM system on common 32-core CMP model; eager-eager; eager-lazy; lazy-eager. 


Pathology: “starving elder”, “serialized commit”, “restart convoy”, “friendly fire”, etc…  a very funny nomenclature for very real problems.


LL – Lazy conflict detection, Lazy commit.  No conflict detected until somebody reaches the commit point.  Writes updates to log file, plays them back later: aborts are cheap (memory is “ok”) but commit is expensive (replay log).


“Starving Elder”: Elder thread T1 does an early load, then short T2 writes & commits – killing T1.  T1 tries again, but again T3 writes & commits first, and so on.  CAUSE: first-comer commits (not eldest first).


Fixed “restart convoy” problem with staggered restarts; get a 2x performance improvement on the LL system.


EL- Eager detect, lazy commit.  Subject to “friendly fire”- T1 starts; T2 does a store which kills T1; then T3 does a store which kills T2; then T4 does a store which kills T3, etc….  CAUSE: requestor wins. Can lead to live-lock.


EE – Eager detect, eager commit.  Aborts are slow (must roll-back memory).  On conflict, stall the requestor.  “Dueling Upgrades” – T1 loads A, then T2 loads A & wants to upgrade A.  Stalls T2…


— MetaTM & TxLinux

OS is a nice realistic workload to explore the usefulness of HTMs.  TxLinux – based on Linux 2.6.  Converted 30% of dynamic syncs into transactions (spinlocks, sequence-locks).  Hacked TXNs into a bunch of subsystems (socket locking, zone allocator, pathname translation).


MetaTM – a HTM model; X86 extensions run on simulation.  Ton of model features – Tx demarcation, multiple-tx management, contention management, backoff policy, etc.


Ex: interrupt handler: suspend the active TXN vs abort it?  Can the handler use TXNs?  Chose to suspend, not abort; allows more TXNs and longer ones.  Note: suspend is NOT the same as NESTED.  Turns out to be very useful to allow TXNs in interrupt handlers: maybe 1/3 of all TXNs are in interrupt handlers, 2/3rds in the main kernel, very little in apps.


Stack frames: allow TXNs to cross stack frames, same as locks can. Have to handle live-stack-overwrite; have to handle stack-unwind between TXN start/end; have to handle interrupts here.


Conflict policy of “time stamp” is pretty good, but “sizematters” is better.  I think this means that “oldest TXN wins” is good, but “largest current TXN” is better.  Keeps you from killing a TXN which is going to have trouble in any case.


Also, performance is tied to commit costs, not abort costs – and confirm decision to use eager-commit.


— Speculative Optimizations Using Hardware-Monitored Guarded Regions for JVMs

Sounds like using HTM to allow spec optimizations?   Ugh, horrible presentation, and using guarded exceptions instead of simpler stuff.


Mostly small speedups, except on ‘Euler’ (JavaGrande) – probably because this can only eliminate some key range-checks.



— Automatic Object Inlining (into HotSpot)

Co-location: objects next to each other
Inlining: replace field loads by address arithmetic
Fully automatic.


Using HotSpot -client.


Ex: Rect(Pt,Pt); memory layout:
Rect hdr
Pt-p0 hdr
x0 y0
Pt-p1 hdr
x1 y1


‘Fuse’ objects together: [Rect p0 p1 P0-hdr x0 y0 P1-hdr x1 y1]
Really guaranteed co-location


No global DFA, just use profiling; allows dynamic class loading.  Preconditions: parent & child “allocated together” (in same compilation unit+inlines?); field not modified later.  Also need copying GC & JIT of course – and deopt.


Use read-barriers to count field loads, to detect hot fields. 

Detect hot-field-loads in co-located objects. 
Do compiler analysis at allocation-time to verify the allocation invariant.
Cliff Note: must show ALL Rect constructors also produce new Points.
Cliff Note: maybe prove allocation-invariant when first compiling any
method of class Foo, by first C1 compiling all constructors of Foo.


Must track which methods depend on this, in case class-loading breaks things.


After hot-field-discovery, tell GC.  Cannot install compiled better code until GC reports that the invariant now holds.  GC uses a bit in header to recognize child-objects that are visited but not-yet-copied. Can take 2 GC cycles to straighten out all pointers.


Allocation: allocate whole structure in-1-pass, then fill in all headers and internal pointers. 


How to handle interpreter threads: alloc a Rect but not the Points, then a GC happens – then allocate the Pts – how to keep the objects together???  Maybe have a slow-path interpreter bit, which tells us to allocate more space?  Or blow out; the optimization fails until no more interpreter allocation code happens?


OK for several unrelated ptrs to a child-object, but child-object has only 1 parent.


Performance: 50% on 209_db, 10% on 227_mtrt.  No big benchmarks tried.


Chatted with the author (Christian Wimmer) – he’s graduating soon from Germany, going to UC San Diego for a post-doc (w/Michael Franz).  Not interested in industry BUT he did the NEW client compiler register allocator.


— Online Optimizations Using Hardware Performance Monitors


Optimizing irregular access patterns amongst small objects, i.e., pointer-chasing.


Using 209_db, noticing the layout, using HPM to prefetch.   oooollddd news….


Tie cache-misses to individual field accesses.  (same as Mississippi-delta?)  e.g., taking a perf-counter tick on a specific load address, so need mapping from loads & stores back to object+field in code.  Mapping tables taking 2x to 3x machine code size.  Measuring L1 misses (maybe should measure L2 misses???).  Overhead related to sampling frequency; keep freq low to keep overhead low.


Output is sorted list of misses:
1866 – Vector->elementData
1852 – String->value
1111 – Entry->items
853 – Database->index
  37 – Vector$1->this…


Now do object co-location at next GC cycle.  Now JIT can prefetch chain of pointers based on 1st pointer.  Tried on many programs (SpecJVM98, dacapo) but only helps 209_db.  They do mention prior art (including missi.delta).


— Making it Easy to Add a JIT
IBM (J9)


Incrementally extended interpreter. 
Looks like direct-call-threading; insert calls into the interpreter for various un-JIT-implemented bytecodes.  Insert JIT code otherwise.  Can maybe inline the interpreter-impl to avoid call overhead.


Slowly turns into a simple hot-trace JIT.  Nice talk, but it’s really a cleanup of some *really old* ideas


— Dynamic Compilation: The Benefits of Early Investing

JIT early, JIT often?


Current state-of-the-art: separate OS threads for compilation, running async in the background.  (Lightly ignores VolanoMark issues.)


Points out: on a multi-core or multi-processor, compilation is free.


Did try running Trade6: 6500 compilations


Looked at compiling at O1 instead of O0 first; hits steady-state sooner at expense of startup.  Look at forcing compilation utilization from 10% to 100% in 10% increments.  Always better to force more compilation utilization to hit steady-state perf sooner.


But significant pause-time issues for single-cpu machines.


Look at dual-cpu machine; tune compilation strategy to be more aggressive.  No gain for more compilation threads (beyond 1 at max-priority).


— A “Perfect” Efficient runtime Escape Analysis

Tracks “must” escape info – objects that are in fact touched by more than 1 thread (as opposed to Azul’s stack-allocation, which tracks when a general heap pointer points to an object).


10x slowdown; technique used to measure quality of other EA’s, as used by various client optimizations (e.g. lock-removal).


Track an object using its System.hashCode; also include the GC age in the header.  The combo of hashCode and GC age is unique (assuming hashCode is related to the original heap location – a la Jikes, but not HS).


Escaping location is method+bci – need to count a few levels of inlining, to at least capture “factory” notions.


Start by running several pre-existing EA’s, and intersecting the results.  These remaining locations represent places where objects are created that conservatively escape.


Now, gen thread-access-info for each touch on these objects.  Use a small direct-mapped cache to reduce data volume; if hit in the cache, no need to record more thread-touch-info.  If miss in the cache, write the evicted line to disk and then install the touch-info into the cache.  Thus can fairly efficiently track each load/store, and only on objects which *might* escape (based on several conservative EAs run ahead of time).  With a small cache (100 entries) the slowdown for this whole technique is only 10x.
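The direct-mapped filter might look like this sketch (class name, sizes, and the in-memory “log” standing in for disk are my own):

```python
class AccessCache:
    """Direct-mapped cache of (object, thread) touch records: a hit means the
    touch was seen recently and nothing is logged; a miss evicts to the log."""
    def __init__(self, size=100):
        self.size = size
        self.slots = [None] * size
        self.log = []                      # stands in for the on-disk log

    def record(self, obj_id, thread_id):
        key = (obj_id, thread_id)
        i = hash(key) % self.size
        if self.slots[i] == key:
            return                         # hit: already recorded, skip logging
        if self.slots[i] is not None:
            self.log.append(self.slots[i]) # miss: evict the old entry to the log
        self.slots[i] = key

c = AccessCache(size=4)
for _ in range(1000):
    c.record("obj1", "T1")                 # repeated touches all hit the cache
print(len(c.log))                          # 0: only evictions reach the log
```

This is why the overhead stays near 10x: the hot case (same thread touching the same object repeatedly) collapses to a single cache probe.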


Post-process the log data to see which objects *actually* escaped, and which locations represent a store which triggers a real escape.


Run SpecJVM98 (single-threaded benchmarks???) & JavaGrande.


Measure: speedup to removing unneeded fences, removable sync locks, measure thread-local heap allocation (for race detection: number of refs that can be removed from detection due to being single-threaded).


Ugh, all data presented on log scales; hard to understand results.


The Escape Analysis called two-phase gives the best results, compared to perfect results.  Lock removal: is the only time that two-phase does not come close to perfect (nothing else does either).


Register Allocation

— Optimistic Register Coalescing

wow… shades of HS register allocation.


Separate allocation classes for each instruction, handles register pairs in the same way.


— SSA register allocation

In SSA form, the live ranges are sub-trees of the dominance tree.  Can color greedily the subtrees; becomes a tree-scan (instead of linear-scan). 


Spilling is still the problem; pick your poison here.  “Farthest first” & “spill everywhere”.  (Cliff: Needs my Moto hack)


Cliff: it’s graph-coloring quality for near-linear-scan cost



— CGCExplorer – Provable Correct Concurrent Collectors
David Bacon, et al


Synthesizing Concurrent Algorithms
– Design tradeoffs: simplicity for performance, e.g. fine-grained coordination
– Result: suboptimal & buggy


Given coarse-grained specs, auto-generate fine-grained concurrency


Focus on simple concurrent collectors: Dijkstra’s algorithm family; concurrent tracing non-moving (admittedly a modest setting)


Put GC algo’s into a common framework skeleton plus parametric functions. i.e., defining “mutator step”, “tracing step”, etc different for each GC.


Allow human-insight to fill-in the parametric holes.
Machine explores the combinations, verifies correctness.
Result is a verified correct GC algorithm.


Experience: human filled in blocks, 1.6M algs handed to machine, 6 correct algorithms came back – took 10 iterations.

Very interesting….  Read the paper.  GC algorithms generated are still fairly simplistic, but clearly heading in the right direction.


— A General Framework for Certifying GC & Mutators

well-defined interface, no compromises
proven correct on 3 different concurrent collectors
Use a Hoare-style logic [SCAP, Feng06]
Treat entire GC heap as an ADT
implementation-independent GC interface:
mutator only touches heap thru well defined interface: read object, write object, alloc, etc
Mutators view of the world is very abstract (no ugly GC details)


Verified a series of well-known GCs.
Needs about 50 KLOC of high-level model-checker code???


— RAM Compression

Run dataflow analysis per variable; collect set of possible values. See if the set of states can fit in smaller bit-width; if so, replace the original variable with a compressed variable.  Need code to extract/inject between the compressed & uncompressed states.


Example: array of fcn-ptrs.  A fcn-ptr is 16-bits.  After DFA, discover 4 choices of functions for each array element.  Use 2 bits instead of 16 bits to represent.  Need a compression (scan) & decompression fcn (lookup) to move between the 2 representations.
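The fcn-ptr example can be sketched as follows (a toy: `compress_var` and the addresses are made up for illustration):

```python
def compress_var(values):
    """Given the DFA-computed value set of a variable, build compression
    (value -> small index) and decompression (index -> value) tables."""
    table = sorted(set(values))
    width = max(1, (len(table) - 1).bit_length())   # bits needed for the index
    inject = {v: i for i, v in enumerate(table)}    # compression: scan for index
    return width, inject, table                     # table is the decompression (lookup)

# 16-bit function pointers, but only 4 distinct targets after the DFA:
ptrs = [0x8000, 0x8124, 0x8300, 0x8124, 0x8412, 0x8000]
width, inject, table = compress_var(ptrs)
print(width)                                # 2: store 2 bits instead of 16
packed = [inject[p] for p in ptrs]          # compressed representation
assert [table[i] for i in packed] == ptrs   # decompression round-trips exactly
```

The 1-value case the talk mentions falls out naturally: the table has one entry, so the “variable” is a constant and constant propagation removes it entirely.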


Have to merge redundant compression tables, avoid tables when possible.  Need other compression (when only 1 bit needed- it’s a constant so use c-prop).  Getting good data shrinkage; but often add code and add cpu cycles.  So only compress some variables, based on usage and cost to compress/decompress functions.


— Hierarchical Real Time GC

Based on RTJ.


RTGC’s interfere with the mutator (Metronome: tight schedule of interruptions) or requiring mutator to yield (Henriksson).


Issue: amount of RTGC interference with mutators is sum of both real-time thread and non-RT thread work.  Amounts to a kind of priority-inversion: the low non-RT thread can cause the GC to do extra work, which in turn blocks the higher RT thread.


So go with a hierarchical heap: a non-RT heap and an RT heap.  Then can plan the RT heap & RTGC threads on their own, and allow the non-RT thread to suffer from the non-RT GC.  This lets the non-RT GC thread be throughput-focused, and the RT GC thread be latency-aware.


These heaps have no restrictions on pointers, just on policy.  They are only hints.  Use read-barriers to control cross-heap refs; freely allow cross-heaplet refs.  Give each heaplet its own GC threads.  For efficiency, use a heaplet hierarchy; up-refs are free.


But still need a global collector to handle cross-heaplet-cycles, but it runs very rarely with very low overhead (still not HRT?).


Seeing 15-25% improvement in peak pause times (from ~8msec to ~6msec).


Shows MMU numbers


VM Migration

— Nomad: OS-Bypass Networks for Migrating Virtual Machines

low-level VMM’s, aka Xen or VMware
idea: data communication takes place directly from user-land
high perf (>30 Gbps) w/low-latency (<2us).
InfiniBand, Myrinet, Quad….
Need virtualization features.
(InfiniBand has the LID to migrate, but also NIC state: Queue Pair (QP), Completion Queue (CQ), memory keys, etc… which cannot be migrated)


Lots on how to virtualize these resources, in particular cannot let NIC-specific bits “escape” into app, but must wrap them all – which implies app changes.


Time to suspend I/O – 10ms to 65ms, depending. 
Time to free resources – 20ms to 60ms.
Remote sync 15 to 40ms.
Typical max I/O downtime: 100ms
Bad effects when simultaneously migrating multiple VMs (sometimes needed because of tight coupling between VMs) – >250ms sometimes.


— Transparent wide-area VM migration

(Owner of T-Mobile.)  Infrastructure is *very* wide-area; distributed across geo- and political boundaries.


Example: Have some servers; running virtualization layer; want to move apps from server to server.  Why?  load-balance, reprovisioning, server evacuation, high-availability, etc.


WAN uses: disaster recovery, avoid network bottlenecks (ongoing DDOS attack), etc.  Typically does NOT work in a wide-area…


Need to make sure the filesystem remains accessible – easy in a LAN (using NAS).  Also, IP address changes in a WAN move & may break external (internet) connections.


NAS-based migration has problems: expensive, doesn’t reach outside LAN, etc.  Want to migrate local F/S somehow- avoid long-term performance degradation (as might happen if needing to access NAS storage remotely).


About 1-3sec migration in the LAN, about 70x slower in the WAN case.


Problem: migrating local storage.  Our approach: pre-copy, and record deltas.  Let VM keep running during the copy; F/S is big, copy takes long time.  Must let VM run (minimize downtime), so record deltas. Then stop VM, apply deltas at target, then restart the VM at target. Migration does take a long time, but downtime is minimized.


Migrating IP’s: use NAT & IP-tunneling to forward packets from old address to new.  New connections use dynamic DNS to find the new VM. Must keep the old XEN VM around to forward packets until old clients die off.  Good for short-lived connections.  Not ideal for long-lived connections, but tolerable in their environment.


Experiments running Apache & MySQL, migrating a running server around.  Using a 1Gig file system.  LAN speed is 100Mbps.  WAN speed is 5Mbps, 100ms.  Tried a bunch of experiments (web crawling, static web presentation, video streaming, etc). 


For streaming static web migration, migration takes ~10min, and about 1sec of measured disruption.  Network-bandwidth limited.


For dynamic web content, migration takes 4 min, and downtime is 3sec. Streaming video: built-in buffering in app mostly covered all migration.


For a web server with dynamic content & 250 clients, migration is over an hour, downtime is 68sec – but this is still vastly better than using scp or rsync.



— Exterminator
Gene Novark, Berger, Zorn


Builds on DieHard – memory safety for C/C++ programs.  Double/invalid free, uninitialized reads, dangling pointers, etc.  DieHard: randomly spread mallocs about with random padding, so memory errors cause probabilistic failure.  DieHard limitations: multiple errors eventually overwhelm it; cannot tolerate large overflows.


Exterminator!  Finds buffer overflows in *deterministic* programs.


Buffer overflow: allocate object too small (or copy too much?).
E.g. strcpy(new char[8],”goodbye cruel world”);


Fill free space with “canaries”: a known random value.  Label each allocated object with an ID.  On crash, look for canaries; look for object IDs near the death spot.  Re-run with a different heap layout.  Again, collect IDs near the death.  With more runs, the odds of fingering the wrong OID decrease exponentially – can rapidly name exactly which allocation is overflowing.
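A toy reconstruction of the canary/intersection idea (the layout, sizes, and the 0xAB canary value are my inventions):

```python
import random

CANARY = 0xAB

def run_once(sizes, culprit, smash_len):
    """One randomized-heap run: lay objects out in random order with random
    canary padding, overflow one object, and report the IDs whose trailing
    canary byte got smashed."""
    order = list(range(len(sizes)))
    random.shuffle(order)
    heap, end_of = [], {}
    for oid in order:
        heap += [CANARY] * random.randint(1, 4)    # random-sized padding
        heap += [0] * sizes[oid]
        end_of[oid] = len(heap)                    # index just past the object
    heap += [CANARY] * 4                           # trailing canaries
    for i in range(end_of[culprit], end_of[culprit] + smash_len):
        heap[i] = 0xFF                             # the buffer overflow
    return {oid for oid, e in end_of.items() if heap[e] != CANARY}

# Intersect suspect sets over several randomized runs; only the real
# culprit's trailing canary is smashed in every layout.
suspects = None
for _ in range(8):
    found = run_once([8, 8, 8], culprit=1, smash_len=2)
    suspects = found if suspects is None else suspects & found
print(suspects)      # {1}
```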


Dangling pointers: same object getting corrupted each time.  After a few runs, odds of confusing dangling-ptr-error and buffer-overflow gets exponentially small. 


Now make runtime patches to allocator to correct the errors: make buffer allocations bigger; delay de-allocation until the dangling ptr is over.


Overhead on SpecInt is 25% overhead on geomean.
Able to find & fix bugs in Mozilla exploit & in a small web-server.


— Shadow Every Byte of Program Memory: ValGrind

Tools that shadow every byte of memory with another value that describes it.  Lots of tools that shadow memory (gives big list).  Can find dataraces, runtime type errors, leaks, invariants, etc.


Obvious performance issues; might end up doubling memory.  Might instrument all loads & stores.


Ex: Memcheck  (but ideas broadly apply)


Three kinds of info:  ‘A’ bit, addressability – 1 bit per byte to indicate the memory is addressable.  Set on malloc, cleared on free, checked on load & store.


‘V’ bits, in memory & registers; tells if bit is defined. Undefined on malloc; defined after init, checked on branches but allowed to copy uninitialized around.


‘H’ bits – heap blocks, location, size, alloc info.


So every memory byte has 9 bits but 257 distinct states.
Slow simple implementation:
Break memory up into 64K chunks; a primary-mapping entry points to a 2nd-ary map.  There’s a single no-access map to indicate all bits are not addressable.  To get at the A&V bits for a byte, use the top bits to index into the PM table, then the low bits to index into the 2nd-ary map.  Unused program space uses the no-access page for compression.  Multi-byte accesses just do repeated ops per byte.  Have some bulk routines to check/set large blocks.  Slowdown: 200x.
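The two-level lookup, sketched for just the ‘A’ bits (a simplification; Memcheck’s real maps also carry the V bits):

```python
class ShadowMemory:
    """Two-level shadow memory: a primary map of 64K chunks, each pointing to a
    secondary map of per-byte A (addressable) bits; untouched chunks all share
    one conceptual no-access map for compression."""
    CHUNK = 1 << 16

    def __init__(self):
        self.primary = {}                  # chunk index -> secondary bit list

    def _secondary(self, addr, write):
        idx = addr >> 16                   # top bits index the primary map
        sec = self.primary.get(idx)
        if sec is None:
            if not write:
                return None                # shared no-access map: nothing there
            sec = self.primary[idx] = [False] * self.CHUNK
        return sec

    def set_addressable(self, addr, n, val=True):   # e.g. on malloc/free
        for a in range(addr, addr + n):
            self._secondary(a, write=True)[a & 0xFFFF] = val

    def check(self, addr):                          # on every load & store
        sec = self._secondary(addr, write=False)
        return sec is not None and sec[addr & 0xFFFF]

m = ShadowMemory()
m.set_addressable(0x12340, 8)           # "malloc" of 8 bytes
print(m.check(0x12344))                 # True: inside the allocation
print(m.check(0x12348))                 # False: one past the end
print(m.check(0xDEAD0000))              # False: whole chunk is no-access
```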


Corruption of shadow memory: possible with buggy program.  Tried using X86 segments, but not portable.  So keep original & shadow memory far apart and pray.  64-bit machines: huge address space; need 3- or 4-level table.  Tried extending the 2-level table by 8x to handle 32G.


4 optimizations:


multi-byte loads & stores common, and N separate accesses is silly (where N=2,4,8, etc).  Slow-down: 56x


Range-setting large areas is common, and the areas can be huge.  So vector-stream-set.  Also, replace whole 2nd-ary maps with the special no-access map.  Also, no-brainer to have whole 2nd-ary pages that are fully-defined (MMAP or code-space) or fully-undefined-accessible (large malloc).  Slow-down: 34.7x


SP updates very common; inc/dec often small & statically known. Specialized inlined unrolled versions of set-range.  Slowdown: 27x


Compressed V-bits: partially defined bytes are rare.  Use 2 bits, 4 states: no-access, defined, undefined, part-defined- go elsewhere for exact V-bits.  Memory usage is over 4x less.  Slowdown 23.4x.


User experience: zero-prep is a BIG win.  But most emails are about interpreting results or bugs, not performance.


Another implementation: “half-and-half” – split memory in half; keep shadow memory at a constant offset.  Easy+fast to access.  Better performance, less robust (many OSes cannot handle the very large mmap space).


— Ditto: Speeding up Runtime Data Structure Invariant Checks

Debugging: (gives scenario); suspect a hashCode invariant failure: modifying ‘cart’ contents sometimes changes its hashCode – which then causes the hashMap to ‘lose’ a cart.


incrementalize correctness checks.
Existing incrementalizers not automatic.
This work: bytecode-to-bytecode conversion.


(1) During the first run, construct a graph of the computation.  Store function calls & concrete inputs.
(2) Track changes to computation inputs
(3) Subsequent runs of the check only rerun the changed parts


(1) is easy; just abstractly run the program to build the construction graph.
(2) is harder: need to add write-barriers into the code to detect changes
(3) use the change data; assume optimistically on parts of the DS not yet checked that all is OK.  i.e., if it was correct (invariant check reported ‘ok’) before, assume correct now.


Downside: size of ‘computation graph’ is same size as original datastructure.  Upside: full invariant check in time proportional to the change delta.
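A tiny sketch of the incrementalization, using a “sorted list” invariant as the example (the data structure and the dirty-marking rule are my own; Ditto does this via bytecode rewriting):

```python
class IncrementalChecker:
    """Incrementalized invariant check over a sorted list: cache per-node
    verdicts, re-verify only the nodes marked dirty by write barriers."""
    def __init__(self, nodes):
        self.nodes = nodes          # values standing in for list nodes
        self.ok = {}                # node index -> cached verdict
        self.dirty = set(range(len(nodes)))
        self.checks_run = 0

    def write(self, i, value):      # the "write barrier"
        self.nodes[i] = value
        # The node and its predecessor both participate in the <= invariant.
        self.dirty.update({i, i - 1} & set(range(len(self.nodes))))

    def check_sorted(self):
        for i in self.dirty:        # rerun only the changed parts
            self.checks_run += 1
            self.ok[i] = (i + 1 >= len(self.nodes)
                          or self.nodes[i] <= self.nodes[i + 1])
        self.dirty.clear()
        return all(self.ok.values())

c = IncrementalChecker([1, 3, 5, 7])
print(c.check_sorted(), c.checks_run)   # True 4: first run checks everything
c.write(2, 4)
print(c.check_sorted(), c.checks_run)   # True 6: only nodes 1 and 2 re-checked
c.write(2, 100)
print(c.check_sorted())                 # False: 100 <= 7 fails
```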


— Some interesting talks on error finding & handling

Not much for me here; better error messages for C++ templates (think: STL template usage in production), Java generics, CAML, Haskell, etc.


— Practical Memory-Leak Detection using Guarded Flow Analysis

Look for leaks or double-frees in C – source-sink problem.  General resource management problem, not just memory (files, sockets, memory, etc).


Can show path from allocation to leak very concisely??? 


Using Value Flow Graph (near to SSA).  Guarded SSA – annotate data-flow edges with guards.  Then combine guards at CFG split points (merge points going backwards); if the guard at the alloc site is ‘true’ then memory is always freed going forward.


Build CFG; build VFG; compute the relevant slice, compute guards, guard reasoning (using SAT), then print error messages.


Give up if the alloc leaks into some aggregate or global structure.  Some decent heuristics for pruning errors (do not bother to report leaks in ‘main’, for instance).



— Optimistic Parallelism Requires Abstraction: Galois
Milind Kulkarni, Keshav Pingali, …


Parallel programming is hard; works for data-parallel.  Compilers not able to handle irregular apps.


Irregular apps have worklist-style parallelism.
Optimistic parallelization is crucial; can be hidden.
High-level app semantics can be used.


Cliff Notes: similar approach to running multiple iterations of a loop in parallel using TM to roll-back on conflicts.  Pass out jobs from the worklist in parallel; run each using TM (except 1st).  Commit them in-order.  No, it’s better than that: uses app-info to allow some conflicts.


His approach looks like app-aware STM.  All access to data-structures are through the Galois run-time.  Master thread runs the top “real” iteration; master thread launches helper threads to run optimistic iterations. 


Needs inverse-operators.  App code uses normal versions of code.
Runtime will detect violations, and use inverse-ops to roll-back.
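A sketch of the inverse-operator idea (my own toy; the real Galois runtime also does semantic commutativity checks between iterations):

```python
class SpeculativeSet:
    """Galois-style sketch: a speculative iteration goes through the runtime,
    which logs the semantic inverse of each operation so the iteration can be
    rolled back on conflict."""
    def __init__(self):
        self.items = set()
        self.undo = []                    # inverse ops for the open iteration

    def add(self, x):
        if x not in self.items:
            self.items.add(x)
            self.undo.append(lambda: self.items.discard(x))  # inverse of add

    def remove(self, x):
        if x in self.items:
            self.items.discard(x)
            self.undo.append(lambda: self.items.add(x))      # inverse of remove

    def commit(self):
        self.undo.clear()                 # iteration committed in order

    def abort(self):
        while self.undo:                  # conflict: replay inverses in reverse
            self.undo.pop()()

s = SpeculativeSet()
s.add(1); s.add(2); s.commit()            # iteration 1 commits
s.add(3); s.remove(1)                     # iteration 2 runs speculatively...
s.abort()                                 # ...and hits a conflict
print(sorted(s.items))                    # [1, 2]: back to the committed state
```

The app code just calls `add`/`remove`; the undo log lives entirely in the runtime, which is what lets the high-level semantics stay visible.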


I like this approach a lot: it exposes high-level app knowledge to the runtime, and is very close to exposing it to the compiler.


Related work:
Weihl, 1988
Rinard & Diniz 1996
Wu & Padua, 1998 – exploiting semantic properties of containers in compilation
Ni et al, 2007 – open nesting using abstract locks


— Software Behavior Oriented Parallelization
Xipeng Shen, Chen Ding


Possibly-Parallel-Regions – programmer marks region “BeginPPR/EndPPR”. Just hints, no harm to correctness.  Can handle 3rd-party libs, exceptional exits, etc.


So far this sounds like hints to an STM, used to run code in parallel as opposed to atomically.  They allow the master thread to jump about, in case a spec thread gets ahead of the master thread and is ready to commit itself.  This means the master thread & a spec thread might race to complete the same code, and thus the program is never SLOWER than just running the master thread alone.


They use compilers to look at the code and find memory ops they can privatize.  Using OS processes(!) for concurrent execution: this allows easy abort (kill the process) and strong isolation (page protection); they expect to optimize *very* coarse-grained parallelism in sequential code.


Able to get a 2x speedup on SpecInt gzip, parser, xlisp – using about 8 CPUs.  Getting good speedups on 4 & 8 CPU systems.  Idea: the parallelism is very coarse-grained (process-launch overhead) but runs very efficiently (no communication between processes till the end of the marked region).


Key: selecting regions that are coarse enough & have enough independent work.  Finding & fixing the dependencies that prevent parallelization (runtime will abort parallel execution & report cause).


— Effective Automatic Parallelization of Stencil Computations

Stencil: sweep through a large dataset; multiple time iterations; simple load-balanced schedule.  Tiling is essential to improve data locality.  Must deal with inter-tile dependencies, skewed iteration spaces, pipelined execution.


— Cache Replacement Policy
Mohamed Zahran


Look at integrated policy-shift decisions; inform both L1 & L2 at the same time.  Assume fixed block size; the caches communicate policy somehow.


(1) If an access hits MRU at L1, then tell L2 to update its LRU stack (filtered update of L2).
(2) Do not replace an LRU line at L2 that is also MRU in some L1.
(3) Replace the LRU line at L2 whose corresponding L1 block is in the same set as the incoming L2 line – avoids spurious set evicts.


Hope for a decrease in L1 misses-per-kilo-instruction.  Results discouraging.  But Spec2000 is not a big enough benchmark to study; it gets swallowed in the cache.  Working on larger benchmarks.  BUT… the “harmful scenarios” of LRU in L2 evicting a non-LRU in L1 are fairly rare; this optimization is optimizing the rare case.



Engineering a Hash Table

This blog is about some of the thinking that went into the “non-state-machine” parts of the Non-Blocking Hash Table.  As with all engineering, “it depends”.  What makes sense in one context can lose out in another situation.  Your Mileage May Vary, Caveat Emptor, etc.


Bruce Holt writes:

But two things give me the sh*ts!  Non-prime table size, and linear probing.


I see how linear probing benefits your algorithm, especially as having a link field would expand your state diagram quite a lot. I guess you must run at a pretty low load factor — I’m working on very memory-constrained devices and want to keep my load factor high.


But what really makes me nervous is the non-prime table size. That means that if you’re getting a lot of collisions to the same bucket and chaining a lot then you still will after a resize – the only thing that will improve is that if you’re lucky those collisions will now be going to two buckets instead of to one.


Ok, the 2 decisions I assumed would give me the most flak.  Let’s take on the link-field vs closed-table one. 


First up: memory footprint.  Assume a prime-sized table with a 75% load factor, where each entry in the table points to an Entry object.  Each Entry object has a 2-word object header plus the memoized hash code, the link field, and of course the key & value.  With HotSpot you’ll round up the object size to a 2-word boundary.  This means that for each Entry object you’ll use 2 (header) + 1 (hash) + 1 (link) + 2 (k/v) = 6 words.  Plus the table is 3/4 full and thus you need 4/3rds of a word per entry in the table itself – about 7.33 words per table entry.


For the closed power-of-2 table, I’m using a load factor of 25%.  I get to skip the object headers and link word.  So I pay (1 (hash) + 2 (k/v))*4 = 12 words per entry; for a load factor of 50% it’s 6 words per entry.  I waffled back and forth here quite a bit, eventually deciding that for desktop and server machines the larger footprint was worth the better reprobe rate.  But I can well imagine doing the reverse on an embedded device – in that case the 6 words per entry are LOWER than the 7.33 words per entry of the prime table.
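A back-of-envelope check of the arithmetic above; the 6-word Entry and the per-slot table overhead are the assumptions stated in the text, not measured numbers:

```java
// Words-per-entry for the two table layouts discussed above.
public class Footprint {
    // Chained prime-sized table: Entry object costs 2 (header) + 1 (hash)
    // + 1 (link) + 2 (key/value) = 6 words, plus 1/loadFactor table slots
    // (1 word each) per live entry.
    static double chainedWordsPerEntry(double loadFactor) {
        return 6 + 1.0 / loadFactor;
    }
    // Closed power-of-2 table: each slot holds hash + key + value = 3
    // words inline, and only loadFactor of the slots are live.
    static double closedWordsPerEntry(double loadFactor) {
        return 3 / loadFactor;
    }
    public static void main(String[] args) {
        System.out.println(chainedWordsPerEntry(0.75)); // about 7.33
        System.out.println(closedWordsPerEntry(0.25));  // 12.0
        System.out.println(closedWordsPerEntry(0.50));  // 6.0
    }
}
```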


Next: reprobing costs. Break up the space of all hash tables into 4 categories: cold tables, those with bad hash functions (including ones with the same value for all Keys – don’t laugh, I’ve debugged customer performance problems that stemmed from this!), large tables that don’t fit in cache, small tables that fit in cache. 

  • cold tables: Ignore ’em.  Premature optimization.
  • hot, but bad hash: No one is going to get expected-constant-time lookups here.  The link-field version turns into a gruesome pointer-chasing game.  The size-1 reprobe turns into a linear scan.  For specific applications, you can make sure that you Don’t Care about this option because you can make sure all your hot tables are well-behaved.  For a generic library that has to deal with all comers, including the losers, the linear scan is a better fall-back option.
  • hot, Big table: The cost of the cache-miss dwarfs the cost of the probe function, be it MOD or AND.  The cost of a miss & re-probe for the MOD is Yet-Another-Cache-Miss, but because MOD has good spread you rarely take this option.  Cost estimate: 1.01 (reprobe rate)*(300 (cache miss) + 30 (MOD) + 3 (load/cmp/branch)) = 336.  The cost of a stride-1 reprobe is a tiny cache-hitting load; you can afford to take lots of these.  Cost estimate: 300 (cache miss) + 1.2 (reprobe rate)*(1 (AND) + 3 (load/cmp/branch)) = 305.  The cache-miss dominates and the reprobe-function cost is ignorable.  I pulled the 1.2 number out of my hat; in the Misty Past I achieved this reprobe rate on a power-of-2 table (25% load factor) for all the strings in the source code of a large C program.  I totally SWAG’d the MOD reprobe rate.  YMMV here, but if you have a decent hash function at all then the MOD-vs-AND question is ignorable.
  • hot, small table: The MOD cost dominates: 1.01*(30 (MOD) + 3 (ld/cmp/br)) = 33 vs 1.2*(1 (AND) + 3 (ld/cmp/br)) = ~5.
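Here are the same cost estimates redone as code; every constant (the 300-cycle miss, the 30-cycle MOD, the reprobe rates) is the same guess as in the bullets above:

```java
// Cycle-count estimates for a lookup, per the categories above.
public class ReprobeCost {
    static double bigTableMod()   { return 1.01 * (300 + 30 + 3); }
    static double bigTableAnd()   { return 300 + 1.2 * (1 + 3); }
    static double smallTableMod() { return 1.01 * (30 + 3); }
    static double smallTableAnd() { return 1.2 * (1 + 3); }
    public static void main(String[] args) {
        System.out.println(bigTableMod());   // ~336: miss + MOD reprobes
        System.out.println(bigTableAnd());   // ~305: the miss dominates
        System.out.println(smallTableMod()); // ~33: MOD dominates
        System.out.println(smallTableAnd()); // ~5: nearly free
    }
}
```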


So either the MOD cost dominates or is ignorable (cold table or constant-hash or cache-miss dominates).


The reprobing argument also applies to the question of what happens when you get local hotspots in the table.  On average, such hotspots have not led to large reprobe counts (anecdotal evidence, my experience only), and a linear-reprobe can sustain lots of reprobes at low cost.  The idea of excessive reprobing leading to a table-grow is a new experiment I’m trying out here.  The problem here is that in a concurrent table computing the exact size is expensive and everybody has to agree exactly on when to regrow the table (lest somebody decide to grow the table instead of inserting in it and another thread decide to insert instead of grow).


Bruce Holt writes:

The thing is … with a hash table, the table stays the same size for quite a while, so what you’re taking the modulus with respect to stays the same for quite a while, so you can compute an integer reciprocal of the size when you resize the hash table, and then use the same reciprocal for a long time.  So for a hash code of H and a table size of N (with reciprocal R = 2^32/N) you’re looking at:


bucket = H – ((R*H)>>32)*N
bucket -= N if bucket >= N


OK, this is about ten times the cost of a simple AND, assuming you can grab just the upper half of a multiply, and cheap conditional execution (both of which I have on ARM).


What do you think?  Did you consider and reject this?

Nope!  This is an interesting idea as a way to get a cheaper MOD.  I have been using a different approach that gets me reprobe rates closer to MOD but with the cheap AND function: reprobe using a different stride for each key, something that’s relatively prime to a power of 2 (i.e., any odd value).  I use the original key OR’d with 1.  So if the first probe is at index 16, I’m reprobing with a stride of 17; if the first probe is at index 12345 I’m reprobing with a stride of 12345.  Same bad locality as MOD on a reprobe but waaay cheaper to compute.  Since every key gets a different reprobe chain you don’t get the local hotspots that a stride-1 probe does, and thus you get the better average reprobe rate.
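Here’s a sketch of the variant-stride reprobe (the names are mine, not the actual NonBlockingHashMap source): force the stride odd, and on a power-of-2 table the stride is relatively prime to the table size, so the probe chain visits every slot before repeating.

```java
// Variant-stride reprobe on a power-of-2 closed table.
public class VariantReprobe {
    // Return the first 'size' probe indices for this hash (size: power of 2).
    static int[] probeChain(int hash, int size) {
        int mask   = size - 1;
        int stride = hash | 1;        // force odd => coprime with size
        int[] chain = new int[size];
        int idx = hash & mask;
        for (int i = 0; i < size; i++) {
            chain[i] = idx;
            idx = (idx + stride) & mask;
        }
        return chain;
    }
    public static void main(String[] args) {
        boolean[] seen = new boolean[16];
        for (int idx : probeChain(12345, 16)) seen[idx] = true;
        boolean coversAll = true;
        for (boolean b : seen) coversAll &= b;
        System.out.println(coversAll);  // prints true
    }
}
```

Compare with a stride-1 probe, where every key hashing into the same neighborhood piles onto the same chain.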


Bruce Holt writes:

But .. you’re doing calls to Object.hashCode() before it, and Object.equals() after it so I’ve got to think the speed difference might not be that significant.

Object.hashCode() is well optimized in HotSpot (load the object header, some bit-fiddling & a branch).  I assume it’s cheap on IBM’s and BEA’s JVMs as well.  The equals() call is guarded as heavily as I can (must have equal hashes, unequal key pointers, etc.) but in the end everybody must call equals(), and this is generally a virtual call as well.  On Azul hardware v-calls cost roughly as much as an AND-LD/CMP/BR reprobe.  On other hardware v-calls cost roughly as much as a MOD-LD/CMP/BR reprobe.


My final bit of advice to you: since you’re looking at the embedded space, I assume you have a closed application set and a finite number of hash tables.  Inspect your hot hash tables for good hashes, and use the closed power-of-2 table with a 50% load factor and a more generous table-grow requirement, so your table tolerates longer reprobe chains.  The 50% load factor will use less memory than a prime-table-with-links, and with a good hash function you don’t need the better spread that MOD gives you.  If the tables are small use my variant-reprobe-stride trick; if the tables are large stick with stride-1.  And always, profile first!