Once Again I Put Down my Hacker’s Keyboard to Bring You This Trip Report.

Once Again I Put Down my Hacker’s Keyboard to Bring You This Trip Report… sigh, all good hacks must come to an end, or at least be put aside so Real Life can intervene.  I went to DC for the ASPLOS and VEE conferences and all you get is this lousy blog.

 

2009 ASPLOS & VEE Conferences



ASPLOS & VEE are co-located in D.C. this year.  Mostly the trip itself was uneventful.  I was very busy either attending the conference sessions or preparing for my own talk, and I didn’t get around much.  The flights themselves were mostly uneventful (ComAir’s HQ in the midwest lost power from a tornado, so I was stuck on the tarmac at JFK for an hour waiting to get permission to fly into DC – first time I’ve been delayed by bad weather when the weather between the source and destination airports was perfect).  The most notable point was that the hotel used WayPort for their internet access.  WayPort sucks horribly; I’ve used them in the past and swore I’d never use them again – but I forgot to ask when checking in to the hotel!  WayPort charges $13/night, and service is filtered through some horrible routing by way of TimBuk Too – average ping times to hit a server in the DC area were about 600 msec (ping times from AT&T wireless in the conference room were ~50 ms).  To add insult to injury, some other conference attendees stayed at a nearby hotel – which was cheaper, had free WiFi, AND the free stuff was faster!  Having WayPort as the WiFi provider is, by itself, enough to make me change hotels.

Cliff Notes: I had a wonderful 1.5-hour-long 1-on-1 talk with Monica Lam (http://en.wikipedia.org/wiki/Monica_S._Lam); she didn’t know what Azul does.
Cliff Notes: More interesting talks with Jon Shapiro (http://www.eros-os.com), who’s busy implementing a high-level low-level language for high-availability systems.

Reality Check: yes, once again a Mac failed to output to the projector, and we all waited while the presenter hot-swapped to a PC.

Caveat Emptor: As usual for these talks, I only attend some sessions, I write stream-of-consciousness reports which I only lightly edit and the reports tend to be brutal and short.  They are roughly sorted here by my interest level, and not by e.g. time order.

General Cool Stuff:
Keynote: Monica Lam’s Vision of the Future of Human / Computers Interaction
Getting Wrong Answers Without Doing Anything Wrong – all your base are belong to us
Text Processing in Parallel in a Register – wicked fast
Nanoscale Computing – wicked scary
Demystifying High-level Low-level Programming – how to write Java-in-Java
Asymmetric Chip Multiprocessor – 1st good use I’ve seen for these
Leak Pruning – Reported on before, but still good
TRIPS – Funky dataflow/VLIW hybrid (but not the weirdest I’ve recently seen!), too bad it didn’t pan out

HTM Talks
I read the papers, but due to a snafu I failed to attend the talks.   🙁
Maximum Benefit from Minimum HTM – not so minimum (bigger than Azul’s HTM!) and lousy results
Rock’s HTM – a peek at Sun’s experience with Rock’s HTM; mostly Rock’s HTM is really really weak

Multithreaded Bug Finders
CTrigger – The Winner in this section: cleverly induced sleeps trigger race bugs reliably
Anomaly-based Bug Detection & Prediction – It’s an, ahhhh, interesting hack
Isolator – silently prevent race bugs

Making Programs Deterministic
Cliff Notes: Major flaw in the thinking here: Real Programs have Real Random Inputs.  Unless you capture & replay ALL external events, minor things like network-packet-arrival-order and the chaotic butterfly effect (http://en.wikipedia.org/wiki/Butterfly_effect) ruin your determinism.  So I think all these papers are tilting at windmills.
Kendo – Stock hardware, but requires a race-free program
Deterministic Replay – Goes only 1/2-way to preventing the Butterfly Effect
DMP – Deterministic Shared Multiprocessing – No replay attempt, just full-speed determinism (assuming no Butterflies)

Other Stuff
Talks I attended but aren’t otherwise interested in…
Memory Buddies – Yet Another Use for Bloom Filters
Live VM Migration – pre-copy push still beats post-copy pull
Validate – Microsoft patches Are Bad ™
Raytracing vs GPUs – Doom!
Commutativity Analysis for Software Parallelization – the original work is WAY more interesting than this follow-on
Predicting Yield of GCs – Yea Olde Hat Trick
Phantom BTB – winner of the Most Unrealistic Hardware award



Keynote: Monica Lam

Why do computers break?  When we buy something, we expect it to work.  If it breaks, you buy another one.  But with computers they break on their own (e.g. u$oft update) or you can break it doing “normal” stuff – and if it breaks it doesn’t help to buy another one – the new one is broken already.

Virtual Desktop – idea: keep the virtual machine in the datacenter, and do remote login.  Does not work well – requires WiFi everywhere.

With each new comp generation, 10x cheaper & 10x more users.
Mainframe -> mini -> workstation -> PC -> laptop -> cell
Also see 3 industries: internet, media/TV, telephone
Also see 3 networks: cable/satellite, broadband/phone, wifi/wimax
Also see 3 computing models: cell (always on, pocket), PC (rich legacy, offline exec, locality of reference), Cloud (device independent, central management).  
Things like GoogleGears or Virtual Desktop are trying to split the Cloud/PC model.

Guess where we are going?

1- (Man) a phone-y thing: key, cache, window into digital cloud, ID, personality, assets, internet access.  Small, always on.  Can be lost easily & replaced easily.  No important state, but many Gigs of flash cache.  No real compute power.  Small visual display & keyboard.

2- (Earth) Can (1) plug into a PC.  Get big screen & keyboard, get fast processor.  Get hot DRAM, terabyte disk drive.

3- (Heaven) Backs onto the internet/cloud.

This model has some different security issues.  DataCenter is probably secure (or can be?)  You hold key in your hands.  Can we make path from hand-held-key to datacenter secure?

Trends – green, Bring-Your-Own-PC, Mac-vs-Windows.  End of Corporate PC model.  80% of people use personal PC for work; 45% of people use work PC for personal use.  25% of workers change security settings, etc.

SO… want: {User, Ubiquity, IT}  Separation of Concerns.

x86 virtualization separates: hardware from VM; hardware from OS; different hardware configs; different user roles

MokaFive company – Desktop-As-A-Service.  LivePC images “in the cloud”.  Managed on a central server.  Can go up to any Mac/PC and run “your desktop” from a USB stick.  USB stick/phone is a local cache only – backed up on the cloud.  Think of USB memory as a “network accelerator”.

Nice live demo.  Going from USB stick to a selection of PC desktops does not take very long (+2 mins for a new PC to install software from the stick).  No hotel wifi?  Claims to be sync’ing with cloud??? 

Nice integrity model: basically re-install the OS clean each time on a fresh VM-ware layer with anti-viral.  Intent is to clean out root-kits with each fresh login to machine. 

Graphics is hard: many games are wired to DirectX, so losing much 3D performance.  Trying to bypass all the virtualization layers for DirectX.  But… guest & host have different address spaces; but graphics has lots of pointers; fast pointer munging is hard.

Other services: LDAP, revocation, offline access, logging, DLP.  Encryption, storage virtualization: reju, incremental & atomic restore; secure access.  Uniquification, single-sign-on.

Separation-of-Concerns Principles – 90% of code handles exceptions (only 10% does the normal stuff).  “Re-Install the OS” as root-kit prevention; e.g. the system is structured as a fresh OS install because of the rare root-kit issue.

Bootstrap from a small trusted base: baremetal VM; LivePC player;  IT-managed base image (Windows+AntiVirus); finally user-apps & data

Like the Azul philosophy about GC – make the hard case (FullGC, infection from rootkit) common and then get it right – then all else follows.

Need to keep users’ ownership of their own data – avoid the Big Brother mentality.  Give users a choice on where data is kept.


Getting Wrong Answers Without Doing Anything Wrong

Variations in unix env variable size can cause 15% variation in performance!  Running gcc -O2 vs gcc -O3, 400.perlbench.  A bunch of runs at each setting is very repeatable per-run.  But with different env sizes they see huge variation on 400.perlbench & the ‘lbm’ benchmark.  

Conclusion: setting unix env variables can change your conclusions.

Also, sorting object files on the link-lines differently changes layout in memory, and in I-cache layout.

Conclusion: linking with ‘ld *.o’ is different from ‘ld a.o b.o’.

Bias holds true even when looking at a large benchmark suite, or across different Intel CPUs.

See a prior paper (which I now cannot find!!!) which studies this carefully.  Note that there are a bunch of JVM/Java benchmark issues here that also have been written up recently. 

Cliff Notes: I’ve dealt with both of these in the past; so has Mark Moir (Sun) and Bean Anderson (SGI).  I discovered the need for I-cache layout while working for Motorola on PPC chips (somebody else actually did the implementation).  The results gave us an average 10% performance boost on a bunch of benchmarks, plus removed a huge source of variation.  Later at Azul, with our shared low-associativity caches we discovered that stack placement was crucial to avoid conflicts, fixed with a little random offset to each stack base.

Cliff Notes: What are other large systemic biases?  Are these two it?  An obvious question for a JIT – should I start doing I-cache layout games for performance?  How about making copies of an existing hot method, running both old & new with sampling, and picking the winner?


Text Processing in Parallel in a Register 

Cliff Notes: this technique is insanely fast
Like SIMD, but char/bytes within a standard GPR.  So 4 or 8 chars in a GPR?

Doubling: field width from n-bit to 2n-bit 
Inductive: repeated doubling.

Parallel bit streams; 1-to-1 correspondence to byte stream.  Vastly faster XML parsing; better cache locality; better branch prediction.

So: take 1st bit from each byte, and jam 64 bytes (bit #0 only) into 1 GPR.  Take 2nd bit into 2nd GPR, etc.  Basically, load 64 bytes into 8 GPRs, but bit-striped.  Need architecture support for bit-striping pack/unpack.

Now break up desired operations (e.g., compare for lexical token boundaries) into bit-field operations.  Can do compares of 64 bytes for tokens by simple ops on 8 GPRs.

Cliff Notes: really lousy presentation.  Hope the paper is better. Guesswork: take this string “java/lang/Object(Crunkify)” and find the ‘(‘ tokens.

  GPR #0 – xxxxxxxx – bit 0 from all bytes
  GPR #1 – xxxxxxxx – bit 1 from all bytes
  GPR #2 – xxxxxxxx – bit 2 from all bytes
  GPR #3 – xxxxxxxx – bit 3 from all bytes

My turn for a lousy presentation: Look at the bits for ‘(‘.  Want some combo of 0’s & 1’s from the GPRs.  So e.g. XOR w/-1 all GPRs needing 0’s – now we need them as 1’s.  Then do 7 AND’s (parallel log-tree AND).  Any bits still set represent the ‘(‘s.  Nicely wide-issue, per 64 bytes.  Expect to compare 64 bytes for a single char in maybe 6-8 clocks!?!?!?!  Plus evil pack/unpack issues/costs.

Wrote some code to demo finding ‘(‘ in a long string using maybe 1/10th clock per character, plus whatever the packing costs are (which probably dominate!).  In this code, ‘b0’ through ‘b7’ are 64-bit values holding bits 0 (through 7) for 64 chars in turn.  Here’s the test-64-chars-for-paren check:
      // The open-paren char, '(', hex 0x28:
      //           0     0    1     0    1     0     0     0
      long res = ~b7 & ~b6 & b5 & ~b4 & b3 & ~b2 & ~b1 & ~b0;
      // A non-zero res means a '(' is found, and the bit-number is
      // the index into the 64-byte array where it was found.

See attached source code for TextScan.java


Nanoscale Computing 
128 bytes of RAM; 5 micrometers in area

Smaller version: 8 bytes of RAM; 1.5 micrometers in area; similar in size to the largest virus.  Chemical sensors report results by directly changing bits in the instruction stream (e.g., flipping a bit changes a JMP to a NOP, in effect making a conditional branch).  Can detect adequately running at 100Hz (not 100MHz).  Can write small simple test/loop codes in this form factor.  Can e.g. count molecule rates or virus rates in a solution (injected in bloodstream).  

Cliff Notes: Very Very Scary.  See Vernor Vinge’s dystopia “A Deepness in the Sky”. http://en.wikipedia.org/wiki/A_Deepness_in_the_Sky
http://en.wikipedia.org/wiki/Smartdust


Demystifying High-level Low-level Programming

Goal: Abstraction w/out guilt.  Separation of concerns.  Nice historical perspective.  Issues go back as far as 1975 – using C for writing an OS was a big issue back then.  So it’s time for the modern revolution of the same thing.  JVMs for OS’s.  

Various C support tools: coding styles; Purify; idioms; tools (Valgrind).  Also systems languages – BLISS, Modula-3, Cyclone, etc.  Or use 2 languages: JNI+Java.  JikesRVM extends Java; also OVM, Moxie, DRLVM, and C# (Bartok).  Usually in the context of runtime research.  This talk: Jikes, Java-in-Java.

Containment:
– minimize exposure to low-level coding
– extensible; requirements change fast, but the language changes slowly
Encapsulate: contained low-level semantics.
Fine-grained; separation of concerns.

“Address” is Magic, like unsafe.
“Address foo; int x = foo.loadInt(offset)”
More types for “void*” in Java, or generic OOP pointers.

Intrinsics: contract between you & the compiler; at the Java level it has an empty implementation.  But the compiler will always insert the correct code for the e.g. prefetch intrinsic.

Also annotations to e.g. turn off bounds-checks or null-checks (performance hints, but they break safety).

Scoped semantic regions – scoped rule changes, e.g. “NoNew” – no allocation in this scope (HotSpot uses this technique extensively), or “NoBoundsChecks” in this scope, etc.  Unboxed is another annotation – no class header, no v-table,  basically a C struct.  Also can do direct field layout.

@RawStorage – basic static C memory layout.  Backed by native width raw storage, only accessed by native-width C intrinsics.

Can both compile “fast” – down to raw bits, and “virtual” – to some virtual implementations allowing full GC on full simulation.
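
To make the “Address is Magic” idea concrete, here’s a minimal sketch of my own (not from the talk): at the Java level Address is an ordinary class – this is the “virtual” flavor, backed by a plain byte buffer standing in for raw memory – while a real VM build would have the compiler treat loadInt/storeInt as intrinsics and emit single load/store instructions.  Only the Address/loadInt names come from the talk’s example; everything else is illustrative.

// Sketch of a "magic" Address type: ordinary Java here (the "virtual" build,
// backed by a byte buffer standing in for raw memory); a real VM build would
// compile loadInt/storeInt down to single machine instructions.
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class Address {
  private static final ByteBuffer RAW =                 // fake "raw memory"
      ByteBuffer.allocate(1 << 20).order(ByteOrder.nativeOrder());
  private final int base;                               // fake machine address
  Address(int base) { this.base = base; }

  // In the "fast" build these would be intrinsics, not real method calls.
  int  loadInt (int offset)            { return RAW.getInt(base + offset); }
  void storeInt(int offset, int value) { RAW.putInt(base + offset, value); }

  public static void main(String[] args) {
    Address foo = new Address(0x100);
    foo.storeInt(8, 42);
    int x = foo.loadInt(8);                             // the talk's example
    System.out.println("x = " + x);
  }
}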


Asymmetric Chip Multiprocessor

Idea: instead of e.g. 16 cores on a die, have 12 for normal work and 1 “fat” core for hot contended locks.

Cross grid of threads vs open-tables in MySQL (looks like ref-counting of MySQL tables).  Blah – the implementation looks a lot like the NmethodBuckets in HotSpot.  2-d grid.  Updates are complex & require complex locking.

Look at some locking traces.  If we could accelerate the critical sections then we get an Amdahl’s Law win – it removes the serial portion.

Idea: switch to the one fast CPU on a contended lock.
Seeing speedups of 50% or so for a 12-16 cpu system, where the program was seeing e.g. 5x speedup and now gets e.g. 8x speedup.  One of the benefits: locked sections are always executed by the same CPU, so caches remain ‘hot’ for the hot-lock code.

Cliff Notes: This is the first real use of Asymmetric Multiprocessors that I believe in.


Leak Pruning 

See it & reported it before.  Survive OutOfMemory & leaks by reclaiming predicted dead objects.  Used a simple X86 read-barrier w/6% slowdown.  Hoping to see new slowdown numbers for the read-barrier: see 3% slowdown on Core2 & 5% on Pentium4.  Using 2 bits per reference; bit 0 is for stale-refs; bit 1 is for pruned refs.

So instead of throwing an OOM, look through the heap and see if we find a dominating object.  Assume the dominating object is dead & leaked; “cut” the link to the object and replace it with a poisoned pointer.  Touching this pointer throws the original OOM.  But if the object is never touched, then the OOM is never thrown.  Cutting the link allows the dead object (and its transitive closure) to be reclaimed.

Try to predict roots of dead datastructures.  So profile max time between last use (using something like GC age bits, but exponential counter bits).  Also need to track which refs are keeping most bytes alive.  

HashEntry -> PreparedStatement -> ParserInfo -> byte[][] -> byte[]
                               -> ResultSet -> Field[]

Compute during Phase 1:
Sometimes the HashTable grows, and PreparedStatement gets touched.  So max-stale-then-used for HashTable-> PreparedStatement is 16-32 GC cycles.  But max-stale-then-used for  PreparedStatement-> ParserInfo is 0-1 GC cycles.

Compute during Phase 2:
Count bytes kept alive for each type of thing that has a low MSTU number.  See PPS->ParserInfo has 420M kept alive.  So kills all edges from old PPS->ParserInfo (but not young ones, because MSTU is kept per object?  I now think he keeps MSTU per-class, so no bits in object but can suffer from common types like Object[] being used both in leaks and in hot code).  Don’t worry about DAG sharing – ends up blaming the leak on 1 ref when it’s really kept alive by several refs; just means on 1st go-around you prune one edge – but leak remains alive.  On 2nd go-around all the leak is blamed on object type#2 and pruned then.

Results are decent – keeping some complex real apps alive essentially forever.  Performance is decent, but near OOM conditions things slow down as first come frequent GCs, then LeakPruning kicks in and cleans the heap back out.

Cliff Notes: Wondering if could do read-barrier check on low bits by e.g. loading from the loaded pointer with an X86 op which faults on odd addresses?  Means faulting on stale & poisoned refs; faulting on poisoned is OK (means throwing OOM anyways), but faulting on stale might be too frequent still.
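
For my own notes, a toy sketch of the 2-bits-per-reference encoding described above, using a long as a stand-in for a 64-bit reference word: bit 0 marks a stale ref, bit 1 a pruned (poisoned) ref, and the read barrier strips the flags and throws the deferred OOM only when a pruned ref is actually touched.  All names and encoding details are mine, not the paper’s implementation.

// Toy model of the two-flag-bits-per-reference read barrier; a long stands
// in for a 64-bit reference word.  Bit 0 = stale, bit 1 = pruned/poisoned.
class RefBits {
  static final long STALE  = 1L << 0;
  static final long PRUNED = 1L << 1;

  // The read barrier: trap on pruned refs, otherwise hand back a clean ref.
  static long readBarrier(long ref) {
    if ((ref & PRUNED) != 0)          // the cut edge was touched after all
      throw new OutOfMemoryError("touched a pruned (leaked) reference");
    return ref & ~(STALE | PRUNED);   // plain address, flag bits cleared
  }

  public static void main(String[] args) {
    long live   = 0x1000L;            // ordinary reference
    long pruned = 0x2000L | PRUNED;   // an edge cut by the leak pruner
    System.out.println(Long.toHexString(readBarrier(live)));   // prints 1000
    try { readBarrier(pruned); }
    catch (OutOfMemoryError e) { System.out.println("OOM: " + e.getMessage()); }
  }
}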


TRIPS, McKinley et al 

TRIPs is an alternative processor architecture, with a square 4×4 grid of functional units (GPUs); registers are along one “wall”, D-cache is along another 2 sides, etc.  Inter-GPU traffic is explicitly scheduled, but with a data-flow-like flavor (static scheduling, dynamic timing).  I’ve reported on this project before.  This talk is essentially an end-of-project summary.  And the summary is: not much better than a Core2 (if the chip had been built with Core2 technology).

Each chip has 2 processors; 1M L2.  16GPUs, 128 registers, 32K L1 cache.  Processor is distributed.  4×4 grid of GPUs; registers along one side; D-cache along other sides. Window size of 1024 instructions.

Block atomic execution model; dataflow between blocks.  Blocks are single-entry, multi-exit.  1 non-speculative block; up to 7 speculative blocks.

Heavy use of predication to build larger blocks.  Direct instruction communication model.  Special instructions to handle large fan-out.  Large block of conditional/predicted stuff mapped onto the 4×4 grid of FPUs; using reads from registers and reads/writes to d-cache.

Larger blocks require more scheduling; harder to place.  What is the utilization of the large instruction window?  Both scheduling, plus e.g. cache-miss & branch mis-predict.  Static delay: scheduling issues; 1-clk to move data from FPU to FPU.

Benchmarks: 53 compiled benchmarks; 13 hand-optimized benchmarks; 2 hand-optimized and hand-scheduled.  Numbers are gathered from the actual chip, except where they had to use SIM numbers (because the chip didn’t gather the right data).

Generally, out of the possible 128 instructions in a block the compiler is able to get about 60 instructions.  Hand-optimized code uses smaller blocks because better scalar optimizations reduce instruction count.  
About 1/4 of all instructions are predicated & never executed.
About 1/4 of all cycles are lost due to scheduling.
About 1/2 of all cycles do useful work.

Very large code-bloat: 55% bloat from nops, or direct reg read/write ops, 18% from predicated useless, 27% from instruction-replication optimizations (loop unrolling, inlining).  4 SPEC benchmarks have i-cache miss rates > 10%.  Direct memory accesses are much lower (no spill ops?)

Need variable-sized blocks; would reduce the code bloat from 11x to 6x (over canonical Alpha code).

L1 is partitioned.  The compiler can achieve 2 mem-ops/cycle on vector add; hand-optimized can hit the max of 4 mem-ops/cycle.  Again for matrix-multiply the compiler hits 1 flop/cycle; hand-optimized hits 7 flops/cycle.  

Clock speed is 366Mhz; mem speed is 200Mhz; process tech is 130nm.  Compare to a Core2, but boost TRIPS “as if” it gets the 65nm process.  Compiled versions are competitive with Core2.  Hand-optimized is about 3x faster than Core2.

What works: dramatic reduction in data-cache & register file refs.  The compiler is able to work (to schedule to this weirdo thing).  But the performance win isn’t very big over a Core2.  Challenges: dynamic code size; dataflow overhead instructions; I-cache efficiency; predication overhead; generating efficient code for control-intensive codes.


Maximum Benefit from Minimum HTM

Running a (supposedly) simple HTM on a transaction-aware Linux kernel.  Their proposed HTM is somewhat stronger than Azul’s, in that they support parallel independent transactions (the main one, and ones run inside interrupt handlers), a special version of CAS, and graceful overflow to software.  Also they hold pending transactions in a much more generous L1 than Azul’s.

Results look pretty anemic and fragile, which is on par with Azul’s results (4x speedups on 16-way boxes; 60% of exec cycles wasted on retries).  On the plus side, they at least attempt a reasonable benchmark set.


The Rock’s HTM

A look at Sun’s Rock’s HTM.  It’s a lot weaker than Azul’s – max of 16 (or 32, depending on how many hardware threads you use) pending writes; aborts on function calls, TLB misses, mispredicted branches & some fp ops, – plus the same reasons Azul’s HTM can fail: eviction from the L1 (due to transaction being too large, or conflict from another CPU).

Major discovery (just like Azul discovered!): feedback is crucial to decent heuristics.  Your failed HTMs need to report as much as possible about what went wrong so the heuristic stands a chance of getting it right.  This required a hardware spin (Rock2).  One useful heuristic is the standard backoff mechanism: if there’s true contention, then backing off can spread the memory references far enough apart to allow the HTM to succeed (presumably succeed with less overhead than a simple spin-lock solution).  

They report hashtable numbers, but wildly concurrent hashtables are commonly available.  
They report using the Rock HTM for doing an N-CAS.  This is largely successful, and might simplify many libraries.


CTrigger – Atomicity Violation Bugs 

Software testing is inadequate.  Poor interleaving during standard stress testing.

CTrigger – for any given program input, force interesting interleavings.  There is an exponential interleaving space for each input, but stress testing is using random interleavings.   Previous work only controlled the context-switch interleavings.  CTrigger focuses on interleavings that likely hide atomicity bugs and rarely happen during normal testing.

Focus on interleavings that likely hide bugs: e.g. 2 nearby refs to the same memory in one thread – and another write to same ref in another thread.  Force these accesses to interleave by injecting “sleeps”.

Tried on 7 apps (server, client, scientific) on 8-core machine.  143 total Unserializable Interleavings (program interleavings deemed interesting from atomicity standpoint).  Stress testing only hits about 80 to 100 of the U.I., out of 143, even for many many runs.  Basically some (1/3rd) of the U.I.s are very unlikely to hit with stress testing – this demonstrates the key weakness of standard stress testing.

CTrigger looks at the access traces and attempts to force the interleaving.  Needs to trim out infeasible paths: locking forces many orderings to be impossible (i.e., the code is correct and no race happens).  Some refs are inside the same lock; some are Happens-Before on thread-create or thread-join.  Typically 90% of apparent bad interleavings are infeasible.

Focus on rare interleavings also (not-rare races happen easily in stress testing).  If the “local gap” is small – the time between 2 refs in the same thread is short – then the race window is narrow, and hard to hit in stress testing.  Also look at the distance in time from the external ref to the gap.  Sort these potential races by likelihood.  Instrument the program to add an artificial delay at a few points to try & force these rare races to happen often.  

Then run program as normal.  Program speed is similar to unmodified program, because artificial delays are small and few in number.  Program runs fine on multi-core machines; using the multi-cores as before.  But now the rare races are common, and stress testing finds them quickly.

Typically find bugs now 100x faster than before!!!  Found bugs in all these apps.  Was able to repro the breakage 100% of the time, generally very quickly.
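
To make the “small local gap” idea concrete, here’s a hand-rolled toy (mine, not from the paper): two nearby accesses to shared state in one thread, a racing write in another, and an injected sleep standing in for CTrigger’s instrumentation that widens the gap so the atomicity violation fires reliably instead of once in a blue moon.

// Toy "unserializable interleaving": check-then-use in one thread, a racing
// write in another.  The injected sleep (standing in for CTrigger's delay
// instrumentation) widens the tiny window between the two accesses.
class AtomicityViolation {
  static Integer cached;               // shared, unsynchronized on purpose

  static int readTwice() throws InterruptedException {
    if (cached != null) {              // access #1: check
      Thread.sleep(10);                // injected delay widens the local gap
      return cached;                   // access #2: use -- may NPE if nulled out
    }
    return -1;
  }

  public static void main(String[] args) throws Exception {
    cached = 42;
    Thread writer = new Thread(() -> { cached = null; });  // the racing write
    writer.start();
    try {
      System.out.println(readTwice());
    } catch (NullPointerException e) {
      System.out.println("atomicity violation triggered");
    }
    writer.join();
  }
}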


Anomaly-based Bug Detection & Prediction

Defect – the Bug
Infection – Defect is triggered ==> program state is corrupt
Failure – An observable error (Crash, hang, wrong result)

Infections spread over time, or get overwritten (hence bug is squashed in this case).

Software debugging: hypothesize the bug; change code; attempt to repro the bug, etc.

– detecting anomalies; out-of-bounds addresses, unusual control paths, page faults or redundant  computations.  So profile program to teach the detector what the “good” behavior is.  Some values are in fixed ranges during the teaching runs, or some mem locations only accessed from a few ld/st ops; or abnormal number of loop iterations.  Current crop of anomaly detectors have a fairly high false-positive rate.

– Now isolate relevant anomalies: compute dynamic forward slices from anomalies to crash point.  Ouch – this step looks really expensive.  Rerun the broken program, propagating tokens from anomaly site to crash site.  Use binary search on the anomaly set.  Can reduce the count of false positive anomalies by a factor of 10.

– Now “validate” that have found buggy line of code: NOP that line out and see that you can execute further.  

Still don’t know what the fix is, but can point out the step that e.g. corrupted memory.

Appears to scale up: on e.g. gcc (330 KLOC) it found a bug in the distributive-law transforms causing a loop in RTL tree optimization (C2 used to suffer these routinely).


Isolator

Isolation in concurrent programs: enforce isolation guarantees, i.e. catch accesses to critical-region data from non-locking threads (or from incorrect locking).  Bad threads can break well-behaved threads (the usual motivation talk for debugging data races; I wish people could skip this or at least limit it to 1 slide & 1 minute).  Also STMs only have weak atomicity, so bugs still happen for well-behaved STM threads.  Cliff Notes: This is the key: some other buggy thread breaks your well-behaved concurrent thread; studying the break point is useless.

Isolator: Program executes as-if no isolation violation.  Uses a combo of data-replication, virtual memory & code instrumentation.  Does not require the whole program.  Requires “guarded by” annotations.

Cliff Notes: just having thread-level TLB protection would let you do some COW-style (http://en.wikipedia.org/wiki/Copy-on-write) protections.  Just write-protect all pages touched during a locked region (except allow writes from self).  All other threads that attempt to write without holding the lock will get a TLB fault and can be stalled (and reported as racing).

Stalling can cause deadlocks, so need dynamic deadlock detector.  (Youch!)
Either allow racing access OR throw a DeadlockError.

Isolator does the above, except it requires annotations rather than lazily discovering the touched variables.  

Issues: expensive to change protection levels on every lock.  Fix: “lazy unprotect” – only remove protections on-demand, as other threads need them.  Could “shatter” big pages, or split unrelated objects to private pages (via GC).  
Cliff Notes: similar to HotSpot’s Biased Locking, but this wants per-thread page-protections; really want user-mode control over per-thread page protections.

Issues: page-size granularity – has fragmentation overhead for fine-grained locks.
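
Not what the paper does (Isolator uses data replication plus VM page protection), but a language-level stand-in for the same contract: every access to a guarded field asserts that the guarding lock is actually held, so the buggy thread that skips the lock gets caught at its own access instead of silently corrupting the well-behaved thread.  All names here are mine.

// Language-level stand-in for the "guarded by" contract: assert the guard
// lock is held on every access, so a lock-skipping thread is caught at the
// bad access rather than breaking the well-behaved thread later.
class GuardedCounter {
  private final Object lock = new Object();   // the "guarded by" lock
  private int value;                          // guarded by 'lock'

  private void checkGuard() {
    if (!Thread.holdsLock(lock))
      throw new IllegalMonitorStateException("access without holding guard lock");
  }

  void increment() {
    synchronized (lock) { checkGuard(); value++; }   // well-behaved path
  }

  int racyRead() {                            // buggy path: no lock taken
    checkGuard();                             // caught here, not later
    return value;
  }

  public static void main(String[] args) {
    GuardedCounter c = new GuardedCounter();
    c.increment();
    try { c.racyRead(); }
    catch (IllegalMonitorStateException e) { System.out.println("caught: " + e.getMessage()); }
  }
}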


Kendo

Only works for data-race free programs.  Very efficient deterministic interleaving.  Progress is made with “Deterministic logical clock” time.

Work on SPLASH benchmarks.  See 18% to 27% overhead, but deterministic always.  Requires no special hardware.  

Each thread maintains a “logical clock” similar to Azul’s COUNT register or the TSC on X86.  Could be e.g. a “bytecodes executed” counter.  Globally, the thread with the lowest “logical clock” has the right to do e.g. a lock acquire (a non-deterministic action).  Assumes all inter-thread communication happens nicely inside locks.

Enforces lock-acquire ordering, except for nested locks (have to prevent deadlock, since basically adding another locking layer).

Cliff Notes: Mostly interesting because this could work on stock hardware (but it is limited to race-free programs which are precisely those that you’d like this to work with).
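
A toy sketch of the Kendo idea as I understood it: each thread carries a deterministic logical clock (here just a count of work units completed), and a thread may only perform a lock acquire when its clock is the global minimum, ties broken by thread id.  Real Kendo derives the clock from deterministic performance counters and handles nested locks; this sketch assumes equal work per thread and all names are mine.

// Toy deterministic lock ordering: a thread may only acquire the lock when
// its logical clock is the global minimum (ties broken by thread id), so the
// acquisition order -- and the result -- is the same on every run.
import java.util.concurrent.atomic.AtomicLongArray;

class DetLock {
  static final int NTHREADS = 2;
  static final AtomicLongArray clocks = new AtomicLongArray(NTHREADS);
  static int shared = 0;                          // only touched under the lock

  static boolean isGlobalMin(int self) {
    long mine = clocks.get(self);
    for (int t = 0; t < NTHREADS; t++)            // lowest clock wins; ties go
      if (t != self && (clocks.get(t) < mine ||   // to the lower thread id
          (clocks.get(t) == mine && t < self))) return false;
    return true;
  }

  static void worker(int self) {
    for (int i = 0; i < 5; i++) {
      while (!isGlobalMin(self)) Thread.onSpinWait();  // wait for my turn
      synchronized (DetLock.class) { shared++; }       // deterministic acquire
      clocks.incrementAndGet(self);                    // tick my logical clock
    }                                  // equal work per thread keeps this toy
  }                                    // from starving the other thread

  public static void main(String[] args) throws Exception {
    Thread a = new Thread(() -> worker(0)), b = new Thread(() -> worker(1));
    a.start(); b.start(); a.join(); b.join();
    System.out.println("shared = " + shared);     // 10, same order every run
  }
}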


Deterministic Replay
“Time Travel” for concurrent debugging

Phase 1 – record non-deterministic events
Phase 2 – replay

Software versions have been around for ~15 years.  Now commercially available.  Also available are libraries, compilers & OS’s to help.  SW stack flaw: limited to speed of 1 cpu

How do you integrate HW & SW stacks?  i.e., If HW has captured a log, where does it go?  Do not want to capture or replay the SW stack that is doing the SW recording.

Also need the HW to be running across the whole machine, so that all processes are using it because interacting processes need to all be deterministic (see butterfly effect above).

So some abstraction: not all elements of the system are being recorded and can be replayed.  May need to simultaneously record multiple user-mode processes in the same recording “sphere”.  Only user-mode threads are being recorded (OS support is needed to virtualize the HW, but also the OS needs SW record/replay that is process-specific).  Really he’s making a weak stab at recording everything to avoid the butterfly effect.

Now record 2 different kinds of info:
HW info: memory interleave log
SW info: system stuff log; scheduling; context switch; I/O; signals

Replay also has to handle, e.g. malloc has to be deterministic.
Cliff Notes: but not GC, although HashCode has to be deterministic.

Key Issues:
1- Deterministic injection of data into controlled “sphere” (I/O)
2- Using fewer processors during replay than were used during recording
3- Emulating vs re-execute system call (e.g. fork requires a real system call) – but this makes a lot of OS state changes.

Key Issue #1 – Capturing ALL I/O
– Random OS calls write into user-space (i.e. setting return values in memory).  But during replay need e.g. the SAME “fd” as before.  But OS under no obligation to hand out the same “fd” at an open().  Or OS writes data into a buffer – BUT another user-thread is ALSO writing into the same buffer.  

Key Issue #2 – Not enough CPUs during replay
– I think he’s expecting HW to do the memory interleaving on replay!!  But basically, it’s time for thread#1 to run – but the OS wants to schedule thread#2 – just need to block #2 and allow #1 to  run.  Very slow (constant OS thread scheduling overhead).

Log size – in bits per kilo-instructions.  
Small: 3.15 bits/kilo-instruction.
Recording performance overhead: 20-50%.
Replay performance: using replay HW!?!?!  About 2x slowdown during replay.

How to handle large datasets?
 


DMP – Deterministic Shared Multiprocessing

Hard to debug, etc, etc…
So just make the whole thing deterministic.  No “Heisenbugs”; can use time-travel debugging.  Can test like it’s a serial program.  No attempt at record/replay really – just force same ordering every time.  Cliff Notes: Suffers butterfly effect.

Instructions communicate via shared memory.  Only care about “communicating” instructions.  Can serialize – but that blows performance. 

So really want to figure out which instructions really communicate. Requires HW support.  

Conceptually have a “deterministic token” which is handed out to processors and then is passed around all the CPUs in round-robin format.  This token totally orders all instructions.  Each “quantum” is roughly 1000 instructions, but varies.  Now filter out non-communicating instructions.  Allow the non-comm instructions to run in parallel.  Leverage cache coherency to track when cache lines move.  Build a “sharing table” to help order memory accesses.

Now add some TM support to help get more parallelism.
Now come some quantum-building optimizations: break at sync/lock boundaries.
Communication is often bursty: so break quantum after a series of stores to shared memory.
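
A toy software model of the deterministic-token idea (DMP does this in hardware with cache-coherence tracking; this is just the serialization skeleton, and every name is mine): threads run their non-communicating work in parallel, but any touch of shared memory waits for a token passed round-robin at quantum boundaries, so the communication order – and hence the result – is the same every run.

// Toy model of DMP's deterministic token: private work runs in parallel, but
// shared-memory ("communicating") accesses are serialized behind a token
// passed round-robin at quantum boundaries.
class DmpToken {
  static final int NTHREADS = 2;
  static volatile int tokenOwner = 0;      // whose turn it is to communicate
  static int shared = 0;                   // the "communicating" location

  static void runQuanta(int self, int quanta) {
    for (int q = 0; q < quanta; q++) {
      long local = 0;
      for (int i = 0; i < 1_000; i++) local += i;       // non-communicating
                                                        // work, fully parallel
      while (tokenOwner != self) Thread.onSpinWait();   // wait for the token
      shared += (int) (local % 7);                      // communicating access
      tokenOwner = (self + 1) % NTHREADS;               // pass the token on
    }
  }

  public static void main(String[] args) throws Exception {
    Thread a = new Thread(() -> runQuanta(0, 4));
    Thread b = new Thread(() -> runQuanta(1, 4));
    a.start(); b.start(); a.join(); b.join();
    System.out.println("shared = " + shared);  // same value, same order, every run
  }
}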


Memory Buddies

Sharing memory between VMware VMs(?)  Hypervisor finds identical pages and shares them under the hood.  (Probably good for e.g. shared libraries). See 1/3 reduction in memory usage in VMware ESX server. Get sharing only if you run similar apps on the same hardware.  Placement is hard in datacenter; lots of clients & software stacks; things change over time.  Manual placement is tough.

Goal: estimate sharing ability across many physical machines and many virtual machines across the datacenter.

Create a 32-bit hash per 4K page.  1MB of hashes per 1GB of RAM.  Forward to other hosts.  Comparing lists is slow, so use a Bloom filter.  With hash lists, just sort & compare.  With Bloom filters you use a dot-product of bits; very accurate in estimating the number of elements shared between Bloom filters. 

New VM launch – run on a staging host.  Figure out its memory footprint.  Compute its Bloom filter over the page hashes.  Compare its filter against the filters on all other hosts in the data center.  Find the closest fit (on a machine with enuf memory & cpu) and move the new VM there.
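
A back-of-the-envelope sketch of the scheme as described: hash each 4K page to 32 bits, fold the hashes into a Bloom filter per host, and estimate cross-host sharing from the bitwise-AND popcount (“dot product”) of two filters.  Filter size and hash choices below are arbitrary demo values, not the paper’s.

// Sketch of Memory Buddies' sharing estimate: 32-bit hash per 4K page,
// per-host Bloom filters, and an AND-popcount as the sharing "dot product".
import java.util.BitSet;
import java.util.zip.CRC32;

class MemoryBuddies {
  static final int FILTER_BITS = 1 << 16;        // arbitrary demo size

  static int pageHash(byte[] page) {             // 32-bit hash per 4K page
    CRC32 crc = new CRC32();
    crc.update(page);
    return (int) crc.getValue();
  }

  static BitSet toFilter(int[] pageHashes) {     // fold hashes into a Bloom filter
    BitSet f = new BitSet(FILTER_BITS);
    for (int h : pageHashes) {
      f.set((h & 0x7fffffff) % FILTER_BITS);                 // two cheap
      f.set(((h * 0x9E3779B9) & 0x7fffffff) % FILTER_BITS);  // derived hashes
    }
    return f;
  }

  static int sharingEstimate(BitSet a, BitSet b) {  // rough estimate: bits set in both
    BitSet both = (BitSet) a.clone();
    both.and(b);
    return both.cardinality();
  }

  public static void main(String[] args) {
    byte[] zeroPage = new byte[4096];             // e.g. a shared-library page
    byte[] onesPage = new byte[4096];
    java.util.Arrays.fill(onesPage, (byte) 1);
    BitSet vm1 = toFilter(new int[]{ pageHash(zeroPage), pageHash(onesPage) });
    BitSet vm2 = toFilter(new int[]{ pageHash(zeroPage) });   // shares one page
    System.out.println("overlap bits: " + sharingEstimate(vm1, vm2));
  }
}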


Live VM Migration

How to migrate VMs cheaper (force their free memory pools to shrink before migration; similar to having a JVM do a GC & heap-shrink before migration, then inflate back to normal size in the new host).

Contrast to the usual approach: 
– 1st iteration copy all
– later iterations copy any dirty pages
– after dirty workset is “small enuf”, stop VM, copy active dirty pages,
– restart on new host.
Goal: reduce downtime.
Con: Write-intensive apps have long migration times (because of constant page copying) and high network overhead.

Alternative: copy minimal CPU state & pages, start running on target.  VM demand-pages from source.  Also actively push pages from source to target.  Good: copy pages only once; lower total migration time & network overhead.  Works well for both read-intensive & write-intensive.  Con: resumed app suffers while pages demand-paged in.

Can make some predictions on future needed pages from apps page-faults – likely will want other pages nearby in virtual-memory-addresses soon.  Examples are walking an array or working a stack.

Make a pseudo-swap device in the guest OS for demand-pull pages (basically swapped over the network).  Allows the VM job to keep running in other threads or processes – only 1 thread is blocked on a page fault.  (The actual implementation has lots of warts due to the nature of the existing Xen infrastructure.)  e.g. we have a much more direct way to detect memory-full conditions and politely share memory between JVMs.

Results: feasible, but clearly more downtime over pre-copy in the common case.  But works well w/write-intensive VMs.


Validate

Patches are dangerous.  Delaying change is a safer bet even for security bugs: you’re more likely to get downtime from the patches than to get hacked while you wait.  


Raytracing vs GPUs

Raytracing is more expensive per-ray, but less expensive per-object.  Can be parallelized.  Makes really great shadows & reflections.  Want to improve predictability in ray-tracing to get “good enough” frame rates.

Trying to use SIMD.  Primary rays are parallel, easy to SIMD.
Secondary rays are incoherent, not good for SIMD.

Cliff Notes: no surprise but modern GPUs are not well suited to raytracing work


Commutativity Analysis for Software Parallelization: Letting Program Transformations See the Big Picture

Sources of Commutativity – e.g. sum, malloc

Can we detect it?  Check memory state before & after running fcn X.

But misses e.g. set’s (via linked-list) – linked-list order is different when swapping ‘put(A)’ and ‘put(B)’ calls, but semantically the same: both elements are in the set, but the set’s implementation stores them differently.

How to test for commutativity? Symbolically execute ‘put’ calls in different order. See if memory contents are same: “put(A);put(B)” vs “put(B);put(A)”.  If yes, then Yes – but suppose not.

Now check reader fcns: try is_member() (on what values???)  
Also check writers like ‘remove’.

Problem: exponential search space for equivalence.
Use Random Interpretation- 
– choose random input values
– interpret till conditional branch
– follow which way it goes; tests program
– Then back up to cond branch
– Adjust memory & values such that test fails
– Now follow thru on other branch; test program

Now combine memory states for each branch of the program, using an ‘affine join’ to combine memory states???  Somehow the probability of a correctness bug becomes exponentially low; it is very very unlikely the Random Interpretation ever computes a wrong answer (declares as commutative some function which is not really commutative).

Looking at MediaBench & SpecInt on an infinite issue machine.  See 20% to 30% of all functions are commutative (with themselves?).  Only see speedup when commutative fcns are called near each other, but then can see parallel speedup (log tree combining?).  Average is 50% speedup.  Apparently assuming an extremely wide machine, as opposed to threads (ignoring communication costs & thread start/stop).

Cliff Notes: Random Interpretation looks interesting; not sure about the rest of this paper.  In fact, this paper looks like a small delta (and very difficult to use correctly) over the R.I. work.  Instead see the 2003 paper: “Discovering Affine Equalities Using Random Interpretation” http://research.microsoft.com/en-us/um/people/sumitg/pubs/rand_popl03.pdf
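
A concrete version of the linked-list-set point above (my example, not the paper’s): put(A);put(B) and put(B);put(A) leave the list’s internal order different, so a naive memory-state comparison says “not commutative”, but every observer call returns the same answers either way – which is the semantic equivalence the analysis is after.

// put(A);put(B) vs put(B);put(A): different internal order, same answers from
// every observer -- the semantic notion of commutativity the paper targets.
import java.util.LinkedList;

class CommuteCheck {
  public static void main(String[] args) {
    LinkedList<String> s1 = new LinkedList<>();
    s1.add("A"); s1.add("B");          // put(A); put(B)

    LinkedList<String> s2 = new LinkedList<>();
    s2.add("B"); s2.add("A");          // put(B); put(A)

    // Order-sensitive state comparison: reports a difference.
    System.out.println("same memory image? " + s1.equals(s2));      // false

    // Observer-based comparison: identical answers from every reader call.
    System.out.println("contains(A): " + s1.contains("A") + " vs " + s2.contains("A"));
    System.out.println("contains(B): " + s1.contains("B") + " vs " + s2.contains("B"));
    System.out.println("contains(C): " + s1.contains("C") + " vs " + s2.contains("C"));
  }
}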


Predicting Yield of GCs

Amount of memory reclaimed by GC *can* be low.  But looking at minimal GC footprints.  When GC is not productive, then time to extend the heap.  Estimate GC yield by using Virtual Memory to track pages which are touched and which are not touched – not-touched pages are probably full of mostly dead stuff.

Cliff Notes: Yawn, Olde Hat Trick.
Cliff Notes: got bored; symptom of Day 3 at a conference; worked on slides


Phantom BTB

Cliff Notes: Winner of the most ridiculous proposed hardware, backed up by a big speedup on the most unrealistic simulated hardware.

Virtual BTB Design – as in Virtual Memory not as in VMWare.
BTBs today are {small,fast,inaccurate}.  Want {big,fast,accurate}.
Allocate space for the BTB in the L2(!?!?!).  
But latency to the L2 is large, so must prefetch the BTB data.

Getting tidy 10%+ speedups on unrealistic hardware.

BTB in L2 cache is large+slow.  Missing BTB entries are evicted to L2. Then prefetch from L2 based on previous missing branches (assuming that an upcoming BTB miss can be predicted from a just-prior missing branch).  Attempted lots of other predictors; most failed miserably.

Pay-as-you go model; BTB gets effectively larger as more pressure happens on the on-chip BTB.

Cliff Notes: HotSpot inline-caches wipe out the need for faster jump-register support in hardware; standard branch predictors work well with HotSpot.  Total clocks spent doing jump-register instructions are less than 1% of all clocks on realistic hardware.

TextScan.java:

import java.lang.*;

class TextScan {

  static String test1 = "scan_for_some_token(like_this_open_paren";
  static String junk0 = "scan_for_some_token_like_this_open_paren";
  static String junk1 = junk0.concat(junk0);
  static String junk2 = junk1.concat(junk1);
  static String test2 = junk2.concat(test1);

  // A TextScan is a lite object that's really 8 longs representing
  // the bits of 64 bytes split 90-degrees from verticle.
  long b0,b1,b2,b3,b4,b5,b6,b7;

  public static void main( String args[] ) {
    TextScan[] bits = bit_split(test2);

    System.out.print("bits="+bits.length+"@(");
    for( int i = 0; i<bits.length; i++ )
      System.out.print(bits[i]);
    System.out.println(")");

    // Scan for byte
    System.out.println("Scanning for byte "+(byte)'(' + (char)0x28);
    System.out.print("[");
    for( int i=0; i<bits.length; i++ ) {
      // bit pattern 0x28 - 0 0 1 0 1 0 0 0
      TextScan ts = bits[i];
      // The open-paren char, '(', hex 0x28:
      //              0          0        1       0          1       0          0          0
      long res = (~ts.b7) & (~ts.b6) & ts.b5 & (~ts.b4) & ts.b3 & (~ts.b2) & (~ts.b1) & (~ts.b0);
      // A non-zero res means a '(' is found, and the bit-number is
      // the index into the 64-byte array where it was found.
      System.out.print(Long.toHexString(res)+",");
    }
    System.out.println("]");
  }

  // Split a String into an array of 64-bit longs.
  // Parallel-split bits: bit 0 from 64 bytes go in 'b0',
  // bit 7 from 64 bytes go in 'b7', etc.
  public static TextScan[] bit_split( String s ) {
    int slen = s.length();
    TextScan[] bits = new TextScan[(slen+63)>>6];
    for( int i=0 ; i<bits.length; i++ ) {
      TextScan ts = bits[i] = new TextScan();
      // Suck 64 bytes into bytearray from String.
      for( int j=0; j<64; j++ ) {
        int idx = (i<<6) + j;
        byte b = idx<slen ? (byte)s.charAt(idx) : 0;
        if( (b&(1<<0)) != 0 ) ts.b0 |= (1L<<j);
        if( (b&(1<<1)) != 0 ) ts.b1 |= (1L<<j);
        if( (b&(1<<2)) != 0 ) ts.b2 |= (1L<<j);
        if( (b&(1<<3)) != 0 ) ts.b3 |= (1L<<j);
        if( (b&(1<<4)) != 0 ) ts.b4 |= (1L<<j);
        if( (b&(1<<5)) != 0 ) ts.b5 |= (1L<<j);
        if( (b&(1<<6)) != 0 ) ts.b6 |= (1L<<j);
        if( (b&(1<<7)) != 0 ) ts.b7 |= (1L<<j);
      }
    }
    return bits;
  }

  // --- toString
  // Reverse bit-swapping to a String.
  public String toString() {
    String s = "[";
    for( int i=0; i<64; i++ ) {
      byte b = 0;
      b |= ((b0>>i)&1) << 0;
      b |= ((b1>>i)&1) << 1;
      b |= ((b2>>i)&1) << 2;
      b |= ((b3>>i)&1) << 3;
      b |= ((b4>>i)&1) << 4;
      b |= ((b5>>i)&1) << 5;
      b |= ((b6>>i)&1) << 6;
      b |= ((b7>>i)&1) << 7;
      if( b == 0 ) break;
      s += (char)b;
    }
    s += "]";

    return s;
  }

}