IFIP WG2.4 Trip Report

IFIP WG2.4 remains one of my favorite mini-conferences to attend.  The group is eclectic & varied; the discussion vigorous. Everybody is willing to interrupt the speaker & ask questions.  The talks are usually lively and pointed, and this time was no exception.


The conference was held in Fort Worden, a historic Big-Gun fort (planes rapidly obsoleted the guns) of the Pacific Northwest, defending Admiralty Inlet and our shores from the marauding Russians … or Japanese… or somebody, surely.  Nobody ever attacked that way, so clearly the fort was a big success.

Meanwhile, the fort has been turned into a park and it’s beautiful.  The whitewashed buildings with green trim remind me of some mythical 50’s era America that never was, the kind that shows up in the backdrop of various movies or maybe “Leave It To Beaver“.  We stayed in the officers’ quarters and held our meeting in the old mess hall.  The officers lived quite nicely; the quarters are actually old duplexes in great shape, with high ceilings and crown molding and carved brass fittings everywhere – each house is clearly larger than my house, with a vast green lawn to boot.  In San Jose such a place would be upwards of $2m…

Fort Worden overlooks Admiralty Inlet from some high bluffs (classic fort-like location) and the surroundings are all gorgeous.  The weather varied from overcast & chilly to clear & crisp (alas, overcast and chilly was the mode on our all-day “whale” watching expedition.  We did lots of watching but saw no
whales of any kind.  And we bounced over heavy seas and got sprayed with frigid water most of the day.)

Since the Hood Canal bridge was out I got treated to a 3 hr drive the long way around the peninsula, but somebody else drove and the scenery was worth looking at.  The airplane portion of the trip was easy.

As usual, my reports are (very) brief, stream-of-consciousness writing.  I often skip around, or brutally summarize (or brutally criticize – but this group is above the usual conference par, so less of that is needed).  I skip talks; I sleep through ones I’m not interested in, etc, etc.  Caveat Emptor.

Synchronization Talks
Atomic Sets – another nice concurrency alternative
Fork/Join for .Net
occam / CSP – Uber lite-weight user-thread scheduling

Java Analysis Talks
SnuggleBug – Better bug reporting for Java
SATIrE – A complete deep analysis toolkit (actually for C/C++)
PointsTo – A comparison of various points-to algorithms for Java
analyzing JavaScript – Has a type-lattice as complex as C2’s internal lattice
Performance variations in JVMs – Noticed the wild variations in run-to-run performance of JVMs

Misc
Cheap Cores discussion
Compiler-directed FPGA variations
Clone Wars – huge study on cut-n-paste code in large projects
Business Process Modeling

Security
PDF Attack – “Adobe Reader is the new Internet Explorer”
The Britney Spears Problem
Squeezing your Trusted Base
Who you gonna call?

Evolving Systems
Swarms in Space
Finding emergent behavior


Frank Tip, Type-Based Data-Centric Synchronization

Locks are hard.  Auto-inference of correct sync in O-O programs.
Even with no data-races still have atomicity races.

Instead of code-centric locking paradigms, do data-centric locking.
Tied to classes with fields, leverage O-O structures

Group a subset of fields in a class into an *atomic_set*.
Then find units_of_work for that atomic_set.

Add language features to Java:
  atomicset account;
  atomic(account) int checking;
  …
  atomicset logging;
  atomic(logging) Log log;
  atomic(logging) int logCount;

So add ‘atomicset’ keyword, ‘atomic’ keyword and strongly type variables with atomic_sets.  Units_of_work expand to reach the syntactic borders anytime an atomic variable is mentioned.  Can expand unit_of_work by hand also.
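
For instance, expanding a unit_of_work by hand might look like this – my reconstruction; ‘unitfor’ is the annotation from the paper, but treat the exact syntax as a guess:

  // Sketch only: 'unitfor' merges the argument's atomic set into this
  // call's unit_of_work, widening it beyond the syntactic default.
  void transfer(Account unitfor(account) other, int amt) {
    this.checking -= amt;   // touches my 'account' atomic set...
    other.checking += amt;  // ...and the other object's, as ONE atomic unit
  }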

Can at runtime cause atomic_sets to alias – to union together.  Used on ownership types (e.g. HashMap uses HashMap$Entry, so units_of_work on HashMap or HashMap$Entry become units_of_work for both).

Atomic_sets can be unioned; e.g. for LinkedList the entire backbone of the list is one atomicset.  Lots of discussion – points out the issues with hacking all the Code Out There.

Has proven strong serializable properties on atomicsets.  Proven various kinds of soundness, including *internal* objects are only accessed by their owner, etc.

All concurrency annotations appear in the libraries but not in the client code – shows a nice concurrent example with 2 threads manipulating the same linked list.   

Has a working version using Eclipse & taking annotations as Java comments.  Using j.u.c.Locks; can do lock aliasing by simply sharing j.u.c Locks.  Applied this to a bunch of Java Collections; 63 types & 10KLOC of code.  Needs about 1 line of annotation per 21 lines of code for the libraries (and none in clients).

About 15-30% slower than using plain sync wrappers on a single-threaded testcase; probably because the “synchronized” keyword is very optimized.  Performance is similar or slightly faster when running with 10 threads & high contention.


Daan Leijen – The Task Parallel Library for .Net

– A concurrent parallel library for .Net (think dl’s Fork/Join).

Infrastructure: locks/threads/STM, async agents (responsiveness), task parallelism (performance)

Gives a demo of a parallel matrix-multiply w/C#.  Very ugly code.

So instead, write a nice combinator…  

Standard task-stealing work-queue approach.  (private set of tasks that don’t need CAS, public set of steal-able tasks that require CAS).

Points out that want fine-grained parallelism, so can do work-stealing and do auto-load-balancing.  
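
For flavor, the same pattern in Java’s Fork/Join (Doug Lea’s library, not Daan’s .Net one – my illustration, not his demo):

  import java.util.concurrent.ForkJoinPool;
  import java.util.concurrent.RecursiveTask;

  class Fib extends RecursiveTask<Integer> {
    final int n;
    Fib(int n) { this.n = n; }
    protected Integer compute() {
      if( n <= 1 ) return 1;
      Fib f = new Fib(n-1);
      f.fork();                        // push onto this worker's private deque
      int r = new Fib(n-2).compute();  // do the other half ourselves
      return f.join() + r;             // idle workers steal forked tasks (CAS)
    }
  }
  // Usage: new ForkJoinPool().invoke(new Fib(30));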

— Effectful state.  Statically typed, with typed side-effects.  Type signature includes all side-effects (yes!)

  int fib(int n) { return n<=1 ? 1 : fib(n-1)+fib(n-2); }

But this program

  int fib(int n) { return n<=1 ? (print("hi"), 1) : fib(n-1)+fib(n-2); }

Does not type-check in Haskell – need the IO monad.  And now the function returns 2 results: the IO “hi” and the int.

So want some syntax to allow “auto-lifting” into a side-effect world – same function syntax, but auto-includes the IO monad as a result.

Then can prove that some stateful computation does not leave the function – that the state is entirely encapsulated & unobservable outside.  So can do type’ing which allows some state-ful work internally, but it doesn’t escape the function “box”.

  int fib(int n) { 
    ref f1 = 1; // side effects, but entirely internal to the function
    ref f2 = 1;
    int sum = 1;
    while( n-- > 0 ) { sum = *f1 + *f2; *f1 = *f2; *f2 = sum; }
    return sum;
  }


Peter Welch, Multicore Scheduling for Lightweight Communicating Processes

Google: kroc install

Carl Ritson did all the work…

Cliff Notes: Insanely cheap process scheduling, plus “free” load balancing, plus “free” communicating-process affinity.  Very similar to work-stealing in GC marking, but with “threads”.  OS guys take note!

Scheduler for OCCAM/PI.  Process-oriented computation (millions of very fine-grained processes).  Message passing over channels, plus barriers.  Uses CSP and pi-calculus for reasoning.  Dynamic creation & deletion of processes & channels.

Goal: automatic exploitation of shared-memory multis.  “Nothing” for the programmer to do: the program exposes tons of parallelism.  “Nothing” for the compiler to do – it’s all runtime scheduling.  

Goals: scalable performance from 10’s of processes to millions; co-locate communicating processes – maximize cache, minimize atomics (lock-free & wait-free where possible).  Heuristics to balance spreading processes around on spare CPUs and co-locating the communicating ones.  

Usual sales job for CSP… which still looks good.  

A blocked process requires only *8* words of state (no stack for you!). Scheduling is cooperative, not preemptive.  No complex processor state, no register dumps.  Up to 100M process contexts per second.  Typical context switch is 10ns to 100ns.  Performance heavily depends on getting good cache behavior, so processes are batched.  Stacks are spaghetti-stacks (which is why there’s no stack saving…).  These 8 words include the process run-queue links.

Batching: sets of e.g. 48 processes are run on a single core; that core is the only core touching this queue & round-robins them until the entire batch times out.  So no cache-misses.  A Batch is a Process, plus a run-queue.

A Core is a Batch, plus a queue of Batches.  Again, no other cores touch this list either.  

A Core with no Batches does work-stealing.  Each Core advertises some proposed Batches available for stealing.  Cores need some atomic ops to manipulate the queue of these steal-able Batches.  Careful design ensures all the interesting structures fit in a single cache line.
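
My sketch of the data layout as I understood it (field names invented):

  import java.util.concurrent.atomic.AtomicReference;

  class Process { Process next; /* one of the ~8 words: the run-queue link */ }
  class Batch   { Process runq; Batch next; }  // a Process plus a run-queue
  class Core {                                 // a Batch plus a queue of Batches
    Batch current;                   // round-robin'd privately - no atomics
    Batch batches;                   // private queue; no other core touches it
    final AtomicReference<Batch> stealable = new AtomicReference<Batch>();
                                     // the only field other cores CAS on
  }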

So until you need to enqueue, dequeue or steal a Batch – it’s all done without any contention for process scheduling.  These Batch-switching operations require atomics but are otherwise fairly rare.  

Key bit missing: scaling up beyond 8 or 16 cores with some kind of log structure.  But otherwise looks really good for very low cost context switch & work steal.  Lots of questions about fairness.

Batches are formed & split by the runtime (no compiler or programmer).  Aim is to keep processes that are communicating on the same core (so it’s all cache-hit communication), but spread the work across cores.  Initial batches are kept small to encourage spreading.  Channel ops require atomics, but typically only 1 (sometimes 2).  Choosing amongst many channels might require up to 4 atomic ops.  Each time communication happens on a channel, the sleeping process is pulled into the run-queue Batch of the awake process – thus naturally grouping communicating processes.  But as batches get large, need to split.  

Periodically can observe that if all the processes in a Batch are communicating with each other in some complex cycle – then the spare run queue empties from time-to-time (with one runnable process).  So as long as the run queue empties now & then, the Batch should stay together.  But if the run queue never empties, then split-the-Batch.  Split off a single process into a new Batch – which will drag its connected component together into the new Batch.  But eventually, if the original Batch held 2 disjoint strongly-connected communication graphs, they will now be split – and can be stolen onto other CPUs.  

So this kernel is about 100x less overhead than the same design done on a JVM.
🙂


Satish Chandra, Symbolic Analysis for Java

“Snugglebug”

Communication between the programmer & the tool tends to stop early: “Dear Programmer: you have a null-ptr bug here”…

Going to the next step: “Dear Mr Programmer: try running your method with *these* values and you’ll see the bug…”.  Programmer: now that I think about it, these values will never arise.

We just forced the programmer to define the legitimate range of values.

More: a bug report is more believable if a concrete input is given.  
API hardening: gives the preconditions under which this library will throw an exception
Unit-test generation: inputs needed to make a specific assert fail.

Works by weakest-precondition.  Given an assert in the program, work backwards computing the weakest-precondition at each point until you reach the top of the method.  If this chain works, then hand the WP to a decision procedure; sometimes the decision is “I don’t know” but many times you get an explicit counter-example.
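
A toy example of the backwards chain (mine, not from the talk):

  void f(int[] a, int i) {
    int x = a[i];    // backwards: also need a != null && 0 <= i < a.length
    assert x > 0;    // start here: the assert fails when a[i] <= 0
  }
  // WP of the *failure*: a != null && 0 <= i && i < a.length && a[i] <= 0.
  // Hand that to the decision procedure; any satisfying assignment is a
  // concrete counter-example, e.g. a = {0}, i = 0.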

Good match for the counter-example problem.
– Yes loops are a problem, but don’t really need the *weakest* precondition.

Most Java features can be modeled:
– heap r/w, arrays, allocation, subtyping, virtual dispatch
– exceptions: many bugs are here so we modeled exception handling try/catch
  very closely.

Have a path-explosion problem.  For heap updates, even a single path effectively includes branches, depending on aliasing.

Has to do strong interprocedural analysis.  Lazily grow the call-graph consistent with known symbolic constraints.  First look for a conflict on a call-free path, then on call-paths but without needing to look into the call.  Finally, peek into the call only if needed – and use the symbolic info to reduce the set of possible call targets.

Generally, no fixed-point.  Cannot expect programs to give loop invariants.
Work-list management: avoid various common pathologies.
Goal is to find *a* contradiction, not to explore all options.


SATIrE

http://www.complang.tuwien.ac.at/satire/

Shape analysis, points-to, flow-sensitive, context-sensitive, annotations, can read them in & out.  Applications to e.g., slicing.

Reads C/C++ – large complex tool chain.  Program x-forms are written in either a functional programming language or in Prolog.  Can combine the tools in any order at this point (very well integrated).  Output is C/C++ w/annotations at each step (or internal “ROSE” ASTrees, etc).  

Has interval analysis for integers, shape analysis, loop (bounds) analysis, points-to, etc.  Gives example of reaching-defs in the functional programming language.  It’s a domain-specific language for writing compiler analyses.  Can specify the strength of the analyses in many cases: e.g. the depth of the allowed call-chain for context-sensitive (set to zero for context-insensitive).

Shape analysis gets may/must alias pairs, etc – and all results put back into the C program as comments/annotations.

Basically describes a large complex tool chain for doing pointer analysis on C & C++ programs.  The tool chain looks very complete & robust.


Welf Lowe – Comparing, Combining & Improving Points-to Analysis

…skips explaining what points-to is.

Clients: compilers, JIT compilers, refactoring tools, software comprehension.
Different clients have different needs: speed, precision, security.
Granularity & conservatism also matter.  Clients might want, e.g., dead code elimination, or removing polymorphic calls, or renaming methods & calls.  Other people want object granularity – escape info & side-effects (i.e., compilers and JITs).  Also static-vs-dynamic analysis.  Dynamic analyses typically run the program and produce optimistic results.

Doing careful experiments comparing 2 analyses w.r.t. accuracy.  Hard part: can’t get a perfect “gold standard” for real programs.  Hard even for a human to inspect a proposed “gold standard” and declare it “gold” or not.  Some special cases are easier: conservative analysis (no false positives) & optimistic analysis (no false negatives).  Can under- and over-estimate the gold standard, but this still messes with the results (messes with detecting which of the 2 analyses is better w.r.t. accuracy).

E.g., ask the question: is it worth increasing k>1 (i.e., becoming context sensitive vs staying insensitive).  Checking size of computed call-graph – smaller graph is more precise.  Very small increase in accuracy.  Sometimes made things worse – because the original analysis was optimistic sometimes.
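
The classic illustration of what context-sensitivity buys (my example, not his):

  class A {}
  class B {}
  class PointsToDemo {
    static Object id(Object x) { return x; }   // one method, two call-sites
    static void demo() {
      Object a = id(new A());  // insensitive: a & b both point to {A, B}
      Object b = id(new B());  // 1-CFA (k=1): a -> {A} only, b -> {B} only
    }
  }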

Open question: what is the use of an Uber Points-To – it’s definitely beyond the point of diminishing returns for JIT compilers.


Anders Moller – Type Analysis for JavaScript

Not much tool support for catching errors during programming.
No nice IDE support.  Many bugs are browser specific, but we focus on language bugs.

Most JS programs are quite small, so throw a heavy-weight analysis at it.

Object based, dynamically typed, prototype inheritance, 1st-class functions, coercions, dynamic code construction, exceptions, plus 161 built-in functions.  Types include null & undefined, primitive types, some properties “ReadOnly” and “DontDelete”.

Broken things: tracking down an “undefined” and where it came from, reading an absent variable or absent property, etc.

So do a dataflow analysis over the program, including a lattice, transfer functions, etc.  Handle interprocedural stuff.  Tightly interleaved control flow & data – so need a complex lattice.  

Lattice is roughly as complex as C2’s lattice, with more stuff on the allocation sites and less on the class hierarchy.  Also then model all the abstract machine state (C2 does this as well).  Full context info (total inlining at analysis time except for recursion).

Cliff Notes: Nice idea: for each allocation site, model both the *most recent object* from that allocation site, and also model *all the rest* from that site.  You can do strong-update on the singleton most-recent object, but weak update on all the rest.


Rhodes Brown – Measuring the Performance of Adaptive Virtual Machines

Noticing the badness of the “compile plan” – 1 run in 10 had a >10% performance reduction.

JIKES – always compiled, no interpreter.  On demand at 1st invoke.  Focused recompilation, lots of GC variants.

Points out the places where you get variations – stack-profiles for estimating the call-graph, basic-block counters, etc.

Causes of variation?
So far: GC & inlining
Concerned that i-cache placement is a major issue of instability.

*No* correlation between total amount of compile-time spent vs score.  You must compile the hot stuff, but apparently tons of not-so-hot code is getting compiled and it doesn’t help the score any.


Kurt William – lots of cores for cheap

General discussion round, not really a “talk”.

Expect >32 core/die in 5 yr, >128 core/die in 10 yrs.
What’s “New” and what can we expect in 5 years?

Theory: contention is that there’s nothing new here since the 70’s & 80’s (concerning concurrency).  (general agreement on incremental progress but not breakthrough).

Systems: Lots of libraries (TBB, pthreads), lots of platforms (webservers), all OS’s support multi-core.  JVMs are New in this space.  Might see OS-support for user-level thread scheduling.

Languages: Support been there, but it’s locks & threads – no good.  Support for TM is coming, but will it help?  Then there’s CSP…

Applications: Not much new here…. more games, more interactive, more user-interface.  Old stuff: lots of HPC, speech, image.  Matching: easy parallelism; lots of interesting apps here.


Uwe Kastens, Compiler-Driven Dynamic Reconfiguration of Arch Variants

What is “reconfigurable”?  “Usual” approach – HW guys compile to a netlist, do some mapping, place & route, make an FPGA/silicon, while the SW guys write some assembler on the FPGA – then at the end they try to run the SW on the HW.  Goal is to run some complex function faster on an FPGA than in normal ops.

Our approach: read source code, look at program structure, find hot code & hot loops; compiler knows how to switch the hardware between variants; then it compiles the code for a particular FPGA variant for each loop, and reloads the FPGA with a new function between loops.  But need to limit the FPGA to certain variants so don’t actually need to reload the whole FPGA.  

Fixed small set of architecture variants.  Keeps the cost to reconfigure the FPGA very low.  Compiler can choose the variants.  Example: FPGA runs either a small CPU or a small SIMD CPU or a MIMD CPU.  Or reconfigure the pipeline structure, or the default register-bank access patterns.  The different configurations can use more or less power for problems that are more or less parallel.  E.g., in SIMD mode have all ALUs active but turn off all but 1 decoder.

Pick variants & design space to those known to be efficiently solved by a compiler – gets a fast compile time & good results there (use a human elsewhere).

Can map any arch register to any physical register – so can e.g. have CPUs share registers, or do unbalanced register allocation – if some CPU needs more registers in its portion of code than others.  Does some register-allocation affinity games, to avoid having to reconfigure register layouts between blocks of code.  

Nicely flexible overall design.  Instructions name arch registers & ops, but also can re-map real registers & ops periodically.  So use targeted registers & ops for a particular chunk of code, then reconfigure when the next chunk of code is different enough from the last chunk – and the cost to reconfigure is lower than running with less-well-targeted registers & instructions.

Ugh, can’t read data charts – has some 4 embedded programs.  Seeing 30% reduction in execution time AND power cost, by reconfiguring.  Code size is a little larger, because of either poor parallelization(?) or the reconfiguration overhead.


Jim – Clone Detection

Exact clone’d code, white-space only changes, near-miss clones.
Bug in 1 deserves inspection in all the rest.

Lots of cloning studies out there – especially in-depth studies of the Linux file system.  But no wide-range-of-systems studies.  We did a bunch of open-source systems, C, Java, C#.  OS, Web apps, DBs, etc.  Used their own clone-detection, as a combo of AST-detection and text-based detection.

Text-based comparisons are sensitive to formatting changes.  But AST parsing is expensive (requires tree compares) and does not find “near miss” clones.  Ended up doing “standard pretty-printing” – which uses the AST to normalize the text.  Then do text-based comparisons.  Also handle small diffs in the middle of clones.

Did 24 systems, 10 C, 7 Java, 7 C#.  Linux 2.6 kernel, all of j2sdk-swing, db4.  Varied from 4KLOC to 6MLOC.

Java methods have more cloning than C methods – 15%-25% (depending on the diff threshold) of Java methods are clones; C varies from 3% to 13%.  C# is more like Java, until the methods allow more diffs – and then there is a lot more cloning.  Swing is hugely clone-ish, >70%.

C systems – Linux has the same cloning behavior as postgres & bison & wget & httpd.  Java shows no more clones as you allow more diffs, unlike C – I suspect good Java refactoring tools detect & remove clones earlier.

C# shows a LOT more clones as more diff’s are allowed.  Maybe cut-n-paste is a coding style?  C#’s clones are larger methods as well.

Something like 10-20% of all Java files have 50% of their functions cloned.  In C# it’s more like 20-35% of all files.

Files with 100% cloned code: 15% of C# files are totally cloned.  For Java whole file cloning is fairly low.

Localization of cloned pairs: are they in the same file?  same directory?  different directories?


Thomas Gschwind – Business Process Modeling Accelerators

Business Process Modeling, patterns, x-forms, refactorings
Process Structure Tree – 

 – Process is a large unstructured graph.  Visio-style graphical editing is error-prone for large process models.

But can take the structure graph and build a tree from it.
(The graph is all fork/join parallelism).
A small part of the tree requires state-space exploration.

After the tree is correct, can apply patterns & Xforms to the tree.

Essentially Petri Net edit’ing on a structured graph/tree using Eclipse.

100’s of users inside of IBM; generated business models are much higher quality, make them 2x faster.


Philip Levy, The New Age of Software Attacks

Why? 

  • Not enuf to do?
  • Crime is Big Business
  • Usable exploit: can be sold for $75K to $100K
  • So called “security researchers”.  Some are respectable, most are going after the $100K payoff.


“Visual Virus Studio” – actually exists.  PDF is a common vector for *targeted* attacks.  “Adobe Reader is the new Internet Explorer”.

Looked at all the fixed vulnerabilities in Adobe Reader

  • Need to do a better job teaching students how to write secure software
  • Stop writing in assembly language
    • ANY unmanaged code, including C/C++, should be banned

  • Languages need to be upgraded to be more secure
    • New constructs & type systems

  • More sophisticated testing tools & methodologies
  • Better analysis tools for legacy systems


How big is Adobe Reader?

  • 100K FILES
  • 42MLOC, no comments or blank lines: 25MLOC
  • Estimated 75 man-years of development per year over its lifetime


Problems are clustered – e.g. one module has had 25% of all security problems.
Top 6 modules have 80% of security problems.

Break down problems another way:

  • array index OOB – 32%
  • unknown – 27%
  • integer overflow – 14%
  • range check missing – 8%
  • capability check missing – 4%
  • out of memory – 3%


How found:

  • Fuzzing – 42%  (creating illegal inputs and throwing them at the program)
  • Code Review – 35%
  • External report – 12%


Some examples:

  • Asserting against null – but asserts are compiled out of the release build, and null is possible in the release build.
  • Failed to check for error returns, sometimes not even capturing the return value
  • Casting “attacker controlled values” (values from the PDF file) to a valid enum
  • Or using attacker-controlled value for loop & array bounds
  • Sometimes a complex pre-condition needs to be checked first
  • Some of the fixes are still incorrect


A Quick List:

  • Badly designed lib funcs (strcat, sprintf)
    • Microsoft’s banned API list is 4 pages

  • AIOOB
  • Calling thru func ptrs
  • Integer over/underflow.  Many protocols send N followed by N bytes.  Send a really big number that overflows (after scaling for malloc) into a small number – then malloc succeeds, but the data overwrites the buffer.  (Sketch just after this list.)
  • Pointer math
  • Uninitialized vars
  • Dangling ref pointer
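
The N-followed-by-N-bytes bug in a nutshell (my sketch; ‘in’ is a hypothetical DataInputStream over the attacker’s file):

  int n = in.readInt();            // attacker-controlled, say 0x40000001
  byte[] buf = new byte[4 * n];    // 4 * 0x40000001 wraps around to 4!
  for( int i = 0; i < n; i++ )     // ...then the copy loop runs 0x40000001
    buf[i] = in.readByte();        // times.  In C this tramples the heap;
                                   // Java at least throws on the OOB store.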

Basically: it’s BASIC STUFF that goes bad, not fancy stuff.

All memory integrity or soundness type stuff.
But it goes beyond AIOOB – 
e.g. a Name string like “Robert'); DROP TABLE Students;--”


Jeff Fischer, Solving the “Britney Spears Problem”

Users are assigned roles; roles are assigned privileges. 
Benefit: users can come & go without changing policies.

“Patients” can view medical records, but “doctors” can change them.

Can add annotations to Java that specify roles.

Problem: lack of expressiveness; solution: logging.  Violation happened (people saw Britney Spears’ data), but the logging allowed catching the violators.  Another solution: manual checks.  But no compiler support, so missed checks.

Problem: Lack of explicit access policies.
Really: need a way to “type” data (and hence methods using the data) with some kind of security/access policy.  More problems: often checks are hoisted to entry point for efficiency, and then violate policy later.  Relying on manually-inserted runtime checking.

Our stuff: parameterize classes by some kind of index value defining Role. 
Each method includes an explicit policy (statically checked).
Annotation processor can either statically type-check the policy, or insert runtime checks.
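
My guess at the flavor – the annotation names here are invented, not their actual syntax:

  @interface Role { String value(); }     // invented annotation, for flavor

  class MedicalRecord { String note; }

  class Records {
    @Role("doctor")                       // policy in the method signature:
    void update(MedicalRecord r, String s) { r.note = s; }  // checked statically
                                          // where possible, else a runtime check
    @Role("doctor || patient")
    String view(MedicalRecord r) { return r.note; }
  }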


Jens Knoop – Squeeze your Trusted Annotation Base

“Interesting program properties are undecidable!”

Compiler optimization – not so worried: just lose some optimality.  Results still usable; but if we have better results we can produce better code.

Worst-Case execution time analysis for hard real time: Bang.  You’re dead.
WCET bounds must be safe (soundness), and wish to be tight (optimal or useful).
E.g., a bound of 1hr for a mpeg3 audio-player interrupt is safe, but useless.
Need, e.g., all loop bounds.  
Need user annotations.
User annotations are untrustworthy, but must be trusted.

Squeeze the trusted annotation base – reduce user chance of screwing up.

Try to replace trust by proof – as a side-effect generally improving optimality.  Then take advantage of trade-off between power & performance of analysis  algorithms.

Use model-checking with care.  If something can’t be bounded automatically, ask the user for some bound – then prove it.  But also assume the user’s bound isn’t tight: try to prove a tighter bound with the model-checker using binary search.  Indeed – can often do away with user assistance entirely; just guess and binary-search to tighten it.
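
The guess-and-tighten loop as I understood it (‘Loop’ and ‘proves’ are my stand-ins for their model-checker interface):

  int tightLoopBound(Loop l, int guess) {
    int hi = guess;                    // from the user, or just a guess
    while( !proves(l, hi) ) hi *= 2;   // first find *some* safe bound
    int lo = 0;
    while( lo < hi ) {                 // then binary-search it tight
      int mid = (lo + hi) / 2;
      if( proves(l, mid) ) hi = mid; else lo = mid + 1;
    }
    return hi;        // safe, and as tight as the model-checker allows
  }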

Summary: can often completely empty the trusted annotation base, usually shrink it; often improve WCET bounds.

Start with the cheapest techniques (constant prop), move up to interval analysis, then symbolic bounds & model checkers…


Joe Newcomer – Who do you trust to run code on your computer?

(Joe is a Windows attack expert)

Attack Vectors – autoexec.bat, .cshrc & friends, autoplay, device-driver installs (Sony), mac resource attacks

Engineered attacks: phishing attacks, downloads, activeX, client-side scripting

Lots of “security” designs & codes are done by junior programmers w/out adequate oversight.  Pervasive attitude: “software is secure – as designed, as implemented, as maintained”.  Small+fast is all that matters…  not…

Rant against all modern OS’s, talking about # of security patches in Mac, Linux, Unix, Windows…

Client-side scripting is vulnerable. Flash, Office, Adobe Reader, etc.

What are we teaching people about security?  How about non-CS types? 
Graphics designers, website designers…


Stefan Jahnichen, Swarm Systems in Space

Build very small satellites, swarms of them to watch the Earth.  They will spread out after launch to watch e.g. weather, bush fires, etc.  Takes about 2 days after launching to get a communication link.

Satellites have good earth coverage; orbit at 500 to 600km, circling the earth every 90 minutes or so.  Wide range of sensors amongst different satellites.

Have a Real Swarm of cameras; map a Virtual Swarm onto that; map mission goals on the V.S.  

Optimized synthesis of consensus algorithms.

Special OS – distributed OS basically.  Used to maintain swarm integrity (keep satellites together).  


Peter Welch, Engineering Emergence…

Thesis: in the future, systems will be too complex to design *explicitly*

Instead engineer systems to have the desired properties “implicitly”.

“mob programming”, ant mounds, etc…. simple rules of behavior leading to complex behaviors that emerge.  Emergent Behavior

Mechanism Design (game theory):

  • Rational actors have local private info
  • Emergent: optimal allocation of scarce resources
  • Optimal decisions rely on truth revelation


Swarming behavior (flocks, etc)

  • local actors, local behaviors
  • Emergent “swarm” behavior
  • UAV swarms & autonomous robots


Social communication (gossip, epidemic algorithms)

  • large, ad hoc networks
  • emergent: min power to achieve eventual consistency
  • low power, low reliability sensors & data propagation


Design for emergent behavior

  • occam process per “location”, a 1024×1024 grid has 1mil processes
    • each location has a com channel to its 4 (or 8) neighbors

  • occam process per “mobile agent”
    • agents register with local process (or up to 4 if it’s straddling a local region) & local process tells the agent what’s nearby.

No idea how to design high-level behavior from low-level rules, but busy experimenting and looking for some cases.

Cliff


DaCapo Trip Report

http://www.cs.tufts.edu/research/dacapo

DaCapo is a small (but growing) research group focused on GC & Memory Management issues.  They meet about every 9 months, and the meeting bounces around depending on who is hosting it.  People mostly present work-in-progress, and the audience is up front with their critiques… often interrupting the speakers or getting sidetracked into loud off-topic conversations, i.e., it’s a fun boisterous crowd.

DaCapo is in the Boston area this year, held at Tufts University’s Medford campus (everyone’s heard of Harvard and MIT, but how about the Berklee College of Architecture, Fisher College, Boston University, Suffolk University and dozens and dozens more? – Boston has more colleges and universities per square mile than any other place I know).  As usual for me traveling to Boston, I skipped the car and just used public transportation – the T works marvelously well (as compared to e.g. VTA; crossing the entire town from Logan Airport stuck way out in the water to Tufts some 8 miles away took 1 subway transfer and 1/2hr, while crossing San Jose on the VTA takes something like 2 hrs). 

Anyways the weather was supposed to be highs of 65 and rainy with lows in the 40’s – so I packed my parka.  Turned out the weather report was a little bit off, with highs in the upper 80’s and sunny.  Man did I roast.  But I got out and enjoyed the sun – we (the entire DaCapo group of 40 or 50 people) walked about a mile to lunch each day, plus I walked to and from my hotel & the T stop (also about a mile).  Enough ruminating, on to the talks!

As usual, I pay attention to some talks and chat w/neighbors through others.  I brutally summarize.  I’m opinionated & narrow focused… so Caveat Emptor.

Change Tracking

What changed?  Who did it?  When?  Points out the flaws of using ‘diff’.

Logical-Structure diff.  Auto-magically spotted systematic change patterns.
It’s a better code-review diff.

Can report diff’s as reports like “for all XXX, replace XXX with YYY except for ZZZ_XXX_ZZZ”, etc.  Vastly better error reporting!!!  Vastly better diff reporting!!!

Cliff Notes: We must get this gizmo and check it out!!!!

Paper report: http://users.ece.utexas.edu/~miryung/Publications/icse09-LSdiff.KimNotkin.pdf
Sample output: http://users.ece.utexas.edu/~miryung/LSDiff/carol429-430.htm

Inferred Call Path Profiling

Very low cost path profiling.  Hardware counters have very little cost, but provide little context.  Software is reversed: high context & high cost.
Cliff Notes: Compare this to call/ret RPC hashing.
Cliff Notes: Bond & McKinley do something similar.

Instead, look at number of bytes from main() to leaf call on stack.  Basically PC & SP (offset from stack base) provides all the context we need.  Dynamically detect unique PC & SP (they did it in a training run) pairs, and reverse them to get a call-stack.

Lots of duplicated stack heights, how to tell them apart?  On SpecInt unique about 2/3rds of time (2/3rds of PC/SP pairs map to a single call stack).  But huge spread; really sucks bad on some benchmarks.

If have a list of entire call-graph, then can remove ambiguity by growing some frames.  But requires some global analysis to know which frames to grow in order to be unique.  “Solved” problem with a simple greedy search.  Compute all call-stacks over whole program run, pick a conflict, grow some frame, re-compute & keep if more PC/SP pairs are distinct.  Repeat until it’s rare enough that can live with dups.

Cliff Notes: The whole-program ahead-of-time thing isn’t feasible for us.  How to do this on-the-fly during tick-profiling?  Hash PC/SP, find a tick record, 1-time-in-1024 check for dups (crawl stack the hard way and confirm call-stack), otherwise assume unique & increment tick record.  If MISS in hash table, then crawl stack the hard way.  If record is labeled as having dups, then crawl the call-stack & find true record?)  Actually, probably expect the call stacks to be common, but the PC’s to be fairly unique.  So really asking for unique RPC+SP (not PC+SP).

His experiments: collect PC/SP pairs about 100 times/sec.  0.2% slowdown (must have had careful measurements to detect that!)

 

Breadcrumbs – Recovering Efficiently Computed Dynamic Calling Contexts

Also builds on probabilistic calling contexts. 

Given a (nearly) unique ID (say, computed at each fcn entry via a simple hash on the return PC).  In the past, reversing these unique ids: use a dictionary (very large & high cost), or use a calling-context-tree (smaller, but requires you to track where you are in the CC-tree at runtime).  Forward search, but undecidable and limited options for pruning.

Attempt to reverse the hash fcn (but only when we need to take the semi-unique id) and convert it to a call-stack.  Straight-up reverse is no good, but is easy to guide search.  Callers of leaf call themselves have structure – e.g. expect to see a call-op from the call-site, etc.  Still need a list of all call-sites (generally available anyways because of inline-cache structures).  His function is “p’ = (3p+c)”, where “c” is RPC?  Requires a handful of instructions on function entry.
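
So function entry does something like this (my rendering of their hash):

  static int p;                  // (nearly) unique calling-context id, kept hot
  static void onEntry(int rpc) { // rpc = return PC of this call
    p = 3*p + rpc;               // a couple of instructions per call; cheap,
  }                              // and structured enough that a backwards
                                 // search over known call-sites can invert it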

Cliff Notes: If this reverse search is successful most of the time, it says that during a profile tick we can record just the PC/SP (or RPC/SP/PC?) and then reverse them to the entire call-stack later – on demand when RTPM requests.  So a ‘tick’ is no more than record a few bits (and handle buffer overflow), but no stack crawl.

Cliff Notes: combines nicely with ‘null-assignment-stack’ info, i.e. collect the entire stack and ‘color’ all NULLs with a unique stack id.  Then on an NPE we can report both the crash site, and also the call-stack that assigned the NULL in the first place.

 

Transparent Clustering – Phil McGachey, Eliot Moss, Tony Hosking

Attempt to auto-distribute multi-threaded programs.  Pure Java, no JVM hacks.  No programmer involvement.  Dynamic class re-writing.  JVMs can vary.  Hardware can vary.

All machines on a network run RuggedJ nodes.  RuggedJ includes a rewriting class loader, some runtime libs, and some application-specific partitioning strategy.

Rewriter: generate lots of classes & interfaces; especially hook methods & fields.  Hook into the runtime.  Abstract over object location; remote & local objects.

Ex: Standard class “X”.  Rewrite into X_Local and X_Stub classes.  X objects on Node A are of type X_Local.  On Node B, X objects are really X_Stub objects where all the calls are remote calls.  Both X_Local and X_Stub implement the “X” interface.  Some other Y_Local can refer to X_Local on Node A, but refer to X_Stub on Node B.  If want to migrate from Node A to Node C, want to indirect the X_Locals; so for mobile objects there’s an indirection layer called “X_Proxy”, which can then be switched from pointing to an X_Local vs an X_Stub (if the object migrates).  Non-mobile objects skip the X_Proxy indirection layer. 
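
My reconstruction of the rewrite (the remoteCall glue is invented):

  interface X { int foo(); }                  // the original class becomes this
  class X_Local implements X {                // the real object, on its home node
    public int foo() { return 42; /* original code */ }
  }
  class X_Stub implements X {                 // what every *other* node sees
    int node, id;
    public int foo() { return RuggedJRuntime.remoteCall(node, id, "foo"); }
  }
  class X_Proxy implements X {                // extra hop, for mobile objects only
    volatile X target;                        // X_Local now; X_Stub after migration
    public int foo() { return target.foo(); }
  }
  class RuggedJRuntime {                      // stand-in for their runtime libs
    static int remoteCall(int node, int id, String m) { return 0; }
  }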

Inheritance & interfaces are maintained in the rewritten classes.  Static classes conform to the RuggedJ model (I assume replication is allowed?).  Arrays are also wrapped to support indirection. 

Constraints: Native code breaks everything.  Try to not rewrite stuff referred to by native code (maybe can intercept all native calls and rewrite args?).  System code is also an issue: loaded by the system loader, is not rewritten.  (Can they use -Xbootclasspath?).  But most code run is actually system code!  (eg. String libs are very common!)

So must deal with system code without changing the JVM. 
Can wrap system class objects.  Must unwrap when passing to system or native code; must re-box on return.  Re-wrapping is expensive: need the exact same wrapper, so need some kind of hash-lookup to do re-wrapping.  High overhead, so avoided where possible.
Extend: some classes can be extended and preserve the original class & signatures.  Extended class follows the RuggedJ model but also the original system model. No need to unwrap/re-wrap.  Does not work e.g. when system code generates the object (since sys code makes instances of the original and not of the extended subclass). 
Some system classes are only ever referenced from user code; can just dup the code out of rt.jar and make a user-class (which now gets rewritten by the normal RuggedJ model). 
Immutable classes.  Just replicate.

Apparently able to cluster some complex programs.
No idea about performance as compared to explicitly clustered apps.

 

Jikes RVM Native Threads

Was Green threads, now Native threads
(everybody else is dropping green threads too….)
Motherhood & Apple Pie.

 

Computer Science as an Experimental Science

Reproducibility?  Experimental bias?
Show how these problems are solved in very-high-energy astrophysics.

(1) No publish without multiple colleague confirmation.
(2) Publish
(3) MUST also now publish reproductions or non-reproductions of other people’s experiments (or lose credibility as a research entity).  But the bar to publication is lower, and expectations are lower.  i.e., people expect to publish in-progress work

Biggest comp-sci problem now: lack of infrastructures.
Examples of infrastructure bias: green-vs-native threads, some barrier styles are hard to use (.Net embedded objects), stack-scan (OVM makes fast stack scanning at the cost of everything else being slower).  HotSpot (high-value compiler, good GC, good most everything) vs JIKES.

Cliff Notes: This was actually a really fun talk (and lots of discussion) despite my paucity of notes. 

 

Web 2.0 Profiling

Do web 2.0 workloads differ from legacy ones, e.g. SpecJVM98 or the DaCapo suite?
Relies on browser to run Flash, JavaScript, REST, AJAX.
Benchmarks: IBM social networking software; Java PetStore 2.0, AJAX-based RIA using JEE5, uses Lucene search.
Apply workload: viewBlog (from 9 blogs), createBookmark (total 14), selectTag (PetStore 9), etc. 
Workload mix can be varied, but based on commonly observed usage patterns from internal deployment of Lotus Connections.

Wanted to remove DB from the workload, put DB on ramdisk.  Same for LDAP.  Used Grinder to generate load.  Main server running J9/WebSphere/Lotus and also Sun Glassfish/Petstore both on 2-way power5 1.66GHz.  Fairly reasonable setup for a small web/app load.

See lots more chattiness; very short fast requests, high interrupt rates, smaller requests but many more of them.  Looked at low-level profiling.  See fewer I-cache misses as compared to large legacy apps.  Better I-cache behavior.  But data-cache miss rate is still very significant.

Also different transactions have different scaling limitations.  For PetStore the Java persistence layer started falling off after 4 cpus.

 

Stream Processing

“Spade” – Cluster stream processing at IBM.  Continuous high-level streams.  Streams arrive over weeks & months; want to process and then emit stream data continuously.  Want to express the problem with a language.

Priorities – fast streaming on a cluster; then Generality; finally Usability.  Beyond StreamIt.  Like StreamIt, works with streams that can be split & joined.  Language is typed; stream contents are strongly typed. 

 

Demystifying Magic – High-Level Low-Level Programming

Seen this talk before, but this time given by Steve Blackburn.  Nice historical perspective on writing system software in C vs asm – and the parallels to writing sys software in Java vs asm.

Java-In-Java
Key Principle: Containment.  Write as much as possible (99%!) in pure Java, and dive into the low level “magic” bits as little as possible.
Extensibility: Requirements change quickly, languages change slowly.
Encapsulation: contain low-level semantics
Fine-grained lowering of semantics: minimize impedance, separation of concerns

Framework:
Extend Java semantics; intrinsics
Scoped semantics changes
Extend types; un/boxing, ref-vs-value, architecture sizes
Pragmatic: retain syntax where possible to keep front-end tools working

Introduce raw pointer type: “Address” looks like a Java type.
E.g.: Address a = …;   … = a.loadByte(offset);
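
Flavor of the approach – close to JikesRVM’s org.vmmagic, but treat the exact signatures as my guess:

  import org.vmmagic.pragma.Uninterruptible;
  import org.vmmagic.unboxed.Address;

  class LowLevel {
    @Uninterruptible                  // subtractive regime (see below): no allocation, no GC
    static void zero(Address base, int nbytes) {
      for( int i = 0; i < nbytes; i++ )
        base.plus(i).store((byte)0);  // a raw byte store - no checks, no barriers
    }
  }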

Semantic Regimes:
Additive: “it’s OK here to have unchecked casts”
Subtractive: “no allocation via new allowed”, similar to HotSpot VerifyNoGC.

Allow “unboxed” types – no v-table, so it’s a bare primitive type.  But this isn’t the default (but can be asked for).

Abstraction results are good: can implement a JVM on bare hardware and with a simple change essentially virtualize the same JVM inside a GC debugging harness running inside another JVM (where the virtualized JVM’s heap is just a big array in the outer implementing JVM), etc.

 

Grace, Safe Multithreaded Programming for C/C++ – Emery Berger

 

Seen the paper before, but this is the talk by Emery.

Fake sequential semantics on stock multi-threaded hardware using fork/join and process protection. 

Grace is a pthreads replacement: “g++ -lpthreads” becomes “g++ -lgrace”.

Speedups: in the absence of common benign races, Grace-run programs run about 10% slower than raw pthreads – but have full sequential semantics.  It’s “as-if” all the parallel programs are run sequentially immediately at the thread spawn point. 

CAN’T run applications where threads run forever, i.e. reactive or server apps.
Works well with fork/join, Cilk, TBB, etc.

So thread spawn becomes a unix fork with COW, using mmap to allow memory sharing.  At join points, smash in the join’d threads’ memory updates via mmap.  Also need a scalable thread-local malloc heap, plus aligned globals (to avoid false sharing at the page granularity), plus some improved I/O.

Some simple measures remove nearly all false sharing.  Big one: everybody mallocs into their own space.  2nd big one: spread the globals out 1 per page.

Thread-spawn is as cheap as a fork on linux (has experiments to show it).  Due to thread-cpu affinity, if you spawn a bunch of threads they share a single CPU.  At the low end of thread-grain-length, fork is faster than spawn because the scheduler immediately spreads the processes across CPUs, whereas threads share for awhile.  So fork is actually faster than thread-spawn for awhile.

Real performance limiter for Grace is conflicts & rollbacks, and not thread-spawn overheads.

Performance is much better than e.g. STM, because after the 1st touch on a new page (and the SEGV & accounting), all accesses on that page run at full speed.

 

GC Assertions: Using the GC to check heap properties

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.110.3949&rep=rep1&type=pdf

Global knowledge is required to understand seemingly local properties.
JBB example:

    Order order;
    for( … )
      order = …;
    delete orderTable;   // pseudo-code: drop the whole table
    // ? are all ‘orders’ dead here?

But actually ‘order’ is held by the last-customer-transaction (as an optimization, or convenience for the customer?).  Leads to a leak of ‘orders’.  Really: the programmer does not understand the program; it’s too complex.

Add assertions to the code, and use the GC to check the property.

Cliff Notes: Azul is already gathering global heap properties during GC – but not asserts.  Gathering liveness & points-to info.  If points-to included a handful of sample real instances (and all the samples linked when possible), it would be a very powerful way to instantly check some heap properties.

Sample asserts:
‘assert_dead(p)’ – expect to reclaim ‘p’ at the next GC cycle
Or region-collection: start_region(); … assert_all_dead();
Or shape property: assert_unshared(p);
Or ownership properties (members of a collection are ‘owned’ by the collection)

If an assert fails, then do a slow crawl and provide the full path through the heap showing the failure.   Asserts only checked at GC points.

 

Proving Correctness of Abstract Concurrency Control and Recovery – Trek Palmer & Eliot Moss

 

Transactions – closed nesting has issues
Open nesting & Boosting need conflict info & roll-back info from the programmer.
These are hard to provide correctly.

This work: a description language of abstract data types or structures.
Can describe the conflict predicates & inverses.
Can prove correctness of the conflict & inverse expressions.

Working with the abstract description of the data structure, and NOT e.g. with the real Java implementation.  E.g., no loops, no mutations, no recursion.  So the language isn’t a ‘real’ language, but can describe many kinds of ADTs in code that looks sorta like Java.

Output isn’t a functional program, instead the output is the result of using a SAT solver to prove correctness. 

XTN-semantics allows conflict detection to be proven correct pairwise, instead of having to do full interleaving.  Formally, a conflict predicate is correct iff it is true when the operations do not commute.

The conflict predicate tells when 2 XTNs conflict, and the inverse allows an optimistic XTN to rollback.

Obvious use-case: use this tool to write a transactional-version of NBHM or other JDK concurrency utilities.

 

Hard Real-Time GC Scheduling

– Periodic Scheduling
  – GC runs at highest priority, but periodically yields to the mutator
  – Metronome
– Slack-based Scheduling
  – GC runs at lowest priority
  – Can be preempted at any time
  – Mackinac (Sun RTS)
– Work-based Scheduling
  – GC runs at allocation time
  – Problem with allocation-rate jitter

HRT – Deadlines must never be missed AND must be verifiable analytically that no deadlines are missed.  Systems tend to therefore be very simple, so that they can be proven.

OVM – like Metronome.  Dynamic defrag; arraylets; Brooks-style barrier; replicating barrier; incremental, concurrent, supporting slack-based scheduling.  

 

Understanding Performance of Interactive Applications

Typical profilers give the wrong numbers: they report total time spent, not time spent between mouse-click and page updated.

AWT/Swing & SWT – similar: single-threaded, gui thread in “event loop”, relies heavily on Listener model.  So profile using gui call-back Listeners between user events and the eventual display change.

Really simple idea: profiling & event visualizer tool between click & view times.

Program Metamorphosis

Trouble w/refactorings: must preserve program behavior with each step.  But if we want multiple refactorings then at each step along the way fixup code (needed to preserve program behavior) pollutes the code.  Want to do multiple refactoring steps at once – while leaving the program possibly broken in the in-between steps.

So compute a program-semantics (names & def-use chains), and let the user make partial refactoring steps and then declare “I think I’m done” – and the system compares the new program-semantics with the old, and reports if they are not equal. 

So actually comes up with primitive not-quite refactoring steps, which he will compose to build larger “real” refactorings, or compose lots to build a multi-stage refactoring. 

Instruction Selectors

HotSpot C2 uses a BURS framework.  Very hard to optimize & debug.  So these guys have a semantics which can span both the ideal IR and real machine instructions.  Describe both using “CISL” and the tool does the mapping.

Once the tool has a mapping, the user provides an adapter that converts the mapping into the code needed by the compiler back-end.  This would replace e.g. the ADLC.  Gives examples of machine encodings for PPC and Java bytecodes.

CISL is given an IR tree and produces a target tree with equivalent semantics.

Not sure it’s entirely better than BURS (maybe no worse), but I’m still sold on hand-written greedy match rules in e.g. C++/Java.  These code-gen-gen tools are a pain for long-term maintenance.

 

Do Design & Performance Relate?

Is fast code required to be ugly?  Is beautiful required to be slow?

Pick 200 code metrics.  Also pick some performance metrics (cycles, single threaded, objects created, etc).  Could not follow talk… or at least he wasn’t very clear with the objectives & results (if any).  Interesting idea though.

 



Comments

Hi Cliff,

Do you have any pointers to the paper — if any — that went with the Rhodes Brown talk? Googling did not yield anything useful 😉

Thanks,
Andy

Posted by: Andy Georges | Jun 4, 2009 12:43:54 AM

Try: “Statistically rigorous java performance evaluation” and “Wake up and smell the coffee: evaluation methodology for the 21st century” as a start.

Also this paper: “Relative factors in performance analysis” – late in section 5 shows that a minor offset in the VM’s text segment can vary performance by 6%; compiling in Intel HPM code (and not using it) can vary performance even more.

But I can’t find a reference to the result that “changing your environment variables changes your stack/text placement changes performance by 10%” – but I’ve seen the effect & worked on fixing it before (this & a related effect triggered Motorola to do an i-cache code placement optimization in ~1995).

Cliff