An API For Distributed Analytics

There are so many APIs to choose from!

Features of the space:

  • Lots of data – which I’ll qualify as “bigger than 1 machine” and thus needing parallel I/O, parallel memory, & parallel compute – and distributed algorithms.
  • Ease of programming; hide details (but expose them when you want to).  High level for ease-of-use, but “under the covers” has to be easy to understand – because no tool solves all problems – so expect extensions & frequent 1-off hacks.
  • Speed: In-memory by default, where memory can range from 2G to 2T and beyond.  Data placement is required (do not schlep data about unless needed, move code to data, no disk I/O by default).
  • Speed By Default: the normal/average/typical programming style will be fast.  You can “trip over yourself” and go slow, but normal usage is fast.  It should be obvious when you’re moving away from “go fast” to “unknown speed”.
  • Correct By Default: the normal/average/typical programming style will not be exposed to weird corner cases.  No data-races (unless you ask for them), no weird ordering rules, no job scheduling, no counting mappers & reducers, no figuring out sharding or data placement, nor other low-level, easy-to-get-wrong details.  Resource management is simple by design.
  • Accessible for non-expert programmers, scientists, engineers, managers – looking for a tool, not wanting the tool to be more complicated than the problem.

 

Design decisions:

Automatic data placement: It’s a hard problem, and it’s been hard for a long time – but technology is changing: networks are fast relative to the size of memory.  You can move a significant fraction of a cluster’s memory in relatively little time.  “Disk is the new tape”: we want to do work in-memory for that 1000x speedup over disk, but this requires loading data into memory spread across many little slices of the cluster – which implies data placement.  Start with random placement – while it’s never perfect, it’s also rarely “perfectly wrong” – you get consistently decent access patterns.  Then add local caching to catch the hot common read blocks, and local caching of hot or streaming writes.  For H2O, this is one of the reasons for our K/V store; we get full JMM semantics, exact consistency, but also raw speed: 150ns for a cache-hitting read or write.  Typically a cache miss is a few msec (1 network hop there and back).

Map/Reduce: It’s a simple paradigm shown to scale well.  Much of Big Data involves some kind of structure (log files, bit/byte streams, up to well-organized SQL/Hive/DB files).  Map is very generic, and allows an arbitrary function on a unit of work/data to be easily scaled.  Reduce brings Big down to Small in a logarithmic number of steps.  Together, they handle a lot of problems.  Key to making this work: simplicity in the API.  Map converts an A to a B, and Reduce combines two B’s into one B – for any arbitrary A and B.  No resource management, no counting Map or Reduce slots, no shuffle, no keys.  Just a nice clean Map and Reduce.  For H2O this means: Map reads your data directly (type A) and produces results in a plain old Java object (a POJO, type B) – which is also the Map’s “this” pointer.  Results are returned directly in “this”.  “This is Not Your Father’s Map Reduce.”
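
To make that contract concrete, here is a tiny sketch of its shape – not H2O’s actual classes (the real API is walked through below), just the A-to-B / two-B’s-to-one-B idea, with a serial driver loop standing in for the cluster:

abstract class MapReduce<A, B> {
  abstract B map( A unit );             // arbitrary work on one unit of data: A -> B
  abstract B reduce( B left, B right ); // fold two partial results into one: B + B -> B

  // Serial driver just to show the flow; H2O runs the Maps in parallel across
  // the cluster and folds the results in a log-tree.
  B run( Iterable<A> units ) {
    B acc = null;
    for( A a : units ) {
      B b = map(a);
      acc = (acc == null) ? b : reduce(acc, b);
    }
    return acc;
  }
}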

Correct By Default: No multi-threaded access questions.  Synchronization, if needed, is provided already.  No figuring out sharding or data placement; replication (caching) will happen as needed.  No resource management, other than Xmx (Java heap size).  Like sync, resource management is notoriously hard to get right, so we don’t require people to do it.  For 0xdata, this means we use very fine-grained parallelism (no Map is too small; Doug Lea’s Fork/Join), and very fine-grained Reduces (so all Big Data shrinks as rapidly as possible).

Fast By Default: Use of the default Map/Reduce API will produce programs that run in parallel & distributed across the cluster, at memory-bandwidth speeds for both reads and writes.  Other clustered/parallel paradigms are available but are not guaranteed to go fast.  The API has a simple obvious design, and all calls have a “cost model” associated with them (these calls are guaranteed fast, these calls are only fast in these situations, these calls will work but may be slow, etc.).  For 0xdata, code that accesses any number of columns at once (e.g. a single row) – but independent rows – will run at memory-bandwidth speeds.  Same for writing any number of columns, including writing subsets (filtering) on rows.  Reductions will happen every so many Maps in a log-tree fashion.  All filter results are guaranteed to be strongly ordered as well (despite the distributed & parallel execution).

Easy / Multiple APIs – Not all APIs are for all users!  Java & Map/Reduce are good for Java programmers – but people do math in R, Python and a host of other tools.  For 0xdata, we are a team of language implementers as well as mathematicians and systems engineers.  We have implemented a generic REST/JSON API and can drive this API from R, Python, bash, and Excel – with the ability to add more clients easily.  From inside the JVM, we can drive the system using Scala, or a simple REPL with an R-like syntax.

 

Let’s get a little more concrete here, and bring out the jargon –

An H2O Data Taxonomy

Primitives – at the bottom level, the data are a Java primitive – be it a byte, char, long or double.  Or at least that’s the presentation.  Under the hood we compress aggressively, often seeing 2x to 4x more compression than the GZIP’d file on disk – and we can do math on this compressed form typically at memory bandwidth speeds (i.e., the work to decompress is hidden in the overhead of pulling the data in from memory).  We also support the notion of “missing data” – a crucial notion for data scientists.  It’s similar to Double.NaN, but for all data types.

A Chunk – The basic unit of parallel work, typically holding  1e3 to 1e6 (compressed) primitives.  This data unit is completely hidden, unless you are writing batch-Map calls (where the batching is for efficiency).  It’s big enough to hide control overheads when launching parallel tasks, and small enough to be the unit of caching.  We promise that Chunks from Vecs being worked on together will be co-located on the same machine.

A Vec – A Distributed Vector.  Just like a Java array, but it can hold more than 2^31 elements – limited only by available memory.  Usually used to hold data of a single conceptual type, such as a person’s age or IP address or name or last-click-time or balance, etc.  This is the main conceptual holder of Big Data, and collections of these will typically make up all your data.  Access will be parallel & distributed, and at memory-bandwidth speeds.
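
For flavor, here is a hedged serial-read sketch (illustrative only: at(long) and max() are the accessors mentioned in this post; length() and the printStats helper are my own stand-ins, and for real Big Data you’d use the Map/Reduce API described below rather than a serial loop):

  void printStats( Vec vec ) {                 // hypothetical helper, deliberately serial
    double sum = 0;
    for( long i = 0; i < vec.length(); i++ )   // long index: a Vec can exceed 2^31 elements
      sum += vec.at(i);                        // Vec-relative addressing, element by element
    System.out.println("mean="+(sum/vec.length())+"  max="+vec.max());
  }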

A Frame – A Collection of Vecs.  Patterned after R’s data.frame (it’s worked very well for more than 20 years!).  While Vecs might be Big Data (and thus can be expensive to touch all of), Frames are mere pointers to Vecs.  We add & drop columns and reorganize them willy-nilly.  The data-munging side of things has a lot of convenience functions here.

 

The Map/Reduce API

(1) Make a subclass of MRTask2, with POJO Java fields that inherit from Iced, or are primitives, or arrays of either.  Why subclass from Iced?  Because the bytecode weaver will inject code to do a number of things, including serialization & JSON display, and code & loop optimizations.

class Calc extends MRTask2 {

(2) Break out the small-data Inputs to the Map, and initialize them in an instance of your subclass.  “Small data” will be replicated across the cluster, and available as read-only copies everywhere.  Inputs need to be read-only as they will be shared on each node in the cluster.  “Small” needs to fit in memory; my example uses doubles, but megabyte-sized data is cheap & commonly done.

  final double mean;   // Read-only, shared, distributed
  final int maxHisto;  // Read-only, shared, distributed
  Calc(double meanX, int maxHisto) { this.mean = meanX;  this.maxHisto = maxHisto; }

(3) Break out the small-data Outputs from your map, and initialize them in the Map call.  Because they are initialized in the Map, they are guaranteed thread-local.  Because you get a new one for every Map call, they need to be rolled-up in a matching Reduce.

  long histogram[];
  double sumError;
  void map( ... ) {
    histogram = new long[maxHisto]; // New private histogram[] object

(4) Break out the Big Data (inputs or outputs).  This will be passed to a doAll() call, and every Chunk of the Big Data will get a private cloned instance of the Calc object, distributed across the cluster:

new Calc(mean,vec.max()).doAll(myBigVector /*or Frame or Vec[] or ....*/);

(5) Implement your Map.  Here we show a Batching-Map, which typically does a plain “for” loop over all the elements in the Chunk.  Access your data with the “at0” (Chunk-relative addressing) call – which is the fastest accessor but requires a chunk (and takes an “int” index).  Plain Vec-relative addressing is a little slower, but takes a full “long” index: “vec.at(long idx)”.

  void map( Chunk chk ) {
     histogram = new long[maxHisto];
     for( int i=0; i<chk.len; i++ ) {
       histogram[(int)chk.at0(i)]++;
       double err = chk.at0(i)-mean;
       sumError += err*err;
     }
   }

(6) Implement a Reduce for any output fields.  Note that Reduce has a “type B” in the “this” pointer, and is passed a 2nd one:

  void reduce( Calc that ) {
     sumError += that.sumError;
     // Add the array elements with a simple for-loop... we use this
     // simple utility.
     histogram = ArrayUtils.add(histogram,that.histogram);
   }

(7) That’s it!  Results are in your original Calc object:

Calc results = new Calc(mean,vec.max()).doAll(myBigVector);
System.out.println(results.sumError+" "+Arrays.toString(results.histogram));

The Rest of the Story

You have to get the data in there – and we’ll import from HDFS, S3, NFS, local disk, or through your browser’s upload.  You can drive data upload from Java, but more typically from R, Python, REST/JSON, or Excel.  Same for outputting Big Data results: we’ll write back Big Data to any store, while being driven by any of the above languages.  If you build a predictive model, you’ll want to eventually use the model in production.  You can use it in-memory as-is, scoring new datasets on the model – for example, constantly streaming new data through the model while at the same time constantly churning out new models to be streamed through.  Or you can get a Java version of any model suitable for dropping into your production environment.

And that’s the end of my whirlwind tour of the H2O Distributed Computing API.  Hope you like it!

Comments & suggestions welcome.

Cliff

TCP is UNreliable

Been too long between blogs…

“TCP Is Not Reliable” – what’s THAT mean?

Means: I can cause TCP to reliably fail in under 5 mins, on at least 2 different modern Linux variants and on modern hardware, both in our datacenter (no hypervisor) and on EC2.

What does “fail” mean?  Means the client will open a socket to the server, write a bunch of stuff and close the socket – with no errors of any sort.  All standard blocking calls.  The server will get no information of any sort that a connection was attempted.  Let me repeat that: neither client nor server get ANY errors of any kind, the client gets told he opened/wrote/closed a connection, and the server gets no connection attempt, nor any data, nor any errors.  It’s exactly “as if” the client’s open/write/close was thrown in the bit-bucket.

We’d been having these rare failures under heavy load where it was looking like a dropped RPC call.  H2O has its own RPC mechanism, built over the RUDP layer (see all the task-tracking code in the H2ONode class).  Integrating the two layers gives a lot of savings in network traffic; most small-data remote calls (e.g. nearly all the control logic) require exactly 1 UDP packet to start the call, and 1 UDP packet with the response.  For large-data calls (i.e., moving a 4Meg “chunk” of data between nodes) we use TCP – mostly for its flow-control & congestion-control.  Since TCP is also reliable, we bypassed the Reliability part of the RUDP.  If you look in the code, the AutoBuffer class lazily decides between UDP or TCP send styles, based on the amount of data to send.  The TCP stuff used to just open a socket, send the data & close.

So as I was saying, we’d have these rare failures under heavy load that looked like a dropped TCP connection (it was hitting the same asserts as dropping a UDP packet, except we had dropped-UDP-packet recovery code in there and working forever).  Finally Kevin, our systems hacker, got a reliable setup (reliably failing?) – it was an H2O parse of a large CSV dataset into a 5-node cluster… then a 4-node cluster, then a 3-node cluster.  I kept adding asserts, and he kept shrinking the test setup, but still nothing seemed obvious – except that obviously during the parse we’d inhale a lot of data, ship it around our 3-node cluster with lots of TCP connections, and then *bang*, an assert would trip about missing some data.

Occam’s Razor dictated we look at the layers below the Java code – the JVM, the native code, the OS layers – but these are typically very opaque.  The network packets, however, are easily visible with Wireshark-style tools.  So we logged every packet.  It took another few days of hard work, but Kevin triumphantly presented me with a Wireshark log bracketing the Java failure… and there it was in the log: a broken TCP connection.  We stared harder.

In all these failures the common theme is that the receiver is very heavily loaded, with many hundreds of short-lived TCP connections being opened/read/closed every second from many other machines.  The sender sends a ‘SYN’ packet, requesting a connection.  The sender (optimistically) sends 1 data packet; optimistic because the receiver has yet to acknowledge the SYN packet.  The receiver, being much overloaded, is very slow.  Eventually the receiver returns a ‘SYN-ACK’ packet, acknowledging both the open and the data packet.  At this point the receiver’s JVM has not been told about the open connection; this work is all happening at the OS layer alone.  The sender, being done, sends a ‘FIN’ for which it does NOT wait for acknowledgement (all data has already been acknowledged).  The receiver, being heavily overloaded, eventually times out internally (probably waiting for the JVM to accept the open call, the JVM being too overloaded to get around to it) – and sends a RST (reset) packet back… wiping out the connection and the data.  The sender, however, has moved on – it already sent a FIN & closed the socket, so the RST is for a closed connection.  Net result: the sender sent, but the receiver reset the connection without informing either the JVM process or the sender.

Kevin crawled the Linux kernel code, looking at places where connections get reset.  There are too many to tell which exact path we triggered, but it is *possible* (not confirmed) that Linux decided it was the subject of a DDOS attack and started closing open-but-not-accepted TCP connections.  There are knobs in Linux you can tweak here, and we did – and could make the problem go away, or be much harder to reproduce.

With the bug root-caused in the OS, we started looking at our options for fixing the situation.  Asking our clients to either upgrade their kernels or use kernel-level network tweaks was not in the cards.  We ended up implementing two fixes: (1) we moved the TCP connection parts into the existing Reliability layer built over UDP.  Basically, we have an application-level timeout and acknowledgement for TCP connections, and will retry TCP connections as needed.  With this in place, the H2O crash goes away (although if the code triggers, we log it and use app-level congestion-delay logic).  And (2) we multiplex our TCP connections, so the rate of “open TCPs/sec” has dropped to 1 or 2 – and with this 2nd fix in place we never see the first issue.
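
For flavor, here is a minimal sketch of the idea behind fix (1) – not the actual H2O code, and the class and method names are made up: length-prefix the payload, wait for an application-level ACK byte from the receiver, and retry the whole connection if the ACK never arrives:

import java.io.*;
import java.net.*;

class ReliableSend {
  static boolean send( InetSocketAddress dst, byte[] payload, int retries, int timeoutMs ) {
    for( int attempt = 0; attempt < retries; attempt++ ) {
      try( Socket s = new Socket() ) {
        s.connect(dst, timeoutMs);
        s.setSoTimeout(timeoutMs);
        DataOutputStream out = new DataOutputStream(s.getOutputStream());
        out.writeInt(payload.length);        // receiver knows how many bytes to expect
        out.write(payload);
        out.flush();
        int ack = s.getInputStream().read(); // app-level ACK byte the receiver writes back
        if( ack == 1 ) return true;          // receiver confirmed it read everything
      } catch( IOException e ) {
        // timeout, reset or refused connection: fall through and retry
      }
    }
    return false;                            // caller can escalate, or apply congestion-delay logic
  }
}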

At this point H2O’s RPC calls are rock-solid, even under extreme loads.

UPDATE:

Found this decent article: http://blog.netherlabs.nl/articles/2009/01/18/the-ultimate-so_linger-page-or-why-is-my-tcp-not-reliable
Basically:

  • It’s a well-known problem, in that many people trip over it and get confused by it
  • The recommended solution is app-level protocol changes (send expected length with data, receiver sends back app-level ACK after reading all expected data). This is frequently not possible (i.e., legacy receiver).
  • Note that setting flags like SO_LINGER are not sufficient
  • There is a Linux-specific workaround (SIOCOUTQ)
  • The “Principle of Least Surprise” is violated: I, at least, am surprised when ‘write / close’ does not block on the ‘close’ until the kernel at the other end promises it can deliver the data to the app.  Probably the remote kernel would need to block the ‘close’ on this side until all the data has been moved into the user-space on that side – which might in turn be blocked by the receiver app’s slow read rate.

Cliff

 

Captain’s Log Days 11-19

Captain’s Log Day 11

It’s another long drive day for us; we’re trying to get from Stone Mountain (near Atlanta) to Harrisburg, PA today – and Chaplin, CT sometime tomorrow.  We’re quite expert at breaking camp by now; it takes maybe an hour to pull up all the sleeping bags and fold all the couches and tables back out, to shower and freshen up, to reload fresh water tanks and dump the other tanks.  We spend another hour in a local Walmart replacing basic supplies and then we’re on the road.

The kids have figured out how to keep themselves busy on the drive.  We’ve got a TV and a Wii, and some amount of reading.  There’s singing and tickle fights, and lots of napping.  There’s food-making and grumbling about dish cleanup.  We camp out in the middle of Pennsylvania.  We pass the 3500 miles traveled mark, the 1/2-way point.

Captain’s Log Day 12

We break camp at daylight without waking the kids, and drive maybe two hours before the kids bother to roll out of bed.  RV “camping” is a real trick.  We make it around New York with only 1 truly crazy driver incident; a bright red pickup truck came blazing up the left side and was clearly out of room to pass us, but did so anyways.  He sliced across at a 45-degree angle in front of us. Had I not slammed the brakes and swerved we clearly would have hit the truck; and such a hit would have rolled him.

We finally pull into my Uncle Bill’s farm in Connecticut around 4pm.  We settle the RV, then meander down to the river behind the farm, where one of my cousins is RV camping.  We swim in the river, cook burgers on the campfire and sit around visiting until way past dark.

Captain’s Log Day 13

We hang out in the farm all day; some of the kids swim in the river or fish or shoot fireworks off after dark.  I mostly hung out and caught up with the family news.  Shelley & I attended the local church wine-tasting, which was basically a chance to drink a bunch of wines that somebody else bought, and do more catching up on family news.

 

Captain’s Log Day 14

Shelley & I borrow a cousin’s car and drive to Cape Cod for the day.  OMG’s a car is SO much nicer to handle than Nessie!  We take the slow route up the Cape stopping at every tiny town and inlet.  Shelley’s family owned a summer house in Dennis Port 50 or 60 years ago and Shelley was tracing her roots.  We managed to stick our toes in the Atlantic and really unwind.  Shelley & I both like driving, so it’s another really peaceful down day.

 

Captain’s Log Day 15

Up early, we force all the kids to take showers (and change clothes; 2 weeks into vacation and our standards are getting pretty lax) and we hit the road.  Breaking camp is now a pretty standard operation.  By rotating drivers and Shelley driving until the wee hours we make it almost to Indiana.

 

Captain’s Log Day 16

We pull into the University of Illinois at Urbana-Champaign around noon.  I’m giving a talk at 6, and UofI is paying for dinner and 3(!) hotel rooms for us (one for each couple, and one more for the 3 kids).  Real showers for all again!  Yeah!!!  The talk goes really well; it’s my Debugging Data Races talk and it’s a good fit for the summer course on multi-core programming.  Shelley and I manage to sneak a beer afterwards.

 

Captain’s Log Day 17

Again we manage to break camp in short order and do another long day of driving through Illinois, Iowa, and Nebraska.  By now we’ve got a rhythm going; Shelley takes the early morning driving shift while everybody sleeps in, then Luke and I alternate shifts until evening (while Shelley naps), and Shelley takes the late night shift.  I think we’re covering around 800 miles in a day.

 

Captain’s Log Day 18

Today it’s the rest of Nebraska and Wyoming, then Utah.  My Dad manages to call me out in the middle of I-80 nowhere land, to the bemusement of all.  We hit high winds on and off all day.  At least once I was driving with the steering wheel cranked over a full 180 degrees (and was down to 45 mph) just to stay on the road.  18-wheelers would blow by us, knocking us all over the road.  First the bow wave would push us hard to the right, onto the shoulder.  Then the wind-block (and my 180-degree wheel position) would drive us hard back onto the road and into the truck, then the trailing suction would pull us harder into the truck – even as I am cranking the wheel the other way as fast as I can… and then the wind would return.  It was a nerve-wracking drive.  Shelley took over towards evening.  Around 11pm the winds became just undrivable even for her.  I was dozing when suddenly we got slapped hard over, almost off the shoulder.  Even driving at 40mph wasn’t safe.  An exit appeared in the middle of nowhere – even with an RV park (mind you, it’s typically 30 miles between exits *without services*).  We bailed out.  All night long the RV was rocked by winds, like a Giant’s Hand was grabbing the top of Nessie and shaking her like a terrier does a rat.

 

Captain’s Log Day 19

Morning dawns clear and still.  We hit the road again early, as we’ve a long drive today.  It’s a quiet drive through to Reno, and then we hit some really crazy drivers again – a combo of construction zone, short merge lanes and stupidity (outside the RV) nearly crushed a darting micro-car.  The construction on the Donner Pass was perhaps even worse; we managed to get forced into clipping a roadside reflector on the right (less than a foot away from the mountain stone versus pushing an aggressive SUV into the naked concrete on his left).  Finally past all the madness we get to the clear road down from Tahoe and through the Bay Area – but it’s all Homeward Bound on the downhill slide through our home turf!

Home At Last!!!

—-

Some parting stats:

We passed through 22 states (24 for Shelley & I, as we also get to count Rhode Island and Massachusetts).
We drove about 6900 miles.
I bought about $3000 in gas, and $1300 in tires.
We saw 4 close family members in Tucson, 7 in Texas, my brother in Atlanta, and at least 16 in Connecticut (I lost the exact count!).
I did about 20 loads of laundry after returning (the washer ran continuously for 2 days).

Cliff

 

Captain’s Log Days 8, 9 & 10

Captain’s Log Day 8

We’re on the road by 10am, this time a full day’s drive to Montgomery AL from Katy TX.  I forget how big Houston freeways are; at one point I count 9 lanes *in each direction* (18 total lanes!).  I’ve never seen so much concrete.  It’s otherwise mostly uneventful, though.  Traffic is fair to light and the road is good.  We stop at a random lakeside park by Lake Charles for lunch.  It smells of the ocean and has an alligator pond/cage/viewing area.

While I typically encourage the kids to drink a lot (to survive the desert heat & dry), I don’t check on how much they eat, just that they eat a reasonably balanced diet.  So I missed that Matt hadn’t eaten all day, and was constantly staring heads-down at his iPod playing silly flash games.  Well, towards afternoon he starts feeling sick, and near dinner he barfs and refuses to eat or drink anything.  He cannot even keep down tiny bits of bread or Gatorade; the 2nd barf happens on our bedsheets and pillow.  At this point he decides to camp out by the RV toilet and do any more barfing into that (uggh!!!  poor guy!!!!), and we decide to cut it short and look for camping for the night.  By dinner he’s still unable to keep anything down; we grab to-go food from a collection of fast-food joints and keep rolling to the nearest campsite.

We get 1/2 way between Mobile & Montgomery, AL and pull over into a nice full-service RV park.  Shelley & I decide to camp outside in a tent, so Josh can get off the floor (he’s 17 and 6ft tall, lean and flexible… and does not fit in any of the RV pullout/fold-down beds, so he’s been sleeping in the aisle).  We want Josh off the floor so Matt can make an emergency run from his foldout bed to the bathroom without interference.  It’s beastly hot and humid outside, but I figure it will cool off as the night wears on.  Boy was I wrong! It remains 80+ & 80% humidity all night long outside, while the kids were sleeping in air-conditioned luxury.  And we get a late night visit from the camp kitten – he’s adorably cute and caterwauls at us, and starts climbing the tent with his razor claws until Shelley takes him for a walk.  He follows her like a shadow all over the park until she finally has to lock him in the campground bathroom.

 

Captain’s Log Day 9

Finally dawn breaks and we move back into the cool RV air.  Ahhh, blessed relief.  Also, Matt is much better – it’s a common kid 24-hour tummy bug.  I start him back in on the BRAT diet, with sips of water – and now he’s very hungry, a good sign.  He continues to improve throughout the day and is eating normally by dinner.  We pull up camp (we’re getting quite expert at this) and head for Stone Mountain, GA.

Stone Mountain is a giant mountain-sized chunk of granite outside of Atlanta, with a park and a lake.  It’s been carved with a 50ft-high sculpture and has been slowly improved over the years to include many hiking trails, a sky tram system, lots of outdoor adventure activities and an amusement park.  Apparently the “ducks” (amphibious vehicles) are fantastic.  We are going there for the July 4th extravaganza – and as a sign that I’m on vacation, I barely know that today is the 3rd and I’ve no idea what day of the week it is.  We get there about 3pm and check in to a nice RV camp site.

Shelley cooks a fantastic spaghetti dinner.  My brother Eric drives out to camp with us, bringing his best friend’s two small girls (ages 6 & 7) with him (he’s been watching the girls when the parents are working since they were 2 & 3) and we all enjoy a nice picnic dinner.  As the evening rolls on we’re deciding whether or not to see the laser & fireworks show this evening (there’s a bigger one tomorrow) – when the thunderstorm hits.  It’s a real downpour, big lightning and thunder, blowing wind, the works.  We wait that out, and then try to take a walk about the park.  Eric & I, the two girls and my middle two kids walk over to the clubhouse (to check out the water-taxi ride to the main park area) but the rain has other ideas.  We make it to the clubhouse but we’re fairly wet, so we treat the girls to hot chocolate while we dry out.  We wait for the rains to end but it’s no good – the rain has turned into a steady drizzle; we’re just as wet by the time we make it back and there’s no end in sight.  We give up any idea of tent camping or seeing the laser show and settle for watching a Disney movie (The Sword in the Stone) and having a lazy evening with all 10 of us huddled in the RV.  Sleeping arrangements are “cozy” to say the least!  But at least everybody is dry.

 

Captain’s Log Day 10

It’s the 4th of July!  We breakfast, clean up & head over to the water taxi.  The rains have stopped and the sun is out.  It’s gonna be a hot & humid day.  The water taxi is nice; it’s cooler near the lake.  We make it to Stone Mountain’s main attraction area and decide to walk to the bell tower.  The park is already busier than Eric has ever seen it before.  There’s a large Indian family set up under the bells already (and I see more people of the same persuasion walking over to the tower all morning – I think they figured out a cool shady semi-private place to hang out at all day).

We’ve walked maybe a half a mile and it’s not even noon and we’re already soaked with sweat when we make it back to the Plantation Inn.  The Inn isn’t open for lunch (although the AC is nice), but the helpful counter lady tells us there’s RV parking closer in.  We walk up to Memorial Hall.  Immediately two things strike me as really odd: there’s at least 1000 people hanging around looking for food (and more pouring in all the time), it’s 11:30 and *none* of the dozen or so restaurants are open yet – and there’s bus & RV parking open right in front of the main Hall.

I hand the kids my credit card (to get lunch at noon when the restaurants open) and Shelley and I hightail it back to the RV: across 1/2mile of hot trails & roads, ride the water taxi (we miss the one in front of us by literally seconds even with me sprinting across the landing area), and finally the 1/4 mile hike from the taxi dock to the RV.  We pull the hookups as fast as we can and roll out & down the road.  Nessie does NOT sprint, she *proceeds*, but we made her proceed as fast as possible.  We took the short way around the lake, only to discover the road was closed: the attendant at the barricades explains “the road fell in a hole”.  Nothing to be had for it; Shelley makes a 3-pt turn on a narrow park road and we go the long way around.  Finally, a full hour later, we make it back to the bus/RV parking in front of Memorial Hall – and Lo! it’s open.  We take the most premier parking spot in all of Stone Mountain, at noon-thirty on the 4th of July.  (A short time later one other RV takes the next spot, then the road is closed behind us).

The amazing thing about the Stone Mountain concessions was the astronomical price of food; hotdogs: $7-$10, drinks also $7 or so.  (And they denied a hot and hungry horde for at least an hour???)  But finally we all sat down and finished our food and plotted our next move.  Shelley, Eric & I all want a big hike.  Last Christmas Shelley & I hiked the Grand Canyon down to Phantom Ranch and back out in two days, and Eric has hiked both the Pacific Crest Trail and the Appalachian Trail end-to-end.  We head out for the top of Stone Mountain on a hot & muggy day.  There’s lots of other folks with the same idea, but it really is a long hot hike.  Most of my kids bail out after a mile or so, voting to go hang out in the AC (which is really a good plan); Eric and his two young charges make it to the path-up cutoff but it’s a killer hike in the heat so they turn around also.

It ends up as Shelley, Laura (age 15) and I heading on, and we decide to head for the bird sanctuary.  It’s another couple of miles and we gave most of the water to Eric & the girls.  The three of us head down the far side of the mountain to a kids’ playground and finally drag ourselves into the park and help ourselves to the water fountain.  We drink a quart each, and fill a couple more quart bottles we’re carrying.  We hike the 1/2 mile more to the bird sanctuary – mostly carrying on now because of what Shelley would call “Mission” – her ex-Marine training to “complete the Mission” no matter the cost.  I.e., we’re all too collectively silly to admit the end goal is ridiculous, so we hike it anyways.  It’s a decent enough little woodsy trail, with plenty of songbirds – but far too beastly hot to really enjoy.  By the time we make it back to the kids’ park we’ve drunk all our water (another 1/2 gallon between the 3 of us), so we reload (and re-drink our fill) and head back up the mountain to cross it in reverse.  We make it back in good time, although it was really pushing our limits to hike so far on such a hot day.

There is much lounging around and napping in the RV’s AC to wait out the heat of the day.  Matthew (age 12) introduces the two little girls to the joys of Minecraft.  Eric & Laura nap.  Everybody else surfs the (very very slow) park Internet, eating popcorn & chips.  Finally as the heat starts to fade and twilight sets in we get enough gumption to make & eat hotdogs.  Then we pack it up and prepare to leave the relative safety and peace of the RV for the slowly building horde.

The lawn below Memorial Hall faces the giant sculpture carved into the face of Stone Mountain.  The only open spaces are at the very front, so that’s where we head.  I estimate 100,000 people eventually filled that lawn; in any case it was a colossal crowd.  It was also actually quite a peaceful crowd; no rowdies (no alcohol allowed), zillions of little kids running pell-mell, picnic blankets, soap bubble makers and glowing flashing LED lights.  It’s cooler now, so we settle down on our blankets and chairs, listen to the music and wait for the show.  At various times I let Josh or Karen & Luke wander off for snacks (a little nerve-wracking that; they are out of sight in the crowd within seconds and gone for 30mins or more, but everybody returns fine).

The fireworks show starts promptly at 9:30 and is possibly the best I’ve ever seen.  There’s a laser & light show on the mountain, there’s a Civil War tribute, (there’s ads for all of Georgia’s major sports teams), there’s music and of course fireworks.  The actual fireworks were downright amazing; you get a double echo from the Bang! works, one directly and one bounced off the mountain.  They used plenty of the big fireworks and absolutely tons of the rising-sparks kind; the entire mountain was a sheet of fire for minutes at a time.  The finale left us breathless.

Unwinding back to the camp was a slow but uneventful crawl; I’m sure we beat the campers on foot (who had to wait for the river-taxis, and the report was to expect a 2.5 hr wait).  Eric took his two charges home and we collapsed, tired but triumphant, for a full night’s sleep.

Cliff

 

Captain’s Log, Daze 5, 6 & 7

Captain’s Log, Day 5

It’s another early morning drive, this time we’re heading to San Antonio and then on to Luling TX for more relatives.  We’re still marching on through the great desert Southwest, but there are more signs of green now.  Some trees mixed in with the sage, and less cactus.

The ride to Luling is long but uneventful.  We give Luke another turn at the wheel.  The road is calm enough that we let Luke chug on for miles, and then we’re heading into San Antonio.  Suddenly the world is full of crazy drivers!  People are cutting in front of us, or darting around, or force-merging (on short merges) and giving us no space.  Luke brakes as he can, but we’re an 8-ton vehicle!  We take at least twice as far to stop as a car!  We finally make it to a parking lot.

We have a great dinner in San Antonio with Grandpa & Grandma Weiner, and then we have to brave rush-hour traffic.  Shelley takes the helm this time, and a good thing too.  I’ve never seen such craziness.  We watched a pickup 4-wheeling it over the berm to cut traffic (and yes he set the dry grass on fire; we watched the smoke rise for a long time), we had endless numbers of people fight tooth-and-nail to get in front of us, only to switch lanes back a second later when some other lane had a slight advantage.  We have a little sporty thing flash over from left to right, with us doing 60, with less than a foot to spare across our bumper!  It was all over in an instant, and he missed us, but another foot and we woulda crunched him big.  It was a grueling two hours to get out of S.A.  Luling, where we spent the night at Grandpa’s house, was great.  We yakked all night while the kids worked out their cabin fever.  All in all, another fabulous Grandparent visit.

 

Captain’s Log, Day 6

Next morning, crack-o-noon, we headed out for my sister’s place in Katy (really far west Houston).  It’s another straight shot down I-10, and I-10 is in pretty good shape even out to Luling; as we approach Houston it widens to 6 lanes.  We start watching real weather appear; there’s a line of heavy thunderclouds forming up to the left and right of us and we’re heading right for them.  The wind starts to pick up and really buffet us; we slow down to 60 and then slower.  People are starting to park on the side of the road, but we want out of the impending storm.  Rain alternates between slashing and nothing.  The clouds get dark, low and ominous.  I start to see green clouds, and clouds moving the wrong direction.  I pull out Shelley’s “smart phone” and look up the local weather.  Sure enough, with vast modern technology – 4G wifi, low-power Android-enabled cloud-backed internet weather smart-phone tech – we discover what we already know: there’s two large thundercells on either side of I-10.  They happen a lot during south Texas summers as warm wet Gulf air meets cooler midwest air.  And these storm cells often spawn tornadoes.  But after 20 mins of staring at awe-inspiring clouds and getting slammed by 40mph cross-winds we manage to roll through the middle of them and out the other side.  The rest of the trip in is entirely uneventful, except for the trip down memory lane for me.

We get to my sister Ruth’s without incident and my kids rush in to play with her kids.  Then we have a comedy of errors trying to get power run to the RV.  First our old power cable gets hot and the RV power cuts off (which means the AC cuts off on a hot humid Houston summer day).  Then we think the outlet is bad, then we try to test the outlet with an old drill (drill not working), my laptop power supply (cannot see the little blue light in the sun), and finally a real tester (outlet is dead).  We switch outlets, then Aunt Ruth tells me the switch for that outlet is flakey, and it surely is; we quick-cycle the RV AC repeatedly without realizing it, and pop a 15-amp house breaker.  We change outlets again, we change power cords again, we run the new cord through the garage to an internal 20-amp circuit, and finally it holds.  The RV stays well AC’d for the next 2 days.

Grandma’s over (*my* Mom this time) as she lives a few miles from my sister.  And we hang out and visit all day.  There’s wine & lasagna for dinner, and hot showers and full beds for all.

 

Captain’s Log Day 7

We all sleep in late.  We have pancakes & bacon for breakfast.  We run a few errands and then see the movie Brave (which is really good, BTW).  I end up connecting with an old college buddy and her boyfriend (Facebook!) so we invite them over for dinner.  Turns out the boyfriend is also an old college friend, so suddenly it was Texas A&M U reunion night.  They are both divorced with one teenaged daughter each (compared to my 4), and enjoying life again after divorce.  We have a long evening of beer, hotdogs and college memories.  The kids Xbox continuously, or get their internet “fix”, or play on the trampoline, or have drawing contests, or otherwise monkey around.  It’s a really great “down time” lazy day.

Cliff

 

Captain’s Log Daze 3 & 4

Captain’s Log, Day 3

Next day we take a lazy breakfast and then decide to visit the Biosphere with Grandpa.  We head out to Nessie and observe the new tire is looking mighty flat.  Humm…  we hook up the air pump… and it’s at 80 psi, spot-on the normal max pressure.  Looking further: the inside tire is flat.  CRAP.  But wait!  It really WAS fine in Casa Grande, I checked it before we drove off.  And pretty quickly it’s clear that the rubber valve stem is leaking, probably banged too hard during the change, and now it’s going flat overnight.  We call up Ed.  He agrees to change it under warranty… but he doesn’t want to drive out to meet us.  But he Does The Right Thing, and calls GCR Tire, a Tucson local who WILL come out to Grandpa’s.  Ed’s covering the whole cost.  So now we’re basically stuck at Grandpa’s waiting on GCR Tire (who’s promised to get there “in an hour” and it’s already 10am).

Meanwhile something triggered in Shelley’s brain about tires aging, so I go read up on them.  I learned something new today: all tires age.  After 6 years you should replace them, completely independent of tire wear.  Pretty much no tire is expected to last 10 years except under “ideal” circumstances.  And tire manufacturers have to stamp the date of manufacture on the tire, so you can tell how old your tires are.  (Read up on it, but it’s the week & year of manufacture as a 4-digit number in an oval after the “DOT” stamp.)  So we go look at our tires (mostly meaning Shelley crawling under the RV in the 110-degree heat to read between the dualies).  Sure enough, the youngest tire is 8 years old, and the oldest is 12 years old.  Good tread, but expected to blow at any moment.

Crap, crap, crap.  Another round of planning & family voting.  We decide to limp over to Big-O tires to replace the remaining 5 tires, never mind fixing the old one.  GCR Tire shows up for the repair while I’m finishing negotiations with Big-O (and yes I asked GCR and no they did not have the tires we need in stock).  So the GCR guy politely fills our inside tire (it’ll last maybe an hour) and we roll over to Big-O.  We drop everybody off at Costco, where we do some shopping and eat a delicious Costco lunch (which is actually pretty dang cheap and a decent enough hot dog), and wait 2 hours for me to blow another $900 on tires.  After a while we’re back at Grandpa’s house with 6 brand-spanking new tires, waiting for a thunderstorm to pass before we go swimming.  It’s too late for the Biosphere, that will have to wait for another visit.

The thunderstorm takes too long to pass and we miss swimming also.  We have some more family over for a nice dinner, then we hit the road again for more night driving.  This time we’re heading for Carlsbad Caverns.  It’s a long haul out of Tucson but utterly uneventful.  We even give Luke (19 yrs old!) a turn at the wheel.  He’s a natural driver and handles this big rig fine.  We make a long drive of it but Carlsbad is just too far to make in one day.  We end up in the backside parking lot of a Walmart somewhere just inside the Texas border (Walmart mostly has a “RV friendly” policy).  It turns out that while our GPS has many useful features, finding RV campsites is not one of them.  Also when we turn off I-10 and head into the countryside we lose all cell phone service and can’t call ahead.

Captain’s Log, Day 4

It’s a 3hr early-morning drive or so to the Caverns.  We get there just before the heat starts getting oppressive again.  This time we decide to leave the generator on and the AC running while we spend the hot part of the day underground.  I used to see this all the time and wonder about it: RVs with the generator going constantly.  Now I get it – Nessie will be in tolerable shape when we return to her, but without the AC Nessie would heat up like a tin box in the hot sun.

Several of my kids are really nervous about entering the Caverns; they’ve had some scary cave experiences in the past.  We have to gently encourage several down the switchbacks into Carlsbad, but they master their fears and soldier on down into the cool cave air.  Carlsbad does not fail to deliver.  The Caverns are immense on a scale that’s hard to imagine; all of downtown San Jose could comfortably fit in them.  The trails wander on for miles in there (the sections closed to the public are probably 100x larger than the miles of public sections).  There’s a section where the roof soars over 300ft overhead and single rooms covering many acres with lines-of-sight of perhaps a quarter-mile underground.  And it’s all a fairyland of cave growths and little pools, with eerie lighting everywhere; flowing stone sculptures with names like “Temple of the Sun” or “Doll Theater”.  For the younger generation: it’s the largest Minecraft cave you’ll ever see.  🙂

We ride the 800ft (!) elevator lift back to the surface and decide to stay for the evening bat swarm (it’s still too hot to drive).  Every evening at dusk between 250 thousand and a few million bats leave to go eat mega-tons of insects up and down the local rivers (the numbers fluctuate so much because the bats migrate frequently).  We hang out in the local gift shop & cafe for a few hours (always a bad plan when on a budget), then try to watch a movie in Nessie (AC keeps it tolerable in there, but it’s still pretty warm), and finally evening rolls around.  We settle in to listen to the rangers and then finally the main show: 250 thousand bats fly out of the cave like smoke on the wind.  There’s a faint odor of bats in the air, and an endless murmuring of chirping bats, and the little winged creatures are flitting everywhere overhead before flying off the escarpment edge and off into the darkness.

We do another (not so long) night of driving, stopping at midnight in Fort Stockton, TX.  We get a longer night’s sleep tonight, even if the location isn’t as glorious.

Cliff

 

Captain’s Log Days 1 & 2

Captain’s Log, Day 1

Today’s the day for the start of our epic 3-week 7000-mile cross-country RV trip of doom!  I’m up (fairly) early as I need to pick up all my kids – and their extra clothes, toiletries, games, meds, etc – by 9am.  Then I take them back to my house to begin packing in earnest, except for Josh who I need to take to the eye doctor’s to replace his glasses (broke under warranty) and Laura – who left her drawing pad behind.  I also need to drop my ex-Sprint AirWave back at the UPS store, and go by the pharmacy for a month’s worth of meds, and get fresh fruit for the RV, and… and … and … you get the picture.

Meanwhile Shelley is busy doing last-minute packing of Nessie, our 7-ton 31′ Class C RV – all the fruits & veggies & cold stuff go in at the last minute.  While I’m running around frantically driving kids all over creation, Matt figures out he’s got a total of 3 pairs of underwear at my place, so Shelley is out driving him to get some undies (and other stuff we need) while I’m running my errands.  Despite the crazy start and hasty lunches we actually hit the road as planned right at noon.

So on this trip we have: Me & Shelley (a red-head), my eldest daughter Karen, Luke (another red-head), my son Josh, my 2nd daughter Laura (also a red-head but no relation to Shelley) and my youngest, Matt.  We’re off to see the country and all my scattered relations.  I’ve got my Dad (& Jane) living in Tucson AZ, the kids’ other grandpa Zade in Luling TX (outside of San Antonio TX), my sister (Aunt Ruth) and mom (Pat Ireland) in Katy TX (outside of Houston), my brother in Atlanta GA, and my Uncle Bill and his 4 daughters (all my age) and their 15 kids (all my kids’ ages) in eastern Connecticut.

We’re starting out of San Jose, heading over Pacheco Pass to I-5, then south towards LA – but we badly do NOT want to hit LA right at rush hour, so we eventually cut over to Bakersfield and then follow some long long slow farm road across the central valley to Barstow… and up to Calico, a ghost town.

Now when Shelley was a kid, her mom would drive this very road (to visit her grandparents in Vegas) and they would stop by Calico once a year or so.  She has some fond memories from her childhood so visiting Calico is somewhat of a pilgrimage for her.  We arrive there right at dusk and can’t find anybody manning the entrance booth, so we sheepishly drive (our 31′ RV) quietly into the town – and promptly find the RV campground.  It’s basically deserted (there’s 1 other camper there, and space for maybe 100 vehicles), has power hookups and bathrooms with showers and running water… and it’s free, at least for people arriving as late as we did.  We got out, stretched our legs and enjoyed the beautiful pink sunset over the red red hills, made sloppy joes on Nessie’s stove and ate on the picnic tables in picture-perfect weather.  Laura got the neighborhood dogs to howl back at her, Karen & Luke made videos of Laura’s epically blowing hair, Matt climbed the hills and Josh & I ninja-sparred.

It was a picture-perfect ending to the 1st day.

Captain’s Log, Day 2

We walk through Calico the next morning.  It’s cool desert morning air, with some wonderful history.  The town’s been cleaned up a fair amount since Shelley was last there but remains a really nice tourist trap.  Mission accomplished, we head out for the long hot desert drive to Tucson to visit my Dad (Grandpa).  It’s a *long* boring drive down I-40.  Shelley is an awesome long-haul truck-and-horse-trailer driver, so driving this RV thing is a piece of cake.  (And while I’m up getting Shelley a nectarine, Laura types in my blog: “Moo” and “He has yet to notice.”)  Karen is talking about whale sperm shampoo (*not* sperm whale shampoo)… and the generator cuts out – it’s overheated.  That means the main compartment AC cuts out.  Oh – did I mention that on the long uphill grades the cab AC also cuts out?  (I assume because the engine is working too hard?)  So we pressed on in the 110-degree heat, across I-40, down “highway” 95 (looks like asphalt thinly spread over desert dirt; there’s a whole lotta “dips in road”).  Back on I-8 and heading west, 2hrs out of Tucson, and we’re all baking on-and-off (as the cab AC cuts in and out, and the cabin is slowly climbing above 90 degrees)… when we blow a tire.

Yup, 20 miles from Nowhere, AZ, down that long & lonely road… we suddenly pick up a shake & shimmy… and a list.  We hove Nessie over to port and off the side of the road.  I tremulously step out to survey the damage.  The outside right rear tire has blown big, completely come apart.  It’s 1 of a dually, and the other is squashed under the load but holding.  Time for some quick thinking; we are baking and a long way from anywhere… and lame.  We check the phones: we have cell service; Thank You T-Mobile.  We call AAA.  They don’t do RV tires but they do give us the number of RV Medic in Casa Grande… which isn’t open after hours.  We get the answering machine & another number to call… also no answer.  So now we’re calling all about (at least 3 phones making active calls at this time, plus Google Maps is in hot action).  We decide to limp into Casa Grande.  We dump the tanks (not the black!) and push the kids over to the “good” side to lighten the load.  We also batten down the hatches, as Shelley points out that if the remaining tire blows we’ll “drop hard”.  Casa Grande is about 20 miles down the road, and we decide that 40 mph is probably a good max speed, so we start off.

Then the dust storm hits.  NO, I AM NOT KIDDING.  We’re lamely limping along when the wall of dust hits, obliterating the “Blinding Dust Storms” road sign.  So now we’re limping blindly along getting buffeted by 40mph winds and dust (and tumbleweeds ARE blowing by; cue lonely wild-west music please) when the rain hits.  Yes: thick dust on our windshield AND IT’S RAINING NOW WITH THE BLOWING DUST.  Nessie soldiers on.  20 min later we pull off I-8 and out of the storm and head down some lonely farm road… but with the lights of Casa Grande clearly in the distance.  We pull into the first big lot we see (Big Tires’ empty lot), step out and see a rainbow.  Back around to calling RV Medic, we get a human, who tells me to call Ed W’s, who DOES do after-hours work.  $200 minimum charge.  Ed (who also requires 3 or 4 phone calls to reach) promises he can work on us, but can we get to town?  No problem.

While we wait at Ed’s shop for an hour (his mobile guy is on another call), Grandpa & Grandma drive up from Tucson and take the 3 younger kids back to their place and feed them all manner of treats.  We 4 remaining “older kids” mosey over to a nearby restaurant and get dinner and some heat relief.  Another hour later and I’m $400 poorer and sporting a brand-new tire.  We pile in and make it to Grandpa’s.  Many sighs of relief, and a good night’s sleep was had by all.

Cliff

 

Progress + Vacation

It’s been a freak’n month since I last blogged!  Where’s the time gone???

Mostly I’ve been furiously coding.  ‘wc *java’ of our ‘src’ directory now reports 31500 lines.  We’ve cleaned up and CSS’d the web interface.  We added LevelDB to handle zillions of small K/V pairs (larger ones go to the local file system directly, and of course we still handle S3 and HDFS natively (either using an existing Hadoop install, or directly *being* a distributed Hadoop)).  We’re still 100% peer-to-peer, even for the direct HDFS stuff.  Last week I hacked a concurrent Patricia Trie (leaving the making of a *distributed* concurrent Trie for later, but now I know how to do it…).  Then we ran all 36Gig of Wikipedia data through WordCount, using that Trie – it took less than an hour on 1 node.

This week it’s about running a Linear Regression *distributed*, using distributed Fork/Join as the programming paradigm.  Also, integrating a HashMap-in-a-Value (so we can pass about & maintain the Map interface in the Value piece of our K/V store – think: distributed JS objects), plus the final bits of VectorClocks (all behind the scenes; the VCs will let us do atomic update and strong coherence of Keys but they’re a horrible API to expose).  We’re building a toolkit approach to solving the problem of building a reasonable database over the Cloud.  Either (distributed) Patricia Tries or (distributed) Concurrent Skip Lists for range queries, plus JS-like objects in Values, plus atomic (transactional) update of individual JS objects using a Compare-And-Swap like approach (instead of locking: CAS is much faster under load, as threads can optimistically make progress).
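
To give a flavor of that CAS-style update, here is a minimal sketch (not the 0xdata code; a copy-on-write HashMap stands in for the JS-like object living in a Value); the loser of a race simply re-reads and retries, with no locks held:

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

class CasUpdate {
  static final AtomicReference<Map<String,Object>> VALUE =
      new AtomicReference<Map<String,Object>>(new HashMap<String,Object>());

  static void put( String field, Object v ) {
    while( true ) {
      Map<String,Object> cur  = VALUE.get();
      Map<String,Object> next = new HashMap<String,Object>(cur); // private updated copy
      next.put(field, v);
      if( VALUE.compareAndSet(cur, next) ) return; // atomically installed; done
      // else somebody else won the race: loop, re-read and retry
    }
  }
}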

More on all of the above later this week – as we have a hard deadline to finally *open* our Open Source project.  Yeah, yeah, yeah, I’ve been hassled plenty about calling ourselves Open Source and not (yet) having any open source… we’ve been trying to get the basics done first… but the real news:  I’m finally going on Vacation!!!

Yes, Nessie, the 31′ 7-ton Class C RV of Doom, is being prepared for our 7000-mile Epic Cross-Country Journey.  I’ve been wanting to do this for a decade now: take the entire clan (7 of us!) across country, touring all the junk tourist traps we can and visiting our scattered family as we go.  We’ve got family in Tucson AZ, San Antonio TX (well, Luling really), Houston, Atlanta, the DC area, and Connecticut.  I’m giving an invited lecture at UIUC on our way back, and have been assured I can use that lecture as a reason to declare this a “business trip” and deduct all the gas and mileage costs – I figure about $3500 in gas alone.  We’re stopping at Stone Mountain in GA over the 4th of July, visiting my brother and camping at the lakeside facing the mountain, where we’ll watch the fireworks and laser show from the RV roof.  We’re going to visit Carlsbad Caverns.  We’ll pass through DC and maybe attempt the Smithsonian (not sure about that one; depends on the schedule and how badly I want to fight the RV through DC traffic).  We’re visiting my Uncle’s classic family farm in Connecticut where my 4 cousins live – all my age, all married with 3 to 4 kids each… all about the same age as my 4 kids.  We’re talking now about 15 to 20 nieces and nephews, plus Aunts & Uncles galore, and of course pigs and chickens and horses.  It’ll be a regular zoo.

So if you see a large white whale heading east on I-10 with a frazzled Shelley or my excited 19-yr-old at the helm, honk, wave Hi and give us a wide berth…

Cliff

 

Quote(s) of the Month from Kevin Normoyle (Sun/Sparc & Azul L2 Cache Designer Extraordinaire, Cache Coherence Advisor to 0xdata):

Reminds me of CS101, on one of my first programs.  The grader wrote in big red letters over my big comment block:
“Don’t document your bugs, Fix them”

So I asked Kevin if I could quote him, and I got this response back:

ah that’s fine… I spout “Advice” left and right to everyone… Many dismiss it as “Rant”.  There’s always that fine line between being a Prophet, and just another crazy guy standing on the corner yelling.  One could argue that everyone who ever posts to Twitter is an “Advisor” of some sort, to the world.

Sound advice, from a (reluctant) adviser to the world.

http://www.cs.tau.ac.il/~shanir/nir-pubs-web/Papers/OPODIS2006-BA.pdf

The D3 Bomb

The Diablo3 bomb blew through my house this week, destroying work schedules left and right. Every kid (& Dad) played hours of D3.  OMG’s, I can remember D1 – way back in ’96 before the Diablo’s were numbered.  I must be older than dirt.  Also, being CTO of 0xdata means a zillion customer visits last week (thanks to our plugged-in CEO Sri).  Git claims 600 lines of code from me, down from my weekly average of 3000… blah.  Coding is good for me, I need to do more!

Meanwhile, work at 0xdata is actually proceeding really well despite my lackluster week.  We’re reading & writing HDFS natively.  As I write this, we’re now able to read & write S3.  We’ve got the semantics and design of what is basically the Java Memory Model ironed out for the Cloud (although the implementation is still being worked on).  We’re starting to launch Paxos-based H2O clouds in Amazon EC2.  We’re running larger test suites.

What little coding I did was related to making Key-delete work right.  The issue is racing Puts followed by Deletes, and delivering a strongly consistent answer when UDP packets are getting lost or re-ordered.  A late-arriving Put cannot “resurrect” a deleted Key, and that requires keeping some VectorClock smarts on the deleted Key, instead of just removing all knowledge of the Key.
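
A hedged sketch of that idea (not the actual H2O code; VectorClock and its happensBefore() helper are assumed here): a Delete leaves a tombstone that keeps the delete’s clock, and a late-arriving Put is dropped if it happened-before that delete:

interface VectorClock { boolean happensBefore( VectorClock other ); } // assumed helper

class KeySlot {
  Object      value;  // null means "deleted" (tombstone), but we keep the clock
  VectorClock clock;  // clock of the last write we accepted (Put or Delete)

  synchronized void onDelete( VectorClock deleteClock ) {
    if( clock == null || clock.happensBefore(deleteClock) ) {
      value = null;        // tombstone: drop the value, remember the delete's clock
      clock = deleteClock;
    }
  }

  synchronized void onPut( Object newValue, VectorClock putClock ) {
    if( clock != null && putClock.happensBefore(clock) ) return; // stale Put: do NOT resurrect
    value = newValue;
    clock = putClock;
  }
}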

We’ve got the Git repo opened up to a handful of people and we’re debating when to open it fully.  I’m voting for “wait a little longer”; in particular I want to iron out the design of the execution engine more.  I.e., “word count” on HDFS should not just run fast & well, it should look good also.  I might get overruled on the timing of this, but in any case look for our Git to open up “soon” – some weeks or less.

In other news, I got my $500 deductible returned to me from AllState (which they got from the other driver’s insurance).  We sold my fiancée’s junker car and upgraded her to a car with only 70K miles (down from 225K miles!  The unkillable Nissan Maxima’s brakes finally failed).  I switched the family over from Sprint to T-Mobile – it’s a better family plan (for me anyways), and that means I finally upgraded my antique phone… to another antique!  Yes!  I managed to dodge the smart-phone brain-drain that’s got all my colleagues, one more time.  🙂

Cliff

 

What’s Going On?

As alluded to in my last blog, here’s my fun hack du jour: “What’s going on?”

I’ve got a multi-node setup with UDP packets slinging back and forth, and each node itself is a multi-CPU machine.  UDP packets are sliding by one another, or getting dropped on the floor, or otherwise confused.  I’m in a twisty maze of UDP packets, all alike (yes, I played the game back in the day).  Then something crashes, and pretty quickly the network is filled with damage-control packets, repair & retry packets, and millions more mirror-reflection packets.  What just happened?  I press my handy little button and…

… a broadcast of “dump, ship and die” hits the wires (a few extra times for good measure).  All my busy Nodes stop their endless chatter and dump the last several seconds of packets towards my laptop, slowly & reliably, via TCP.  Each node has been gathering all the packets sent or received (well, the first 16 bytes of each) in a giant ring buffer, along with time-stamp info and the other party involved.  After I ship all this data from every node to the one poor victim (that I pressed my button on), every other node dies (to prevent further damage).
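
The recorder itself is simple; here’s a hedged sketch of its shape (not the real code – the names are made up): a per-node ring buffer holding a timestamp, the peer, and the first 16 bytes of each packet, with old entries silently overwritten:

import java.util.Arrays;

class PacketRing {
  static final int RING = 1 << 16;         // power-of-two capacity for cheap masking
  final long[]   nanos = new long[RING];   // local clock at capture time
  final String[] peer  = new String[RING]; // the other party on the wire
  final byte[][] head  = new byte[RING][]; // first 16 bytes of each packet
  private int idx;                         // next slot to overwrite

  synchronized void record( String peerAddr, byte[] pkt ) {
    int i = (idx++) & (RING-1);
    nanos[i] = System.nanoTime();
    peer [i] = peerAddr;
    head [i] = Arrays.copyOf(pkt, Math.min(16, pkt.length));
  }
}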

The Last Survivor gathers up a bunch of very large UDP packet dumps and starts sorting them.  Of course, you can’t just sort on time, that would be too easy.  No, all the nodes are running with independent clocks; NTP only gets them so close in time to each other.  Instead I have to sort out a giant Happens-Before relationship amongst my packets.  I am helped (above and beyond some sort of home-brew Wireshark) by my application understanding its own packet structure.  I *know* certain packets must be strongly ordered in time, never mind what the clock says.  For example, I only send out an ACK for task#1273 strictly after I receive (and execute) task#1273.  Paxos voting protocols follow certain rules, etc, etc.

In the end, I build a very large mostly-correctly-ordered timeline of what was just going on, as seen by each Node itself, and then HTML’ify it and pop it up on the browser.  Voila!  There for all the world to see is the blow-by-blow confusion of what went wrong (and generally, the follow-on error “recovery” isn’t all that healthy, so more broken behavior follows hard on the heels of broken behavior).

Basically, I’m admitting I’m a tool-builder at heart.  As soon as I realized that standard debuggers don’t work in this kind of situation, and Wireshark couldn’t sort based on domain-specific info (and pretty-print the results, again using domain-specific smarts), I went into tool-building mode.  As of this blog, I’ve found several errors in my cloud setup already; e.g. a useless abort-and-restart of a Paxos vote if a heartbeat arrives mid-vote from an ex-cloud-member (that’s alive and well and wants to get back in the Cloud), and some infinite-chatter issues getting key replication settled out as nodes come and go.

On other fronts, my car came back from the body shop, only to turn around and go back to the engine shop: the timing belt had slipped.  The work was done under warranty and I’ll go pick up my car on Monday.  I can hardly wait!!!

My GF’s car’s brakes have been squealing for weeks; they finally started shuddering and we decided it was time to fix them.  She’s driving a 1993 Nissan Maxima with 220K miles on it; weird things start breaking at that age, but mostly the car just soldiers on.  But it was time for the brakes.  We pulled the rear pads & looked at the rotors: one of them was shot.  Fortunately a new rear rotor was only $25, plus another $22 for pads (tax, brake grease, still under $50).  We couldn’t get the dang pistons to move back!  We tried at least 5 different wrench/jig/clamp combos to no avail.

We figured the pistons must have been jammed with debris, so with great trepidation we pulled the brake fluid line and the emergency brake cable and pulled the whole unit to my workbench.  I popped the piston out manually.  It looked clean and good… and had this funny thing in the middle… stupid me, failed to check the internet again… it’s the anti-slip mechanism for the emergency brake.  You have to *spin* the piston to screw it back into the cylinder.  Sigh.  It took us another 1/2 hr to find the right tool to spin the dang thing, but it finally went in without too much trouble.  After that it was another hour to reassemble all the parts, and then we had to bleed and bleed and bleed the line.  As of this writing, the pedal is still too soft; I suspect we need to bleed it some more.

My daughter is at the Old Salts Regatta; that, plus a ton of driving to meet people for 0xdata, plus a much-needed dinner out… and being down 2 cars (GF’s brakes-in-progress and my car in the shop), made for a very complicated week.

Cliff