The D3 Bomb

The Diablo3 bomb blew through my house this week, destroying work schedules left and right. Every kid (& Dad) played hours of D3.  OMG, I can remember D1 – way back in ’96 before the Diablos were numbered.  I must be older than dirt.  Also, being CTO of 0xdata means a zillion customer visits last week (thanks to our plugged-in CEO Sri).  Git claims 600 lines of code from me, down from my weekly average of 3000… blah.  Coding is good for me, I need to do more!

Meanwhile, work at 0xdata is actually proceeding really well despite my lackluster week.  We’re reading & writing HDFS natively.  As I write this, we’re now able to read & write S3.  We’ve got the semantics and design of what is basically the Java Memory Model ironed out for the Cloud (although the implementation is still being worked on).  We’re starting to launch Paxos-based H2O clouds in Amazon EC2.  We’re running larger test suites.

What little coding I did was related to making Key-delete work right.  The issue is racing Puts followed by Deletes, and delivering a strongly consistent answer when UDP packets are getting lost or re-ordered.  A late-arriving Put cannot “resurrect” a deleted Key, and that requires keeping some VectorClock smarts on the deleted Key, instead of just removing all knowledge of the Key.
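To make the "no resurrection" rule concrete, here is a minimal sketch in Python. It is not the actual 0xdata code: the names `Store`, `TOMBSTONE`, `dominates` and `merge` are made up for illustration, a vector clock is assumed to be a plain dict of node name to counter, and concurrent (incomparable) updates are resolved by simply letting the incoming one win, which a real system would need to handle more carefully.

```python
TOMBSTONE = object()  # hypothetical marker kept in place of a deleted value

def dominates(a, b):
    """True if vector clock a has already seen everything in clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def merge(a, b):
    """Element-wise max of two vector clocks."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}

class Store:
    def __init__(self):
        self.entries = {}  # key -> (value or TOMBSTONE, vector clock)

    def apply(self, key, value, clock):
        """Apply a Put (or a Delete, when value is TOMBSTONE) as it arrives.

        Returns True if the store changed, False if the update was stale.
        """
        old = self.entries.get(key)
        if old is not None and dominates(old[1], clock):
            return False  # late or duplicate packet: ignore it
        new_clock = clock if old is None else merge(old[1], clock)
        self.entries[key] = (value, new_clock)
        return True

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None or entry[0] is TOMBSTONE:
            return None  # never written, or deleted
        return entry[0]
```

The point of the sketch is that a Delete leaves a tombstone carrying its clock behind. A Put that arrives later but carries an older clock is compared against that tombstone and dropped, where a store that forgot the Key entirely would have happily accepted it.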

We’ve got the Git repo opened up to a handful of people and we’re debating when to open it fully.  I’m voting for “wait a little longer”; in particular I want to iron out the design of the execution engine more.  I.e., “word count” on HDFS should not just run fast & well, it should look good also.  I might get overruled on the timing of this, but in any case look for our Git to open up “soon” – some weeks or less.

In other news, I got my $500 deductible returned to me from Allstate (which they got from the other drivers’ insurance).  We sold my fiancée’s junker car and upgraded her to a car with only 70K miles (down from 225K miles!  The unkillable Nissan Maxima’s brakes finally failed).  I switched the family over from Sprint to T-Mobile – it’s a better family plan (for me anyways), and that means I finally upgraded my antique phone… to another antique!  Yes!  I managed to dodge the smart-phone brain-drain that’s got all my colleagues one more time.  🙂

Cliff

 

What’s Going On?

As alluded to in my last blog, here’s my fun hack du jour: “What’s going on?”

I’ve got a multi-node setup with UDP packets slinging back and forth, and each node itself is a multi-cpu machine.  UDP packets are sliding by one another, or getting dropped on the floor, or otherwise confused.  I’m in a twisty maze of UDP packets all alike (yes, I played the game back in the day).  Then something crashes, and pretty quickly the network is filled with damage-control packets, repair & retry packets, more infinite millions of mirror reflection packets.  What just happened?  I press my handy little button and…

… a broadcast of “dump, ship and die” hits the wires (a few extra times for good measure).  All my busy Nodes stop their endless chatter and dump the last several seconds of packets towards my laptop, slowly & reliably, via TCP.  Each node has been gathering all the packets it sent or received (well, the first 16 bytes of each) in a giant ring-buffer, along with time-stamp info and the other party involved.  After every node has shipped all this data to the one poor victim (the one I pressed my button on), every other node dies (to prevent further damage).
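The per-node recorder described above can be sketched roughly as follows. This is an illustration in Python, not the real implementation: the class name `PacketRing`, the default capacity and the record layout are assumptions, and the sketch is not thread-safe, which the real thing would have to be on a multi-cpu node.

```python
import time

class PacketRing:
    """Fixed-size ring buffer of (timestamp, peer, sent?, first bytes of packet)."""

    def __init__(self, capacity=1 << 16, prefix_len=16):
        self.capacity = capacity
        self.prefix_len = prefix_len   # only the first 16 bytes are kept by default
        self.slots = [None] * capacity
        self.count = 0                 # total packets ever recorded

    def record(self, peer, sent, payload, now=None):
        """Remember one packet; the oldest record is overwritten once full."""
        ts = time.time_ns() if now is None else now
        prefix = bytes(payload[:self.prefix_len])
        self.slots[self.count % self.capacity] = (ts, peer, sent, prefix)
        self.count += 1

    def dump(self):
        """Return the surviving records, oldest first, ready to ship over TCP."""
        start = max(0, self.count - self.capacity)
        return [self.slots[i % self.capacity] for i in range(start, self.count)]
```

Recording stays cheap because each packet costs one slot write, and the fixed capacity means the buffer always holds just the most recent traffic, which is exactly what is wanted when the "dump, ship and die" broadcast arrives.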

The Last Survivor gathers up a bunch of very large UDP packet dumps and starts sorting them.  Of course, you can’t just sort on time, that would be too easy.  No, all the nodes are running with independent clocks; NTP only gets them so close in time to each other.  Instead I have to sort out a giant Happens-Before relationship amongst my packets.  I am helped (above and beyond some sort of home-brew wire-shark) by my application understanding its own packet structure.  I *know* certain packets must be strongly ordered in time, never mind what the clock says.  For example, I only send out an ACK for task#1273 strictly after I receive (and execute) task#1273.  Paxos voting protocols follow certain rules, etc, etc.
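One way to picture the sort: treat every "must come before" fact as an edge in a graph and do a topological sort, falling back on the untrusted local clocks only to break ties. The Python below is a sketch under that assumption and not the actual tool; the function name `order_events` and the event ids are invented for the example.

```python
import heapq
from collections import defaultdict

def order_events(events, before):
    """Order events so every known happens-before pair is respected.

    events: dict of event id -> local timestamp (used for tie-breaks only)
    before: iterable of (a, b) pairs meaning a must come before b
    """
    successors = defaultdict(list)
    pending = {e: 0 for e in events}  # unmet predecessors per event
    for a, b in before:
        successors[a].append(b)
        pending[b] += 1
    # Events with no unmet predecessors are ready; earliest clock goes first.
    ready = [(ts, e) for e, ts in events.items() if pending[e] == 0]
    heapq.heapify(ready)
    timeline = []
    while ready:
        _, e = heapq.heappop(ready)
        timeline.append(e)
        for nxt in successors[e]:
            pending[nxt] -= 1
            if pending[nxt] == 0:
                heapq.heappush(ready, (events[nxt], nxt))
    if len(timeline) != len(events):
        raise ValueError("happens-before constraints contain a cycle")
    return timeline

# Node B's clock runs behind node A's, so a plain sort on time would
# show B receiving the task before A ever sent it.
events = {"A:send task": 100, "B:recv task": 90,
          "B:send ack": 95, "A:recv ack": 110}
before = [("A:send task", "B:recv task"),   # a send precedes its receive
          ("B:recv task", "B:send ack"),    # ACK only after receiving the task
          ("B:send ack", "A:recv ack"),
          ("A:send task", "A:recv ack")]    # A's own local order
timeline = order_events(events, before)
```

Here `timeline` starts with A's send even though its timestamp is later than both of B's events, because the constraints outrank the clocks. Domain rules such as the ACK ordering or the Paxos voting rules would simply contribute more pairs to `before`.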

In the end, I build a very large mostly-correctly-ordered timeline of what was just going on, as seen by each Node itself, and then HTML’ify it and pop it up on the browser.  Voila!  There for all the world to see is the blow-by-blow confusion of what went wrong (and generally, the follow-on error “recovery” isn’t all that healthy, so more broken behavior follows hard on the heels of broken behavior).

Basically, I’m admitting I’m a tool-builder at heart.  As soon as I realized that standard debuggers don’t work in this kind of situation, and Wireshark couldn’t sort based on domain-specific info (and pretty-print the results, again using domain-specific smarts), I went into tool-building mode.  As of this blog, I’ve found several errors in my cloud setup already; e.g. a useless abort-and-restart of a Paxos vote if a heartbeat arrives mid-vote from an ex-cloud-member (that’s alive and well and wants to get back in the Cloud), and some infinite-chatter issues getting key replication settled out as nodes come and go.

On other fronts, my car came back from the body shop, only to turn around and go back to the engine shop: the timing belt had slipped.  The work was done under warranty and I’ll go pick up my car on Monday.  I can hardly wait!!!

My GF’s car’s brakes have been squealing for weeks; they finally started shuddering and we decided it was time to fix them.  She’s driving a 1993 Nissan Maxima with 220K miles on it; weird things start breaking at that age, but mostly the car just soldiers on.  But it was time for the brakes.  We pulled the rear pads & looked at the rotors: one of them was shot.  Fortunately a new rear rotor was only $25, plus another $22 for pads (tax, brake grease, still under $50).  We couldn’t get the dang pistons to move back! We tried at least 5 different wrench/jig/clamp combos to no avail.

We figured the pistons must have been jammed with debris, so with great trepidation we pulled the brake fluid line, the emergency brake cable and pulled the whole unit to my workbench.  I popped the piston out manually.  It looked clean and good… and had this funny thing in the middle… stupid me, failed to check the internet again… it’s the anti-slip mechanism for the emergency brake.  You have to *spin* the piston to screw it back into the cylinder.  Sigh.  It took us another 1/2hr to find the right tool to spin the dang thing, but it finally went in without too much trouble.  After that it was another hour to reassemble all the parts, and then we had to bleed and bleed and bleed the line.  As of this writing, the pedal is still too soft; I suspect we need to bleed it some more.

Daughter is at the Old Salts Regatta, plus a ton of driving to meet people for 0xdata, plus a much needed dinner out… and down 2 cars (GF’s brakes-in-progress and my car in the shop), made for a very complicated week.

Cliff