How does Java Both Optimize Hot Loops and Allow Debugging

This blog came about because an old friend is trying to figure out how Java can do aggressive loop and inlining optimizations, while allowing the loading of new code and setting breakpoints… in that optimized code.

On 2/21/2015 11:04 AM, IG wrote:

Safepoint. I’m still confused because I don’t understand what is the required state at a safepoint that can’t be ensured at any random point of execution.  What is in-flight that must be permitted to complete? Given a cycle-by-cycle map you can find all the pointers at *any* cycle – is safepointing just a way to reduce the volume of pointer-locating info?

Reducing OOP-map volume happens, is useful – but is not the main thing being achieved.

It’s flipping backwards from the machine state to the language architected state, unwinding from super-aggressive optimization to reach the theoretical Java (or C/C++) language state at some points.  Toooo expensive to preserve that mapping line-by-line, call-by-call, execution-point-by-execution-point. So you keep the mapping at a few places, and optimize the bejeebers out of things in-between.  If somebody wants to:

  • set a breakpoint in optimized code (yes, works for Java all the time)
  • inject new classes, new targets for some virtual-call

then you need to stop the physical machine (actual CPU/Mill) at some place where you can find all the language architected pieces.  This mapping is far more complex than the oop/no-oop mapping for GC… but that complexity isn’t why you make it rare.  You make it rare because you have to hang onto all the language state at these safepoints, but don’t need it between safepoints.

An example might help:

  for( int i=0; i<N; i++ )
    sum += A[i];  // set a breakpoint here, for "i>1e6" or maybe "i>1e7"

So the intent is to run awhile… then somebody sets a breakpoint… and wants to inspect “sum”, or change “i” in the debugger, then continue execution.  What’s the machine code look like, with no safepoints?  Something like this (bogus X86-like assembly code, pardon non-perfection):

 ... some setup, zero-trip test, range-check on "N" vs "A.length", etc
 ld  RA=&A        // Classic strength-reduced pointer into array A
 add RC,RA,#N*8   // Stopping value of A
 ld  RB=0         // sum
 loop:
 ld  RD=[RA+off]  // Load from A
 add RB=RB,RD     // Sum into RB
 add RA=RA,8      // Advance our pointer into A
 cmp RA,RC        // Stop when at the end
 blt loop
 // RB contains sum

 

But actually, with some loop unrolling:

... some setup, zero-trip test, range-check on "N" vs "A.length", etc
 ld  RA=&A        // Classic strength-reduced pointer into array A
 add RC,RA,#(N & -4)*8  // mask off low 2 bits of iteration count - unroll by 4
 ld  RB=0 // sum
 loop:
 prefetch [RA+off+32]  // Fetch next cache line... overlap memory & compute
 ld  R0=[RA+off+ 0]    // Get 4 values array A
 ld  R1=[RA+off+ 8]
 ld  R2=[RA+off+16]
 ld  R3=[RA+off+24]
 add R4=R0,R1  // add-tree to allow more parallel adders
 add R5=R2,R3  // add-tree to allow more parallel adders
 add RB=RB,R4  // partial sum into RB
 add RB=RB,R5  // full sum into RB
 add RA=RA,8*4 // Skip over 4 elements of array A
 cmp RA,RC     // Stop when at the end
 blt loop
 ... cleanup code for 0-3 more iterations
 // RB contains sum

 

Now I want to break on the “sum += A[i]” line… where is that, exactly, in that unrolled loop?

And which register contains “i”?  (Hint: none of them).

Do I even want to keep a shadow copy of “i” in parallel with my “A” pointer I’m rolling through the loop?  It’s pure overhead, unless I breakpoint and want to have some place to read “i”…. but even then I cannot *change* this shadow “i” because it won’t change my “A” pointer.

What about changing N?  Do I have enough info in the register mapping to describe all possible kinds of updates I might want to do to register RC when I tweak N?  This example is simple, but optimizing compilers can do hella complex code changes.

How does a Safepoint help here?  Well, for starters it points out that I do indeed need some way to find “i” – and, if I change it, a way to propagate those changes back into the loop.  Let’s look at this code with a Safepoint in it.

... some setup, zero-trip test, range-check on "N" vs "A.length", etc
 ld  RA=&A        // Classic strength-reduced pointer into array A
 add RC,RA,#(N & -4)*8  // mask off low 2 bits of iteration count - unroll by 4
 ld  RB=0         // sum
 ld  RI=0         // Concession to Safepoint: keep "i" around
 loop:
 prefetch [RA+off+32]  // Fetch next cache line... overlap memory & compute
 ld  R0=[RA+off+ 0]    // Get 4 values array A
 ld  R1=[RA+off+ 8]
 ld  R2=[RA+off+16]
 ld  R3=[RA+off+24]
 add R4=R0,R1  // add-tree to allow more parallel adders
 add R5=R2,R3  // add-tree to allow more parallel adders
 add RB=RB,R4  // partial sum into RB
 add RB=RB,R5  // full sum into RB
 add RA=RA,8*4 // Skip over 4 elements of array A
 add RI=RI,4   // Concession to Safepoint: keep "i" around
 SAFEPOINT: RI contains "i", RB contains "sum", "some place in the stack" contains "N"
 cmp RA,RC     // Stop when at the end
 blt loop
 ... cleanup code for 0-3 more iterations
 // RB contains sum

 

So I added 1 op in my hot loop to compute “i” on the fly.  Not too expensive; 1 clock per cache-line of data.  In general we end up keeping alive what is otherwise programmatically dead “just in case” the user stops the running code and wants to inspect (dead) variables – but this is cheap.  Just spill ’em off to stack somewhere and flag ’em in the Safepoint’s variable map.
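To make that “variable map” concrete, here is a tiny, purely conceptual sketch of what a Safepoint’s map might record; the class and the stack offset are made up for illustration and are not HotSpot’s actual debug-info format:

 import java.util.HashMap;
 import java.util.Map;

 // Hypothetical sketch: the per-Safepoint map from language-level variables
 // to machine locations.  A real JIT encodes this compactly as per-PC debug
 // info; this just shows the idea for the SAFEPOINT in the loop above.
 class SafepointMapSketch {
   static final Map<String,String> LOOP_SAFEPOINT = new HashMap<>();
   static {
     LOOP_SAFEPOINT.put("i",   "register RI");
     LOOP_SAFEPOINT.put("sum", "register RB");
     LOOP_SAFEPOINT.put("N",   "stack slot [RSP+16]");  // made-up stack offset
   }
 }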

How do we trigger the Safepoint?  Easiest way is to have the safepoint code test some thread-local “stop” bit, and if it’s set… stop.  If we have to stop we might have a lot of cleanup to do, but stopping is rare so the cost (of mostly not stopping) is low.  There’s lots of ways to make the safepoint check cheap.  Here’s one that depends on keeping a 2Meg (TLB-sized) window above the stack pointer mapped in or out – perfect for an X86 TLB:

  ld RX,[RSP+2M] // will trap if we map this page out, then catch the SEGV & handle
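The simple polling flavor mentioned above, written out in Java-ish form, amounts to the following; the field and method names are made up for illustration – a real VM emits this check as a couple of machine instructions at loop back-edges, not as Java code:

 // Minimal sketch of the polling flavor of the safepoint check.
 class SafepointPollSketch {
   // Set by the VM when it wants every thread to stop at its next safepoint.
   static volatile boolean safepointRequested;

   static void safepointPoll() {
     if (safepointRequested)    // almost always false: cheap, well-predicted
       blockAtSafepoint();      // rare path: spill state, park until released
   }

   static void blockAtSafepoint() { /* save state into the map, wait... */ }

   static long sum(long[] A, int N) {
     long sum = 0;
     for (int i = 0; i < N; i++) {
       sum += A[i];
       safepointPoll();         // the check planted at the loop back-edge
     }
     return sum;
   }
 }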

 

How do we use the Safepoint?  Once we’ve stopped the running thread mid-loop, we expect the trap handler to have saved off all the registers.  Later we can inspect the register map to find “i” or “sum”.  Note that a Java-level query cannot ask for our “RA” register – as it’s the sum of “&A” and “i*8” and isn’t visible at the language level – so “RA” does not appear in the Safepoint’s map.

How do we change “i” or “N” and have it All Just Work?  Basically, we re-enter a clone of the loop from above… a clone specially made to be re-entered at any iteration.  The loop setup code in the clone will rebuild “RA” and “RB” and any other intermediate state as needed, before falling into essentially the same code as the normal loop (and indeed they might be shared here!).  Or in other words – since the optimizer did something complex to build “RA” from “&A” and “i” – have the optimizer do it again for any arbitrary new “i”.
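As a purely illustrative sketch (hypothetical code, not the JIT’s actual mechanism), the re-enterable clone is just a version of the loop whose prologue re-derives the optimized state – the strength-reduced pointer, the unroll bookkeeping – from whatever “i”, “sum” and “N” the debugger hands back:

 class ResumeLoopSketch {
   // Conceptual re-enterable loop clone.  The debugger supplies possibly
   // modified values of i, sum and N; in machine code this is where RA and
   // RC get rebuilt from &A, i and N before falling into essentially the
   // same loop body as the normal loop.
   static long resumeSum(long[] A, int N, int i, long sum) {
     for (; i < N; i++)
       sum += A[i];
     return sum;
   }
 }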

The general plan is simple (from the optimizing compiler’s point-of-view): Safepoints are treated like a function-call only taken on a very very rare execution path.  Safepoints take as inputs all visible language-level state, including the state of all inlined function calls, and generally are assumed to update all memory state but not any local vars (if you are also hacking local variable state via the debugger interface, then we need to go a step further and deopt/reopt – leading to the loop clone mentioned above).

With this magical “low frequency function call” in hand, the optimizer attempts the best optimizations it can, subject to the normal constraints that the call arguments are available somehow.

Hope this helps,
Cliff

 

Hacking in H2O

In preparation for H2OWorld, I’ve written a couple of blogs to help people quick-start hacking in H2O.  All of these talks blend engineering, math and Big Data – and need nothing more than a really basic (free) Java development environment.

http://0xdata.com/blog/2014/11/Hacking/Algo/KMeans

http://0xdata.com/blog/2014/11/Hacking/Algo/Quantiles

http://0xdata.com/blog/2014/11/Hacking/Algo/Grep

 

Cliff

Sparkling Water: H2O + Scala + Spark

H2O & Scala & Spark

Spark is an up-and-coming big data technology; it’s a whole lot faster and easier than existing Hadoop-based solutions – and H2O is the premier open-source Big Data Machine Learning solution. We are happy to announce that H2O now has a basic integration with Spark – Sparkling Water!

This is a “deep” integration – H2O can convert Spark RDDs to H2O DataFrames and vice-versa directly. The conversion is fully in-memory and in-process, and distributed and parallel as well. This also means H2O has a (nearly) full integration with Scala – H2O DataFrames are full Scala Collection objects – and the basic foreach collection calls naturally run distributed and parallel too.

You can read the rest of my blog on my company website!

Cliff

 

H2O Architecture

There have been a lot of questions about the H2O architecture.  I hope to answer the top-level ones in the full post; I’ve included the first part of it here.

H2O is a fast & flexible engine.  We talk about the Map/Reduce execution flavor a lot in public, because it’s easy to explain, because it covers a lot of ground, and because we’ve implemented a bunch of dense linear-algebra style algorithms with it – but that’s not the only thing we can do with H2O, nor is it the only coding “style”.

H2O is based on a number of layers, and is coded at different layers to best approach different tasks and objectives.

  • In-memory distributed K/V store layer: H2O sports an in-memory (not-persistent) K/V store, with exact (not lazy) consistency semantics and transactions.  The memory model is exactly the Java Memory Model, but distributed.  Both reads and writes are fully cacheable, and typical cache-hit latencies are around 150ns (that’s 150 nanoseconds) from a NonBlockingHashMap.  Let me repeat that: reads and writes go through a non-blocking hash table – we do NOT suffer from a classic hot-blocks problem.  Cache-misses obviously require a network hop, and the execution times are totally driven by the size of data moved divided by available bandwidth… and of course the results are cached.  The K/V store is currently used to hold control state, all results, and the Big Data itself.  You can certainly build a dandy graph-based algorithm directly over the K/V store, or integrate H2O’s model scoring into a low-latency production pipeline.
    • A note on cluster size:  H2O clusters are peer-to-peer, and have no upper bound on their size.  We have built clusters with over 100 nodes.  Every node “knows” where every Key is, without actually having to hold onto the Key.  Hence asking for a particular Key’s Value is no more expensive than 1 network hop from the Key requester to some Key holder, and back.
  • A columnar-compressed distributed Big Data store layer: Big Data is heavily (and losslessly) compressed – typically 2x to 4x better than GZIP on disk (YMMV), and can be accessed like a Java Array (a Giant greater-than-4-billion-element distributed Java array).  H2O guarantees that if the data is accessed linearly then the access time will match what you can get out of C or Fortran – i.e., be memory bandwidth bound, not CPU bound.  You can access the array (for both reads and writes) in any order, of course, but you get strong speed guarantees for accessing in-order.  You can do pretty much anything to an H2O array that you can do with a Java array, although due to size/scale you’ll probably want to access the array in a blatantly parallel style.
    • A note on compression: The data is decompressed Just-In-Time strictly in CPU registers in the hot inner loops – and THIS IS FASTER than decompressing beforehand because most algorithms are memory bandwidth bound.  Moving a 32-byte cache line of compressed data into CPU registers gets more data per-cache-miss than moving 4 8-byte doubles.  Decompression typically takes 2-4 instructions of shift/scale/add per element, and is well covered by the cache-miss costs (a minimal sketch follows just after this list).  As an example, adding a billion compressed doubles (e.g., to compute the average) takes 60ms on a 32-core (dual-socket) E5-2670 Intel – an achieved bandwidth of 133GB/sec – hugely faster than all other Big Data solutions.
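To make the decompress-in-registers point concrete, here is a minimal sketch under an assumed encoding – a column of doubles stored as 2-byte integers with a per-chunk scale and bias.  The real H2O chunk encodings vary, but the per-element work is the same handful of widen/scale/add operations:

 class CompressedSumSketch {
   // Just-in-time decompression in the hot loop: each element is expanded
   // in registers and consumed immediately, so the loop streams ~2 bytes
   // per element from memory instead of 8.
   static double sum(short[] compressed, double scale, double bias) {
     double sum = 0;
     for (short v : compressed)
       sum += v * scale + bias;   // widen, scale, add - then use it
     return sum;
   }
 }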

This is part of a much larger blog I wrote… the full post can be found on my official 0xdata website here:

http://0xdata.com/blog/2014/03/h2o-architecture/

Sorry for the redirect, but I’m trying to get people to migrate to my new blog site.

Thanks,
Cliff

 

TCP is UNreliable

Been too long between blogs…

“TCP Is Not Reliable” – what’s THAT mean?

Means: I can cause TCP to reliably fail in under 5 mins, on at least 2 different modern Linux variants and on modern hardware, both in our datacenter (no hypervisor) and on EC2.

What does “fail” mean?  Means the client will open a socket to the server, write a bunch of stuff and close the socket – with no errors of any sort.  All standard blocking calls.  The server will get no information of any sort that a connection was attempted.  Let me repeat that: neither client nor server get ANY errors of any kind, the client gets told he opened/wrote/closed a connection, and the server gets no connection attempt, nor any data, nor any errors.  It’s exactly “as if” the client’s open/write/close was thrown in the bit-bucket.
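In Java terms the failing pattern is nothing more exotic than the following – a minimal sketch of the call pattern, not our actual code path:

 import java.io.OutputStream;
 import java.net.Socket;

 // Every call below can return normally even when the receiver never sees
 // the connection, the data, or any error.
 class SendAndForget {
   static void send(String host, int port, byte[] payload) throws Exception {
     try (Socket sock = new Socket(host, port)) {  // connect: "succeeds"
       OutputStream out = sock.getOutputStream();
       out.write(payload);                         // write: "succeeds"
       out.flush();
     }                                             // close: "succeeds"
   }
 }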

We’d been having these rare failures under heavy load where it was looking like a dropped RPC call.  H2O has its own RPC mechanism, built over the RUDP layer (see all the task-tracking code in the H2ONode class).  Integrating the two layers gives a lot of savings in network traffic; most small-data remote calls (e.g. nearly all the control logic) require exactly 1 UDP packet to start the call, and 1 UDP packet with the response.  For large-data calls (i.e., moving a 4Meg “chunk” of data between nodes) we use TCP – mostly for its flow-control & congestion-control.  Since TCP is also reliable, we bypassed the Reliability part of the RUDP.  If you look in the code, the AutoBuffer class lazily decides between UDP or TCP send styles, based on the amount of data to send.  The TCP stuff used to just open a socket, send the data & close.

So as I was saying, we’d have these rare failures under heavy load that looked like a dropped TCP connection (it was hitting the same asserts as dropping a UDP packet, except we had dropped-UDP-packet recovery code in there that had been working forever).  Finally Kevin, our systems hacker, got a reliable setup (reliably failing?) – it was an H2O parse of a large CSV dataset into a 5-node cluster… then a 4-node cluster, then a 3-node cluster.  I kept adding asserts, and he kept shrinking the test setup, but still nothing seemed obvious – except that obviously during the parse we’d inhale a lot of data, ship it around our 3-node cluster with lots of TCP connections, and then *bang*, an assert would trip about missing some data.

Occam’s Razor dictated we look at the layers below the Java code – the JVM, the native code, the OS layers – but these are typically very opaque.  The network packets, however, are easily visible with Wireshark.  So we logged every packet.  It took another few days of hard work, but Kevin triumphantly presented me with a Wireshark log bracketing the Java failure… and there it was in the log: a broken TCP connection.  We stared harder.

In all these failures the common theme is that the receiver is very heavily loaded, with many hundreds of short-lived TCP connections being opened/read/closed every second from many other machines.  The sender sends a ‘SYN’ packet, requesting a connection.  The sender (optimistically) sends 1 data packet; optimistic because the receiver has yet to acknowledge the SYN packet.  The receiver, being much overloaded, is very slow.  Eventually the receiver returns a ‘SYN-ACK’ packet, acknowledging both the open and the data packet.  At this point the receiver’s JVM has not been told about the open connection; this work is all happening at the OS layer alone.  The sender, being done, sends a ‘FIN’, for which it does NOT wait for acknowledgement (all data has already been acknowledged).  The receiver, being heavily overloaded, eventually times out internally (probably waiting for the JVM to accept the open-call, and the JVM being overloaded is too slow to get around to it) – and sends a RST (reset) packet back… wiping out the connection and the data.  The sender, however, has moved on – it already sent a FIN & closed the socket, so the RST is for a closed connection.  Net result: the sender sent, but the receiver reset the connection without informing either the JVM process or the sender.

Kevin crawled the Linux kernel code, looking at places where connections get reset.  There are too many to tell which exact path we triggered, but it is *possible* (not confirmed) that Linux decided it was the subject of a DDOS attack and started closing open-but-not-accepted TCP connections.  There are knobs in Linux you can tweak here, and we did – and could make the problem go away, or be much harder to reproduce.

With the bug root-caused in the OS, we started looking at our options for fixing the situation.  Asking our clients to either upgrade their kernels, or use kernel-level network tweaks, was not in the cards.  We ended up implementing two fixes: (1) we moved the TCP connection parts into the existing Reliability layer built over UDP.  Basically, we have an application-level timeout and acknowledgement for TCP connections, and will retry TCP connections as needed.  With this in place, the H2O crash goes away (although if the retry code triggers, we log it and use app-level congestion delay logic).  And (2) we multiplex our TCP connections, so the rate of “open TCPs/sec” has dropped to 1 or 2 – and with this 2nd fix in place we never see the first issue at all.
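A minimal sketch of fix (1) – an application-level length, acknowledgement, timeout and retry wrapped around a TCP send.  This is only the shape of the idea, not H2O’s actual AutoBuffer code:

 import java.io.DataInputStream;
 import java.io.DataOutputStream;
 import java.io.IOException;
 import java.net.Socket;

 class ReliableTcpSendSketch {
   static void send(String host, int port, byte[] payload) throws IOException {
     for (int attempt = 0; attempt < 3; attempt++) {    // app-level retry
       try (Socket sock = new Socket(host, port)) {
         sock.setSoTimeout(5_000);                      // app-level timeout
         DataOutputStream out = new DataOutputStream(sock.getOutputStream());
         out.writeInt(payload.length);                  // send expected length...
         out.write(payload);                            // ...then the data
         out.flush();
         int acked = new DataInputStream(sock.getInputStream()).readInt();
         if (acked == payload.length) return;           // receiver confirmed it all
       } catch (IOException e) {
         // fall through and retry
       }
     }
     throw new IOException("send failed after retries");
   }
 }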

At this point H2O’s RPC calls are rock-solid, even under extreme loads.

UPDATE:

Found this decent article: http://blog.netherlabs.nl/articles/2009/01/18/the-ultimate-so_linger-page-or-why-is-my-tcp-not-reliable
Basically:

  • It’s a well-known problem, in that many people trip over it and get confused by it
  • The recommended solution is app-level protocol changes (send expected length with data, receiver sends back app-level ACK after reading all expected data). This is frequently not possible (i.e., legacy receiver).
  • Note that setting flags like SO_LINGER are not sufficient
  • There is a Linux-specific workaround (SIOCOUTQ)
  • The “Principle of Least Surprise” is violated: I, at least, am surprised when ‘write / close’ does not block on the ‘close’ until the kernel at the other end promises it can deliver the data to the app.  Probably the remote kernel would need to block the ‘close’ on this side until all the data has been moved into the user-space on that side – which might in turn be blocked by the receiver app’s slow read rate.

Cliff

 

Captain’s Log Days 11-19

Captain’s Log Day 11

It’s another long drive day for us; we’re trying to get from Stone Mountain (near Atlanta) to Harrisburg, PA today – and Chaplain CT sometime tomorrow.  We’re quite expert at breaking camp by now; it takes maybe an hour to pull up all the sleeping bags and fold all the couches and tables back out, to shower and freshen up, to reload fresh water tanks and dump the other tanks.  We spend another hour in a local Walmart replacing basic supplies and then we’re on the road.

The kids have figured out how to keep themselves busy on the drive.  We’ve got a TV and a Wii, and some amount of reading.  There’s singing and tickle fights, and lots of napping.  There’s food-making and grumbling about dish cleanup.  We camp out in the middle of Pennsylvania.  We pass the 3500 miles traveled mark, the 1/2-way point.

Captain’s Log Day 12

We break camp at daylight without waking the kids, and drive maybe two hours before the kids bother to roll out of bed.  RV “camping” is a real trick.  We make it around New York with only 1 truly crazy driver incident; a bright red pickup truck came blazing up the left side and was clearly out of room to pass us, but did so anyways.  He sliced across at a 45-degree angle in front of us. Had I not slammed the brakes and swerved we clearly would have hit the truck; and such a hit would have rolled him.

We finally pull into my Uncle Bill’s farm in Connecticut around 4pm.  We settle the RV, then meander down to the river behind the farm, where one of my cousins is RV camping.  We swim in the river, cook burgers on the campfire and sit around visiting until way past dark.

Captain’s Log Day 13

We hang out at the farm all day; some of the kids swim in the river or fish or shoot fireworks off after dark.  I mostly hung out and caught up with the family news.  Shelley & I attended the local church wine-tasting, which was basically a chance to drink a bunch of wines that somebody else bought, and do more catching up on family news.

 

Captain’s Log Day 14

Shelley & I borrow a cousin’s car and drive to Cape Cod for the day.  OMG, a car is SO much nicer to handle than Nessie!  We take the slow route up the Cape, stopping at every tiny town and inlet.  Shelley’s family owned a summer house in Dennis Port 50 or 60 years ago, and Shelley was tracing her roots.  We managed to stick our toes in the Atlantic and really unwind.  Shelley & I both like driving, so it’s another really peaceful down day.

 

Captain’s Log Day 15

Up early, we force all the kids to take showers (and change clothes; 2 weeks into vacation and our standards are getting pretty lax) and we hit the road.  Breaking camp is now a pretty standard operation.  By rotating drivers and Shelley driving until the wee hours we make it almost to Indiana.

 

Captain’s Log Day 16

We pull into the University of Illinois at Urbana-Champaign around noon.  I’m giving a talk at 6, and UofI is paying for dinner and 3(!) hotel rooms for us (one for each couple, and one more for the 3 kids).  Real showers for all again!  Yeah!!!  The talk goes really well; it’s my Debugging Data Races talk and it’s a good fit for the summer course on multi-core programming.  Shelley and I manage to sneak a beer afterwards.

 

Captain’s Log Day 17

Again we manage to break camp in short order and do another long day of driving through Illinois, Iowa, and Nebraska.  By now we’ve got a rhythm going; Shelley takes the early morning driving shift while everybody sleeps in, then Luke and I alternate shifts until evening (while Shelley naps), and Shelley takes the late night shift.  I think we’re covering around 800 miles in a day.

 

Captain’s Log Day 18

Today it’s the rest of Nebraska and Wyoming, then Utah.  My Dad manages to call me out in the middle of I-80 nowhere land, to the bemusement of all.  We hit high winds on and off all day.  At least once I was driving with the steering wheel cranked over a full 180 degrees (and was down to 45 mph) just to stay on the road.  18-wheelers would blow by us, knocking us all over the road.  First the bow wave would push us hard to the right, onto the shoulder.  Then the wind-block (and my 180-degree wheel position) would drive us hard back onto the road and into the truck, then the trailing suction would pull us harder into the truck – even as I am cranking the wheel the other way as fast as I can… and then the wind would return.  It was a nerve-wracking drive.  Shelley took over towards evening.  Around 11pm the winds became just undrivable, even for her.  I was dozing when suddenly we got slapped hard over, almost off the shoulder.  Even driving at 40mph wasn’t safe.  An exit appeared in the middle of nowhere – even with an RV park (mind you, it’s typically 30 miles between exits *without services*).  We bailed out.  All night long the RV was rocked by winds, like a Giant’s Hand was grabbing the top of Nessie and shaking her like a terrier does a rat.

 

Captain’s Log Day 19

Morning dawns clear and still.  We hit the road again early, as we’ve a long drive today.  It’s a quiet drive through to Reno, and then we hit some really crazy drivers again – a combo of construction zone, short merge lanes and stupidity (outside the RV) nearly crushed a darting micro-car.  The construction on the Donner Pass was perhaps even worse; we managed to get forced into clipping a roadside reflector on the right (less than a foot away from the mountain stone versus pushing an aggressive SUV into the naked concrete on his left).  Finally past all the madness we get to the clear road down from Tahoe and through the Bay Area – but it’s all Homeward Bound on the downhill slide through our home turf!

Home At Last!!!

—-

Some parting stats:

We passed through 22 states (24 for Shelley & I, as we also get to count Rhode Island and Massachusetts).
We drove about 6900 miles.
I bought about $3000 in gas, and $1300 in tires.
We saw 4 close family members in Tucson, 7 in Texas, my brother in Atlanta, and at least 16 in Connecticut (I lost the exact count!).
I did about 20 loads of laundry after returning (the washer ran continuously for 2 days).

Cliff