Winds of Change

I’ve left H2O.  I wish them all the best.  I’ve left a longer farewell here.

I’m a loose cannon, at least for a few months, and am looking for (the right kind of) trouble.

So I’m here, posting on my blog again, to see if I can get some suggestions on what to do with my life.  🙂

Here are a few of my crazy Cliff ideas I’m sorta chasing around:

  • Python Go Fast: Do Unto Python as Thou hast Done Unto Java. I hack the guts of Python: add a high-powered JIT, a full-fledged low-pause GC, and true multi-threading support – i.e. make Python as fast and as parallel as Java (about the same speed as C).  This blog is really a request for an open discussion on this topic.  Is the Python community interested?  How does this get funded?  (uber Kickstarter?)  I’ll only go here with the full support of the core Python committers, and general “feel goods” from the general Python community – and I’m hoping to start a discussion.  At this point I’m a premier language implementer, and making Python Go Fast is well within my abilities and past experiences. It would take about 2 years & $2M for this effort to be self-sustaining (build all the core new tech and hand it off to other contributors).
  • H2O2: I left a lot of unfinished technical work at H2O – and H2O has plenty of technical growing room.  I could continue to contribute to the Open Source side of H2O, with some Big Company footing the dev bill.  Big Company gets kudos for supporting Open Source, H2O gets the next generation of cool new features.
    • Plan B, Big Company funds some new core closed-source innovations to H2O and monetizes that.  H2O still gets some Open Source improvements but not all core tech work is open.
  • Teach: I bow out of the Great Rat Race for a year and go teach Virtual Machines at Stanford or Berkeley.  Fun, and makes for a nice sabbatical.  (as a bonus, I’ll probably have 3 kids in college next year, and the whole Stanford Pays Faculty Kids’ College thing sounds really good).  I might be able to do this while hacking Python guts at the same time.
  • Jetsons: I know how to Do the Flying Car Thing Right.  Million people, million flying cars in the air, all at once, and nobody can crash anything.  Feels like you’re flying but the Autopilot-o-Doom holds it all together.  I’ve figured out how to handle bad weather, ground infrastructure lossage (e.g. the Big Quake wipes out ground support in 10sec, how do you land 1 million cars all at once?), integration into the existing transportation fabric, your driveway as a runway, playing nice with the big jets, etc.  Bold statement, needs bold proof.  Lots more stuff here, been thinking on this one for a decade.  Needs 3 or 4 years to put together the eye-popping this-might-actually-work prototype.  Thus needs bigger funding; probably in the $10M range to get serious.
  • Something Random: by which I mean pretty much anything else that’ll pay the bills and is fun to do.  I got some pretty darned broad skillz and interests…

Gonna be a long-needed Summer-o-Crazy-Fun for me.  Heck, maybe I’ll post my 2nd tweet (shudder).  🙂

Cliff


49 thoughts on “Winds of Change”

  1. Regarding the Python thing, that’s clearly possible in light of the enormous speedups JavaScript has seen. I always wonder why Python, Ruby and PHP are not pursuing similar approaches. A basic implementation of runtime specialization seems not out of reach.

    Here’s a request: Do to .NET what you did to Java 🙂 They have no clue. Look at the coreclr repo. I wonder why they are not at least introducing a multi-tier JIT. They always lament the fact that startup time is slow. So add an interpreted tier one and a high-quality tier two.

    Their optimizer seems to be tree-based internally, which causes innocuous code changes to produce different code because they break its pattern matching. Introducing a temp variable can actually change codegen meaningfully.

    • Wow, painful. Yeah, I didn’t do trees for the Java-to-machine-code path; went straight to a low-level IR (that was easy enough to find the loops in, then do the loop opts).

      Cliff

  2. While Python could certainly use the help, Google tried that already with Unladen Swallow, and was never able to get the obvious changes accepted. Pythonistas have essentially worked around that with Cython and PyPy, but I never understood why they had to.

      • You might want to look at the .NET on LLVM work called LLILC (I believe). They’re building a .NET JIT on LLVM right now. GC, EH and low throughput seem to be bugging them.

        Andy Ayers did a talk.

        • Interesting. Common themes here: using LLVM, using a tracing JIT instead of HotSpot’s method-at-a-time style, a language without strong types, and having trouble with e.g. GC & EH while not getting the desired performance levels.
          – Using LLVM, seems like as a compiler it “ought” to be strong enough
          – No strong types – defeats a lot of common compiler optimizations (your alias analysis goes to sh*t in a hurry; HotSpot only ever did basic type-equivalence classes, and that got 99% of what any uber-alias-analysis ever would). And… HotSpot added on various kinds of type-specific further analysis; not sure if LLVM has e.g. Class Hierarchy Analysis – or if it matters in an untyped world. Failing good basic pointer analysis immediately fails out tons of follow-on optimizations.
          – No strong types – requires gobs of type-checks. These need to be baked into the compiler in a truly fundamental way. Naively you need lots of them for correctness; the compiler must totally integrate them into all its other optimizations directly – to have a chance at removing enough that you get some “working room” for the optimizer to make progress. (A toy sketch of this guard-then-specialize pattern follows this comment.)
          – GC: Must have it baked into the compiler all the way through. Especially derived pointers (for good array loop performance) have to be baked into the compiler early on. I suspect there’s a bad division of labor here between LLVM and the GC runtime.
          – EH: HotSpot did this custom all the way, and with enough runtime-grief managed to not have it impact JIT’d code quality (unless you actually took the Exception). Again, I suspect a bad division of labor.

          Definitely I think I can do better here!
          🙂
          Cliff
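
To make the type-check point concrete, here is a toy sketch (my illustration, not HotSpot’s or LLILC’s actual machinery) of the guard-then-specialize pattern a JIT needs in an untyped language. All names are hypothetical:

```python
# Toy illustration of why untyped languages drown in type checks, and
# what a JIT must do about it. Names here are mine, purely illustrative.

def generic_add(a, b):
    # The interpreter's view of `a + b`: dispatch on type, every time.
    if isinstance(a, int) and isinstance(b, int):
        return int.__add__(a, b)
    if isinstance(a, str) and isinstance(b, str):
        return str.__add__(a, b)
    return a + b                        # full dynamic dispatch as the fallback

def sum_generic(xs):
    total = xs[0]
    for x in xs[1:]:
        total = generic_add(total, x)   # a type check per iteration
    return total

def sum_specialized(xs):
    # The JIT's goal: prove (or guard) the types once, then run a loop
    # body with zero per-iteration checks. Here the "guard" is a scan;
    # a real JIT guards values as they first flow into the trace.
    if not all(type(x) is int for x in xs):
        return sum_generic(xs)          # "deoptimize" to the slow path
    total = 0
    for x in xs:
        total += x                      # no type checks left to remove
    return total

print(sum_specialized([1, 2, 3]))       # 6, via the fast path
print(sum_specialized([1.5, 2.5]))      # 4.0, via the generic fallback
```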

  3. Been using Python as my primary language since around 2001. Production Python code uses A LOT of extensions written in C. The general consensus seems to be that the C extension API would prevent any meaningful speedups unless the API is dropped and all of the C code rewritten. Which people seem to believe is not doable. (You can look at how PyPy tries to cope with this.)

    Also, from my experience with CPython core devs I think you’ll have a lot of idiotic pushback to your proposals no matter how reasonable.

    So while I think you can definitely do it, I would not count on “general feel goods”, and it’s probably not worth it in that respect.

    • Java went through that at some point – getting the C-vs-Java GC thing figured out required some painful C-wrapper rewriting, and *some* C code had to be redone (but definitely not all) – if you didn’t “play nice” with GC pointers you lost out when Java moved to a better/faster/lower-latency GC.
      Cliff

      • In Python land the incentives seem to work the other way here: If your implementation doesn’t play nice with the existing C API, you won’t get anybody to use it, because everybody’s using a boatload of them.

        • Which brings us to the question of making forward progress in the land of single-threaded Python.
          Java answered this question something like 10 or 15 years ago.
          Suppose 90% of C packages are not multi-thread-safe, and there’s no way to tell if they are or not up front.
          Suppose there exists a true multi-threaded Python, which happily wants to call native C code in parallel.
          Some packages crash (rarely), some are fine.
          How do you move forward?
          Suppose you could declare, on a package-by-package basis, whether or not that package was thread-safe – perhaps on the command line. Then those that were not thread-safe would take the equivalent of the Global Interpreter Lock (one package at a time, not even 2 unrelated packages), and the safe ones could charge ahead in parallel. Over time thread-safe versions of packages might appear, and then be allowed to run in parallel. (A minimal sketch of this per-package-lock scheme follows below.)

          Cliff
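
A minimal sketch of that per-package-lock idea (toy code; the thread-safe set and package names are hypothetical, and this is not any shipping CPython mechanism):

```python
import threading

# Hypothetical: which packages were declared thread-safe (say, on the
# command line). Everything else serializes behind a per-package lock.
THREAD_SAFE = {"fastmath"}

_package_locks = {}
_locks_guard = threading.Lock()

def call_extension(package, func, *args):
    """Call native code `func` from extension `package`."""
    if package in THREAD_SAFE:
        return func(*args)              # declared safe: full parallelism
    with _locks_guard:                  # unsafe: look up that package's
        lock = _package_locks.setdefault(package, threading.Lock())
    with lock:                          # private "mini-GIL": one thread per
        return func(*args)              # package, but unrelated packages overlap
```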

        • There’s the other question then, of having implementation details totally locked down to support the existing C API.
          Java pays a JNI cost: pointers are marshaled, other values expanded to their default primitive values. No objects are passed by structure, and the native code cannot manipulate structures directly, but only through call-backs. Painful – except that nearly all such cases don’t need a high-performance back-n-forth between Java and C – so the clunky indirect calls aren’t a speed issue. Arrays of primitives (not pointers) are special-cased for various kinds of bulk operations. Pointers have to obey the rules every time, so that GC implementations can be flipped out from under you. (A rough ctypes analogy follows below.)
          Cliff
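
For what it’s worth, CPython’s ctypes shows the same two cost regimes today. A sketch (assumes a Linux libc at this path): the primitive array crosses the boundary as one raw pointer with no per-element marshaling, while the comparator is a callback trampoline – the “clunky indirect call” path.

```python
import ctypes

libc = ctypes.CDLL("libc.so.6")         # assumption: Linux libc location
IntArray = ctypes.c_int * 5
arr = IntArray(4, 2, 9, 1, 7)           # contiguous C ints, passed by pointer

CMP = ctypes.CFUNCTYPE(ctypes.c_int,
                       ctypes.POINTER(ctypes.c_int),
                       ctypes.POINTER(ctypes.c_int))

def py_cmp(pa, pb):                     # every comparison re-enters Python:
    return pa[0] - pb[0]                # this trampoline is where cost hides

libc.qsort(arr, len(arr), ctypes.sizeof(ctypes.c_int), CMP(py_cmp))
print(list(arr))                        # [1, 2, 4, 7, 9]
```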

          • .NET has picked a nice trade-off here. Almost everything is marshaled without any expensive conversion. Primitives are passed as primitives. Value types are passed as pointers. References are passed as the underlying pointer and the object is pinned for the duration.

            Interestingly, strings are also passed by ref. A .NET string is an intrinsic type with a length prefix and a null-terminating char (that char is not observable from managed code).

            You can pass a StringBuilder as a mutable string. Not sure how efficient that is.

            Arrays of value types can also be passed by ref without marshalling work.

            This design was chosen for easy interop with the Win32 API and COM. Works extremely well in those spots.

            OS handles also have a nice marshalling mechanism (Google for SafeHandle; solves finalization problems for handles).

            Mentioning this because a lot of it might apply to Python native bindings.

          • I should mention that I’m an experienced .NET guy. I love .NET. I’m just very envious of the Hotspot JIT 🙂

            Just to point out how bad the .NET JIT is: when you write a.x + a.x, that actually loads x twice. They need your help, Cliff 🙂
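
The complaint in miniature (sketched in Python for brevity; the .NET case has the same shape – this is plain common-subexpression elimination):

```python
def twice_naive(a):
    return a.x + a.x   # without CSE: two field loads of x

def twice_hoisted(a):
    x = a.x            # one load into a local, reused --
    return x + x       # exactly what a decent optimizer should do for you
```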

  4. The core Python people are quite attached to their existing approach, but the PyPy people would probably welcome some help.

    • I suspect it’s going to have trouble getting traction – even if it gets all the performance of the original compiler. Rewrites that don’t add some significant new value generally have problems. I understand that it allows good cross-language performance, so maybe something comes of it yet.
      Cliff

    • Been there, Done That with Azul Systems. Twas a sweet ISA to be sure, fun and easy to JIT to, lots of small shortcuts important to compiler folks…
      Needs a real business model in the X86 Era.
      Cliff

  5. Have you taken a look at PyPy? They are by far the fastest Python JIT available today, and the concept of a meta-interpreter JIT is something I have always found fascinating.

    • I’ll look again. Last I looked (some time ago) they were far far off the HotSpot mark.
      I stared at meta-interpreter JITs for some time, they look really cool in theory.
      Alas, they don’t come close in performance.
      Cliff

  6. I don’t have a lot of faith in a fast python as long as they cling to their C extensions and reference counting. The most productive efforts have had to abandon both, and as a result nobody uses them. The Ruby world has C ext issues as well, but we’ve managed to build enough of a community around JRuby to get most of them replaced.

    What I really want is a JVM that can do the dynamic language optimizations we’ve wanted for years. We need partial escape analysis, better method specialization (especially in light of closures), more flexible method sizing/loading/lifecycle, and a better optimization curve for invokedynamic and method handles. We already beat CRuby (a bytecode interpreter) but we should be 50x faster, not the 3-5x we can boast today.

    • Re: Python & ref-counting – I think I can keep what is useful out of ref-counting (exact short-lifetime management, and thus exact destructor execution) and keep the speed – it needs compiler hacks to watch and mimic the ref-counting lifetimes, but seems doable. (See the example below.)

      Re: JVM hacks: I did some of this at Azul, now caught behind their paywall. I could do it again.
      Really needs a biz model, or at least some reasonable payout for me.
      i.e., I can do these things, but who will pay?
      Oracle? Kickstarter from the Ruby community?

      Cliff
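
For the ref-counting point above, here is what “exact destructor execution” means in today’s CPython (a toy example; any replacement GC plus compiler hacks would have to preserve this observable timing):

```python
class Resource:
    def __init__(self, name):
        self.name = name
    def __del__(self):
        print(f"releasing {self.name}")   # deterministic in CPython today

def work():
    r = Resource("scratch-file")
    # ... use r ...
    # refcount hits zero when r goes out of scope, so CPython prints
    # "releasing scratch-file" before work() returns.

work()
print("after work()")   # a pure tracing GC might run __del__ after this
```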

  7. How about Julia lang? [1]

    Even though Python is the front-runner for scientific computation, Julia’s support for parallel operations and its better type system deserve some attention for the next generation of ML / scientific-computing languages.

    Best of luck after H2O.

    [1] http://julialang.org

  8. Cliff, Been enjoying your posts for years.

    I’ve recently been looking at the “modern language” space, with “go”, “rust”, “D”, etc. My bias is that compilers are good at checking types and both rust and go have shown that you can have an expressive and small language do a lot. You should check out what is happening there.

    With new languages, one also needs better database technology. One that strikes a resonance with me is “cockroachDB”. Highly distributed, reliable. Someday might have SQL (not that I care about that).

    There is much for someone with your skills to dive into.

    • I’ll probably talk to the “go” and “rust” folks soon.
      Never heard of cockroachDB; I’ll check it out.
      Thanks
      Cliff

  9. How about a distributed compute network backed by cryptocurrency? MaidSafe / Ethereum open source projects look like they have the potential to change the way we run apps in the cloud

    • Both of those look like fun, and potentially very useful.
      In particular, there’s the potential for a Real Cost Model – i.e., $$$ instead of time (which is always the real cost).
      – In C, a variable reference is a ‘load’ op, which has a known cost: 1 clock (cache hitting), or sometimes 1000 clocks (cache miss to main memory), and is sometimes free (hoisted out of a loop or commoned with another load).
      – In Java, a variable reference started out as an interpreted bytecode – 20 clocks of dispatch logic, and never hoisted. With the advent of JITs the cost goes back to the C model (plus, rarely, a null-pointer-check overhead).
      – In all cases the REAL cost is time
      – In these new VM’s you can (must?) pay for execution with virtual $$$ (safecoin or ether).
      – Which means you can swap virtual $$$ for actual time (by paying more $$$ you can encourage others to run your stuff on their computers) – a bidding system for otherwise wasted but available idle compute cycles

      It’s a very interesting model; I’ll be watching them now. Thanks for pointing them out!

      Cliff

  10. I prefer you to focus on flying cars. Forget that computer programming stuff. Issue a 3rd tweet when you get it solved.

  11. A ref-counting GC for OpenJDK (with concurrent mark & sweep to catch cycle islands). Modern NLP/big-data analytics require large heaps (beyond 0.5TB). Regardless of how much overhead refcounting induces, it would be much better than repeatedly full-GC’ing those large heaps (e.g. my current NLP/ngram analysis spends 90% of its time GC’ing an 80GB heap). I know one could apply tweaks to improve this, but still ..

    It might well get financed via Kickstarter if you can get the help of some popular VM heads like M. Thompson or Peter Lawrey.

    • H2O keeps Big Data in giant primitive (byte) arrays. These are very cheap to FullGC; typically we see 1 sec of full GC per 100GB of data (i.e. 2–3 secs on a 250GB heap) using the default old-school GC and totally default args.

      The Azul GC (for which I proudly played my part) has pause times independent of heap-size; I believe they are at the low-microsecond pause times for 250GB heaps.

      Cliff

  12. Hi Cliff,

    Regarding PyPy, I think that a lot of people would like to know your point of view and personal considerations about the current state of the project.

    Simone

    • My overview, based on 30sec of reading (feel free to correct mis-perceptions):
      – There’s a lot of performance left on the table with the current JIT & runtime setup; probably 2x or more. A 2nd-tier heavy-weight JIT will surely be helpful
      – There’s a lot of performance left on the table with the current GC; it “ought” to slow you down by no more than 5% always, independent of allocation rate and heap size (and timely __del__ execution can/should work at the same time)
      – I’m not sure I’d use STMs for multi-threading support. Python needs something here, yes, but STMs have “issues”, especially with false-positive sharing. Clojure did something smart here, by demanding the programmer declare the shared STM’able variables (“ref”s in Clojure, I believe). Bite the bullet and demand a language change for irregular multi-threading. Make the standard collection ops all parallel under the hood. (A toy sketch of declared refs follows this comment.)
      – The amount of money mentioned above is not enough to pay my bills. 🙁

      Cliff
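
To make the “declared refs” idea concrete, here is a toy optimistic-STM sketch in Python (my code, loosely modeled on Clojure’s ref/dosync; it is nothing like PyPy-STM’s actual implementation). Only explicitly declared `Ref`s participate, which is exactly what sidesteps false-positive sharing:

```python
import threading

class Ref:
    """Explicitly declared shared state; mutated only inside dosync()."""
    def __init__(self, value):
        self.value = value
        self.version = 0
        self.lock = threading.Lock()

def dosync(txn):
    """Run txn(read, write) atomically; retry on conflicting updates."""
    while True:
        reads, writes = {}, {}
        def read(ref):
            if ref not in reads:
                reads[ref] = (ref.value, ref.version)  # snapshot on first read
            return writes.get(ref, reads[ref][0])
        def write(ref, val):
            read(ref)                                  # record version to validate
            writes[ref] = val
        result = txn(read, write)
        refs = sorted(reads, key=id)                   # fixed lock order: no deadlock
        for r in refs:
            r.lock.acquire()
        try:
            if all(r.version == reads[r][1] for r in refs):
                for r, v in writes.items():            # commit
                    r.value = v
                    r.version += 1
                return result
        finally:
            for r in refs:
                r.lock.release()
        # some ref changed underneath us: retry the whole transaction

a, b = Ref(100), Ref(0)
dosync(lambda read, write: (write(a, read(a) - 30), write(b, read(b) + 30)))
print(a.value, b.value)   # 70 30
```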

  13. Here’s a list of items that the PyPy project needs to have implemented:
    a) You could complete the Python 3.3 version of PyPy. It will let you become familiar with how PyPy works, in order to implement other features for it.
    b) You could work on making the Python 3.x version of PyPy as fast as the Python 2.x version of PyPy.
    There is around $8000 left in the Python 3.x donation pool and I’m sure you can talk with Armin Rigo about this.
    PyPy has a variant that supports Software Transactional Memory (STM) in order to allow threads to run in parallel.
    See http://pypy.org/tmdonate2.html#work-plan-and-funding-details
    c) You could improve the PyPy-STM’s GC because right now it’s a performance bottleneck for that variant of PyPy.
    d) You could improve the PyPy-STM’s JIT to automatically replace lists and dicts with their safe STM equivalent.
    e) You could implement a Python 3.x version of PyPy-STM.
    There’s around $20k in the pool for STM.
    f) You could implement PyParallel threads for PyPy (https://github.com/pyparallel/pyparallel).
    g) You could combine PyParallel threads with STM by making the JIT optimize STM based threads to PyParallel threads if the threads do not share any global state.
    h) You could try to cache the traces and optimizations that the JIT emits. Dropbox already achieved this with their JITed implementation.
    Join us on irc.freenode.net’s #PyPy channel if you want to hear more.

    • > Join us on irc.freenode.net’s #PyPy channel if you want to hear more

      Never used freenode; I tried to google my way there but only got blank pages.
      Can you ship me an actual live link?
      Thanks
      Cliff

  14. This is a bit pie in the sky, no idea how feasible it is, but I think your experience with languages would make this interesting.

    I’ve always wondered if you could build a database, where you define a schema of some kind, no indexes, no partitioning scheme. And in the process of using the database, some sort of “JIT” would generate indexes on the fly (or partial indexes, or whatever is needed to make the queries go fast), and reorganize the data based on your query pattern.

    • I suspect a lot of DB-like things make indices on-demand. Not sure if they re-org the data directly though. R’s data.table does exactly that (make & cache indices on demand).

      • Thanks for the response Cliff!

        So in my experience most don’t make indices on demand – you need to specify them manually – and they generally don’t reorg the data unless you change the clustering index (this is for your standard relational DBs, Oracle et al.).

        The heuristics on when to index a column, or multiple columns, or re-shard things (if you are distributed) get interesting when you have a lot of things in play. You end up playing a lot with different configurations to get queries to do the right thing.
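
A toy of that “JIT an index” idea (a hypothetical scheme of mine, not any real DB’s behavior): count scans per column, and once a column is filtered often enough, quietly build a hash index and use it from then on. Re-sharding heuristics would layer on top of the same counters.

```python
from collections import defaultdict

SCAN_THRESHOLD = 3   # tunable heuristic: index a column after this many scans

class AdaptiveTable:
    def __init__(self, rows):
        self.rows = rows                  # list of dicts
        self.scans = defaultdict(int)     # column -> how often it was filtered
        self.indexes = {}                 # column -> {value: [rows]}

    def where(self, column, value):
        if column in self.indexes:
            return self.indexes[column].get(value, [])
        self.scans[column] += 1
        if self.scans[column] >= SCAN_THRESHOLD:
            idx = defaultdict(list)       # hot column: build the index now
            for row in self.rows:
                idx[row[column]].append(row)
            self.indexes[column] = dict(idx)
            return self.indexes[column].get(value, [])
        return [r for r in self.rows if r[column] == value]  # cold: full scan

t = AdaptiveTable([{"id": i, "mod": i % 10} for i in range(10000)])
for _ in range(3):
    rows = t.where("mod", 7)   # 3rd call builds the index; later calls are O(1)
print(len(rows))               # 1000
```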

  15. Teach the next generation about virtual machines at Stanford/Berkeley, but instead of hacking on Python on the side, maybe hack on Red: http://www.red-lang.org/p/about.html (not really serious, but I would love to see a fast Rebol-like language with a small runtime, judging by all the discussions about it on the web).

    A little more serious: Scala will get a native target in the near future (a first presentation about it will be held at Scala Days in New York, May 13th), which will probably not go the Rust way and will still use a GC. Since Scala is used by some large banks, maybe they would be interested in sponsoring work on Scala.native.
