Java vs. C Performance… Again.

I just foolishly got caught in a YouTube discussion on Java vs C performance.  Foolish because YouTube comments are a lousy way to present anything and because it’s hard to keep the level of discourse scholarly.  And foolish especially for me because I’ve had this discussion so many times and it always comes out the same way… so here’s my attempt at distilling my arguments into something I can point people to the *next* time I get caught in this silly discussion.

Is Java faster than C/C++?  The short answer is: it depends.

Places where C/C++ beats Java for obvious reasons:

  • Very small footprint is required (these days that does not include most phones).  You can get JVMs that run well in a few hundred KB.  Sometimes that’s too much.
  • Very small startup time (as opposed to very low response time on a well-warmed-up JVM).  Things like pacemakers (if mine takes a cosmic-ray-induced reboot, I want it restarted pretty darned quick!), or perhaps military gear (e.g. guided missiles).  Note that this does not include e.g. long-running hard-real-time airplane control; I know that at least one UAV uses Java as its primary control mechanism.  Startup of a JVM in microseconds is a very hard problem; startup in milliseconds might be vaguely possible; but a more common time-frame for a small program to get JIT’d is measured in a few seconds.  Once you are past the profiling & JIT’ing stage, microsecond response times are entirely doable.  Flash games beat Java games mostly because it took 30+sec to load the JVM from disk… and so now the web-game developer community has settled on Flash as the standard (and it still takes 10+sec to load the JVM).
    [BJ81: I DO care if my IDE takes 10 seconds to start as opposed to 2. I DO care if my word processor or computer game of choice takes 10 seconds to start as opposed to 2. Startup speed is an important component of the user experience for all end-user software.] 
    [Lots of other people complain about loading time]
    My IDE stays up for days… and all my computer games take more than a minute to load already.  But yes, I like faster loading.  This mostly depends on things like disk speed… and on Java being implemented as a large (on-disk) JVM, and not much on the actual language or JIT’ing.
    [SS: you can try JetBrains IDEA. From my experience it’s faster than Eclipse, less footprint, no lockups on GC. It’s Swing 🙂 The only problem it’s not free. The real problem with perceived Java performance is that none seriously optimized client Java performance before the most recent time.]
    Also my experience with JetBrains.  It’s amazingly fast. 
  • Optimizations beyond “gcc -O2”, such as tiling & blocking for registers or cache.  These optimizations pay off handsomely for dense linear algebra codes but no production JVM that I know of does them (some research JVMs back onto common heavy-weight optimizers which include these optimizations).  Auto-parallelizing compilers also fall into this realm.
    [DB: One article that may be of interest to you regarding Java’s transcendental performance: http://blogs.sun.com/jag/entry/transcendental_meditation The basic message: programs that extensively use transcendentals (sin/cos) will experience notably slower performance in Java due to the “correct vs. fast” implementation.]
  • Value Types, such as a ‘Complex’ type require a full object in Java.  This has both code speed and memory overheads.  Note that there are theoretical optimizations for both (Object Inlining) but implementations available in production JVMs are very weak right now.  See my comment below about rotating arrays-of-small-structs 90-degrees to a small-struct-of-arrays as a workaround.
  • Direct machine access, as is common in embedded systems.  Memory mapped I/O, etc.  Note that JNI goes some of the way to addressing this problem but JNI is clunky to use and definitely adds some overhead crossing the C/Java boundary.
    [FR: Java only supports one floating point rounding mode, so it’s not suitable for scientific applications. This might fall under “direct machine access” but FP rounding modes are really machine-independent because the IEEE standard requires them. “How Java’s Floating-Point Hurts Everyone Everywhere”: http://www.eecs.berkeley.edu/~wkahan/JAVAhurt.pdf]
    I’m not so sure how Everyone got hurt: working the simpler subset of FP allowed Java the time to get everything else right.  Had JVMs been forced to implement the whole spec well, we might never have gotten the JVMs we have now.
  • Interpreters.  Interpreters written in pure Java or pure C appear to be mostly equal (I’ve got very few cases where both are written in a pure style), but it’s easier to cross over into the non-pure realm and get 2x speedups from C.  gcc label variables (computed gotos) are the obvious use-case here, getting a 2x speedup over pure C in my experience.  Dipping into pure assembly gets me another 2x speedup.  Java’s “Unsafe” classes allow access to e.g. “compare-and-swap” instructions plus unchecked ‘peek’ and ‘poke’ but do not support code-gen or hand-made control flow very well.  (A toy dispatch loop illustrating the pure-Java style appears right after this list.)
  • On-the-fly code-gen-&-execute.  You can make bytecodes in Java and execute them but it’s somewhat more difficult than doing the same with machine instructions on a simple RISC chip… and you need the JIT to do a good job with your bytecodes (there are decent libraries to make it easier but it’s still harder than just doing some bit-twiddly thing and slamming bits into a buffer).  Sometimes what you want is to gen some specific machine instructions that do not resemble bytecodes (see the above comments about direct machine access) or to have fine-grained interleaving between the gen’d code and the static runtime bits (I’ve done sort routines which gen’d the key-compare sequence from a command-line description before sorting and called that from the inner sort loop).
  • OS’s.  These need lots of direct machine access (e.g. hardware interrupt support; page table support), but they also dork with standard execution stacks (like interpreters do) and also load code and execute it (see above comment about making & executing code).  Yes there have been some brave attempts at a pure Java OS, and Microsoft has had a similar research project in this area for a long time which has made interesting inroads into the problem.  So this might be a ‘solved’ problem some day.
  • “Direct Code Gen from C”… carefully hand-crafted C code, such that the author knows (and plans for) which exact machine instructions will be generated from the C compiler.  This is harder to do in Java because the translation layer is much more indirect (I’ve done it successfully, but I know a *lot* about the JIT).  Examples are things like kernels of crypto loops, or audio/video codecs or gzip/compression routines.  Small bits of very hot code with complex and unusual control flow where the author knows a lot more about what’s going on in the code than the compiler does.  This kind of coding obviously does not scale to the 100KLOC program but works very nicely for things that are both very hot and compartmentalize into libraries well.
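
To make the interpreter point above concrete, here’s a toy switch-dispatched bytecode loop in the pure-Java style (my own sketch, not from any real interpreter).  A C version can swap the switch for gcc label variables and, in my experience, pick up roughly 2x; the Java version is stuck with the switch, which the JIT compiles to a jump table:

      // Toy stack-machine interpreter, pure-Java style (illustration only).
      class ToyInterp {
        static final int PUSH = 0, ADD = 1, MUL = 2, RET = 3;

        static int run(int[] code) {
          int[] stack = new int[16];
          int pc = 0, sp = 0;
          while (true) {
            switch (code[pc++]) {
              case PUSH: stack[sp++] = code[pc++]; break;
              case ADD:  sp--; stack[sp-1] += stack[sp]; break;
              case MUL:  sp--; stack[sp-1] *= stack[sp]; break;
              case RET:  return stack[--sp];
              default:   throw new IllegalStateException("bad opcode");
            }
          }
        }

        public static void main(String[] args) {
          int[] prog = { PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, RET };  // (2+3)*4
          System.out.println(run(prog));                              // prints 20
        }
      }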

 

[MB: Any tool has its own advantages and disadvantages. You simply cannot say Java is superior to C/C++ because Java is written with C/C++ ;-).]

Actually there are several Java-in-Javas Out There and some are not too far off HotSpot’s mark.

[FG:  I think you’re missing two important issues with Java. One is memory consumption. Java programs use at least twice as much memory as C++ programs because pointers (called references in Java) are used for everything and all strings are UTF-16.]

[AM: Java requires 2.5 times the memory a C++ program requires.]

Yes and no…  Yes in general Java uses more memory than C++ but it’s usually more due to coding styles and data structure choices than pointers-vs-primitives and fat strings.  Whenever I see the 2x data bloat it’s always due to *really bad* data structure choices.  With reasonable choices the bloat is far far lower… and memory is cheap.  I’d really like to see some hard evidence here: really equivalent implementations of larger apps in both C & Java.

[FG: It gets even worse on 64 bit CPUs. For some applications like in memory databases, in memory data analysis, etc, this leads to an orders of magnitude performance difference because the disk has to be touched more frequently.]

I assume you mean 64-bit JVMs and not just 64-bit CPUs…

I guess I’d argue that you just mentioned a strong point of Java: with 64 bit JVMs you can have heaps of 100’s of Gigs of ram (and ram is cheap!) and cache all those DB accesses and touch disk less often.  Azul Systems routinely sells gear to do exactly that.  To be fair, the GCs on X86 JVMs do not handle such large heaps well.  Some Day they will, and then you’ll be delighted to have a 64-bit JVM, pay $49.99 for a few hundred Gigs of ram and skip the disk.

[FG: The second issue is cache locality. Obviously using only pointers instead of values in all collections (other than arrays of primitives) leads to a lot more cache misses.]

64-bit VMs with 64-bit pointers get pointer-size bloat, and that appeared to cost Sparc about 15% in performance. X86-64 picks up double the number of GPRs which offsets the cache footprint lossage somewhat.

[IJ: Hey Cliff, Regarding the comments about memory usage on 64-bit JVMs, compressed references can help for heaps smaller than 32GB on HotSpot.]

[FG: So I would say C++ is a better choice for any application that benefits from keeping a lot of data in memory. By the way, C# is vastly better than Java for such applications due to its value types.]

Yes and no again…  For many applications you’ll pay only 1 more cache-miss per entire array.  For arrays-of-small-structs (e.g. arrays of Complex), you are correct: Java’s lousy there (and I added that to C’s strengths).  When doing performance sensitive arrays-of-small-structs I turn the implementation 90-degrees and implement a small-struct-of-arrays.  It’s clearly a work-around over a clunky language issue… but the performance is solidly there (both in memory footprint and in access speed).
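
Here’s what that 90-degree rotation looks like in practice (a sketch of mine; ‘Complex’ is just a toy example, not a library type).  The array-of-small-structs version pays an object header plus a pointer per element; the rotated form is two flat primitive arrays, dense in cache and with no per-element objects:

      // Instead of:  Complex[] cs = new Complex[n];   // n objects + n pointers
      // rotate 90 degrees into a small-struct-of-arrays:
      class ComplexArray {
        final double[] re, im;                 // two flat primitive arrays
        ComplexArray(int n) { re = new double[n]; im = new double[n]; }

        // element i *= (cr + ci*I), in place, no objects allocated
        void mul(int i, double cr, double ci) {
          double r = re[i], x = im[i];
          re[i] = r*cr - x*ci;
          im[i] = r*ci + x*cr;
        }
      }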

Places where Java beats C/C++ for obvious reasons:

  • For Most Programs, runtime profiling is able to pay off well.  This is true of most large Java apps; one obvious case is where there’s a generic interface but at runtime only one implementation is ever used: after profiling the JVM can turn the interface call into a direct call and even inline the call.  It’s well known in the C/C++ world that peak benchmark scores come from using the profiling passes these compilers support, but it’s a pain to use them in practice… so probably 99% of all C/C++ programs compile without the benefit of profiling.  In practice, all Java programs are well profiled and this profiling data allows for better code generation – sometimes *much* better code generation.
  • Very large programs (1MLOC +).  This is totally anecdotal on my part but my not-very-rigorous survey of the industry tells me that the Java large-program tool-chain (and language features like GC) are more robust and complete than the C equivalents and this allows teams to write larger programs quicker than they could in C/C++.  Yes, large programs are written in C/C++, and yes they get the memory usage “right enough” that the programs run usefully well… but the same program written in Java appears to come together quicker, with fewer bugs and a shorter overall development cycle.  I see a *lot* more 1MLOC Java programs than C ones and it isn’t because Java programmers write fluffier code (which might also be true…); these large Java programs are really doing a “bigger” job than what can be squeezed into 100KLOC of C code.  In this case “Java beats C/C++” really means: we can’t afford to build these systems in C/C++ but we can in Java… so there isn’t any C/C++ equivalent.  Where’s the C/C++ version of WebSphere or WebLogic?  Maybe somebody Out There can tell me the state of C/C++ Application Servers…
    Got a comment that similar functionality comes from a bunch of separate cooperating C processes.  Not sure I believe that, as I haven’t seen anything close to the level of integration e.g. WebLogic has in the C world.
  • Garbage Collection is just easier to use than malloc/free.  This is well documented in the industry, and yes it’s not entirely “free”.  Yes the heap needs to be managed in production environments, leaks are still an issue, GC pause times are an issue, etc, etc… but overall it’s vastly quicker to write using GC, and if you spend the time saved making the program more performant or resilient to GC pause issues you’ll come out far ahead.  (I’m blithely ignoring all the C/C++ hand-rolled memory management techniques like “arenas” or “resource areas”; these fall into the category of “malloc/free is so hard to use that we rolled our own poor-man’s GC, but if the language had GC we would probably never have bothered”).
  • GC allows for entirely different algorithms; it isn’t just easier to use than malloc/free.  Many concurrent algorithms are very easy to write with a GC and very hard (to downright impossible) using explicit free.  Reference counting is commonly used in these situations in C/C++ but it adds a lot more overhead (sometimes a *lot* more overhead) and is much harder to get right: e.g. a common mistake is keeping the count in the structure being guarded.  (See the lock-free stack sketch just after this list.)
  • Very Good Multi-Threading Support.  Parallel programming is just easier in Java.  
    [SS: Next dimension of Java performance is the easier access to parallelism. Using j.u.concurrent I can beat C/C++ most of the time just using more processors in my 16 core server. Concurrent memory management by hand is a pain in the ass and using GC kills the point of using C/C++]
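
Here’s the flavor of what I mean by “entirely different algorithms” – a minimal Treiber-style lock-free stack (a sketch, not production code).  With a GC the CAS-retry loops below are the whole story; the same pop() in C/C++ is exposed to the classic ABA and unsafe-reclamation problems unless you bolt on hazard pointers, epochs, or reference counts:

      import java.util.concurrent.atomic.AtomicReference;

      // Lock-free stack that leans on the GC: a popped node can never be
      // recycled out from under another thread, so plain CAS is enough.
      class LockFreeStack<T> {
        private static final class Node<T> {
          final T item;
          Node<T> next;
          Node(T item) { this.item = item; }
        }

        private final AtomicReference<Node<T>> top = new AtomicReference<Node<T>>();

        public void push(T item) {
          Node<T> n = new Node<T>(item);
          Node<T> old;
          do {
            old = top.get();
            n.next = old;
          } while (!top.compareAndSet(old, n));
        }

        public T pop() {
          Node<T> old, next;
          do {
            old = top.get();
            if (old == null) return null;      // empty stack
            next = old.next;
          } while (!top.compareAndSet(old, next));
          return old.item;                     // 'old' gets reclaimed by the GC later
        }
      }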

 

Places where C/C++ proponents claim C beats Java, but it doesn’t appear (to me) to do so:

  • Most ‘plain’ modest-sized programs.  These are programs requiring no more than the “usual” compiler optimizations and that are not so tightly constrained by machine size or startup time.  Examples might be things from simple compute-bound loops (string hash, compress) up to IDEs & editors (and most visual tools); DB cache/drivers, etc.
    [SS: For the example where Java beats C/C++ a number of times visit www.caucho.org and their OSS PHP engine Quercus. You can check the numbers yourself using my http://code.google.com/p/wikimark/ For the example of super-fast memory DB in Java visit http://www.h2database.com it beats MySQL both in performance and footprint 🙂 Of course it’s specifically tuned for in-memory use. And Java is just an example of so called Managed Runtime (Microsoft term). If we are talking about Java performance we are mostly considering JITed code performance. But JIT can be effectively applied to C/C++ as well, see http://llvm.org Apple Mac OpenGL implementation is based on LLVM and all OpenCL implementations too. So anyone running their games on Mac or planning to use OpenCL will use the JIT. ]
  • 16-bit chars vs 8-bit chars.  It’s true that ‘char’ in Java is similar to C’s ‘unsigned short’, but ‘byte’ exists and is similar to C’s ‘signed char’.  This appears to confuse some number of new C-to-Java converts, but many of the old C tricks for dealing with different-sized data types work great in Java and the JIT is very good at bit-fiddly code.  There are generally lots of Strings in big Java programs and this leads to memory bloat, but if you deal with high-volume String data and it’s all ASCII (no internationalization!) then converting the data to byte arrays makes sense.
    [MI: Internally, bytes are stored as 16 bit integers but cast down to only 8 bits.]
    No, ‘bytes’ are stored in memory as 8 bits always; ‘chars’ and ‘shorts’ as 16 bits.  CPU register contents always depend on the JIT’ing of the moment – for both C and Java.
    [MI: it’s necessary to add a check for the state of the byte and a math operation to correct it prior to every byte transformation, and a check for the state afterwards.]
    Just mask the results (no “testing” or control flow), or use byte-typed variables.
  • Raw Small Benchmark Speed – This is actually mostly a non-issue, as Real Programs rarely look like these things, nor run for <1sec, nor have all their time spent in 3 lines of uber-hot code, etc… But Java still looks fairly good here, despite the general static-compilation bias built into tiny, short-running programs:
    [SS: First about Java performance. Java is the second fastest mainstream language after C/C++ in the Benchmark game, see http://shootout.alioth.debian.org/ And it’s less than 2 times slower than C/C++ in this mostly numerical benchmark. It could also be noticed that JRuby is faster than native Ruby. Latest versions of Sun JVM efficiently use SSE and are comparable with C in FP performance. http://math.nist.gov/scimark2/index.html]
  • First-Person “Shooter” PC Games [Retracted: I’ve gotten several well-written posts explaining how games have changed over the past twenty years.  :-)] using the same game card & interfaces (e.g. DirectX).  Most of the heavy lifting is done by the game card itself; the game engine is more about managing other resources and running the game state & AI’s… all of which seems to me to work nicely in Java.  I’ve met at least one person working this approach for-real (and it was *working* for real, but I haven’t kept in touch so I don’t know how it came out in the end).   

[AD: Given that most games use ‘Optimizations beyond “gcc -O2″‘, and tend to have an interpreter or scripting language, and often require ‘Direct machine access’, plus plenty of tricky computationally intensive maths, that would put them squarely into the world of C++. Especially any game designed for a console, or handheld device.]

For PCs: I’m still thinking Java is up to the task.  I’ve yet to see games that needed BLAS-style routines; simple ‘saxpy’ style loops should come really close to C performance (not heavily tested!  But I’ve routinely talked people into testing Java FP performance and routinely had them come back with a positive report).  If ‘direct machine access’ is limited to a handful of graphics-card calls per frame (so hundreds to thousands per second), then JNI can handle that no-problem.  The games I worked on long ago didn’t use any scripting; back then we would have used Java instead of scripting, so I don’t have a feel for how crucial scripting is to game development.

For consoles & handhelds perhaps; I did games on console-like devices a long time ago (20+ years) and my vague recollection is that if a modern JVM could be squeezed into such a device it would be able to do the job.  I weakened my assertion to just PC games.

[TNT: A surprisingly large part of a game is performance sensitive and requires C code. Many games are CPU bound (not GPU bound as you suggest).]

Ahhh, but exactly what I am showing here is that you cannot equate ‘performance sensitive’ and ‘requires C code’.  Java is up to speed for most (if not all!) of the performance sensitive parts.

[AJ: AAA games now typically use middleware or custom physics engines that have highly-tuned collision detection code (also used by game “AI”) and custom nonlinear solvers. Both are often SIMDed, prefetched and cache blocked, with the CD sometimes doing bit-fiddly decompression, and the solvers sometimes using custom code gen (for derivatives and such).]

JVM: SIMDed: 1/2 yes, prefetched: yes, cache blocked: no.  Perhaps closer than you think but probably not close enough.

[AJ: Some PC games push 5-10K draw calls per frame, with very roughly 4 additional state-setting calls per draw. So @60fps, that’s something like 1.5-3M/sec. Games are often single-thread limited on this alone.]

The obvious fix is to batch the graphics calls per JNI call, but this starts to look like a hybrid C/Java solution and those rarely look pretty.

[AJ: PC games also chew plenty of CPU with real-time decompression (unzip or custom) and sometimes recompression (say JPG->DXTC).]

Azul Systems can’t use the X86 ZIP routines so we went with the Java ones: performance was about as good as ‘gcc -O2’ and it was easy to parallelize.  After parallelization the Java version was as fast as we cared to add CPUs.

[JJJK:  As for Java and games, it actually works better than most people think. No problem for indie games. But there are some issues:

GC pauses ruin everything. Smooth framerates are difficult to achieve with the usual Fully-OO-Java-Programming-Style. Also the resources on the GPU are not garbage collected, so you’ll have some kind of paradigm-clash anyway.

If you want to do some transformations on vertex arrays on the CPU, you’ll have to do them on direct byte buffers, since they are the only arrays that can be sent to the GPU. Or do them on java arrays and copy them into byte buffers, I have no idea what that does to performance though.

3D-Vector math in java is plain ugly. You can either make it readable or fast. And if you don’t pay attention, it will be neither.

On the other hand, with more data and computation going to the GPU (and staying there for the most part), Java is at least becoming moderately attractive for games. I worked in a company which is now starting to release web-based 3D java games.]

“Flaws” in most Java-vs-C performance methodologies:

These are ways in which many many people (wrongly) claim Java/C/Ruby/etc is faster than C/Java/Python/etc.  Sometimes these issues aren’t flaws at all, but instead point out conflicting basic requirements that truly make one language superior to another for a particular task.

  • Warmup.  Sometimes no-warmup is important (see comments above about pacemakers), but more often a short warmup period is irrelevant to the overall application.  If I’m using an IDE, I expect a largish loading period… but then I’m using the IDE all day. I don’t use an IDE for 0.1 sec.  If warmup is NOT important to the application, then allow the JVM a warmup period before comparing performance numbers.  Many of the benchmarks in the language shootout at http://shootout.alioth.debian.org fall into this camp: too short to measure something interesting.  Nearly all complete in 1 minute or less.  A very large set of Java programs (and programmers) write server programs that run for days and a combination of throughput and quick response under load are the key measures.
  • Not comparing the same overall algorithm.  This is common for larger programs where exact Java & C equivalents do not exist… sometimes one version or the other gets saddled with a really bad implementation.  And sometimes you just can’t do it but people try to “fake it” with a straw-man for the “other” side.  E.g. any-GC’d-language doing a bunch of concurrent algorithms versus non-GC’d language, or direct code-gen especially of unusual machine instructions (e.g. X86 codecs using MMX).  Again the language shootout suffers this problem… and so all results have to be carefully inspected.
  • Nyquist sampling or low-res sampling errors.  e.g. using a milli-second resolution clock when reporting times in the milli-second range.  Both Java & C have common timer APIs reporting times below the millisecond (micro & nanos), but actual real hardware & common OS’s vary widely with what they implement. 
  • Broken statistics.  This is a hard problem in Java and easy to get wrong for subtle reasons, but people get it wrong in C/C++ and other languages also.  Running anything *once* suffers from a huge variety of startup issues.  Re-running in the same program gets you past one issue and into another: the JVM/JIT “compile plan” will vary from run-to-run.  Within a single run you might repeatedly get “X frobs/sec” (say for 10 runs in a row in the same JVM launch) but if you restart the JVM you can easily see “Y frobs/sec” reliably (repeated 10 times in a row in the same JVM) one-in-10 runs.  This kind of variation can only be managed with proper statistics. See “Statistically rigorous Java performance evaluation”.  (A minimal measurement-harness sketch follows this list.)
  • Flags, version numbers, environmental factors all matter: “java” is not the same as “java -client” or “java -server”, and might make a 10x difference one way or another.  Same as using “gcc -O1” vs Intel’s reference compiler set at the max “-O” flag.  Linking your C program as “ld b.o a.o” can give a 30% difference from linking as “ld *.o” (link order affects I-cache layout).  Environment variable sizes (i.e., length of your username) can push the initial stack positions up or down to where the stack collides in the cache or not, etc.  See “Producing wrong data without doing anything obviously wrong!“. Again, well reported numbers and good statistics are your friend here.
  • Varying dataset sizes: I’ve seen test harnesses that re-sized the dataset to keep the runtimes approximately equal to 1 second, but once the dataset no longer fit in cache performance dropped by 10x.
  • Claiming ‘X’ but testing ‘Y’:  examples might be SpecJVM98 209_db claiming to be an in-memory DB test, but really it’s a bad string-sort test, or writing a single-threaded program to test OS locking speed (with JVMs, uncontended locks will never hit the OS) or the Caffeinemark logic test (with a little loop unrolling the code is entirely dead).  See more examples here. Modern larger benchmarks do fairly well here but many microbenchmarks run afoul of this.
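
To show what I mean by warmup and proper repetition, here’s a minimal measurement-harness sketch (my illustration only; a real study should follow the statistics from the papers above, and the whole thing should also be repeated across several fresh JVM launches):

      public class Harness {
        interface Work { long run(); }

        // Warm up, then time 'trials' runs inside one JVM launch.
        static double[] measure(Work w, int warmupIters, int trials) {
          long sink = 0;
          for (int i = 0; i < warmupIters; i++) sink += w.run();   // let the JIT settle
          double[] secs = new double[trials];
          for (int t = 0; t < trials; t++) {
            long start = System.nanoTime();
            sink += w.run();
            secs[t] = (System.nanoTime() - start) / 1e9;
          }
          if (sink == 42) System.out.println();    // keep the results live so the loop isn't dead code
          return secs;
        }

        static void report(double[] secs) {
          double sum = 0, sq = 0;
          for (double s : secs) { sum += s; sq += s*s; }
          double mean = sum / secs.length;
          double sd = Math.sqrt(sq/secs.length - mean*mean);
          System.out.printf("mean %.4f s, stddev %.4f s over %d trials%n",
                            mean, sd, secs.length);
        }

        public static void main(String[] args) {
          Work hash = new Work() {                 // stand-in workload
            public long run() {
              long h = 0;
              for (int i = 0; i < 10000000; i++) h = 31*h + i;
              return h;
            }
          };
          report(measure(hash, 20, 30));           // 20 warmup runs, 30 measured trials
        }
      }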

[SG: One thing I might suggest is dropping your local memory caches before performing the test. Java gets an obvious speed boost because the JVM code tends to be cached, and the C code also has the same.

sync && echo 3 > /proc/sys/vm/drop_caches

Then run each program twice. First one giving you the time it takes to access and load the JVM and other libraries included, and the second giving you what it takes once the libraries and everything is cached.]

No.  Run at least *30* times to get a decent statistical regression.  See the above papers: the ‘run it twice’ methodology is *seriously* flawed.  However, dropping the local memory caches probably helps.  Here’s a nice writeup on Java perf testing issues: http://www.ibm.com/developerworks/java/library/j-benchmark1.html.

Some Commonly Mentioned Non-Issues:

  • [AM: C++ stack allocation beats Java GC for allocation of small objects.]

    Does not.  You have evidence?  I have evidence that small object allocation in Java is darned near free… but not totally free.  So C++ might win here but not by any interesting amount.  And HotSpot does stack allocation when it can.  You should do some testing here to convince yourself.

  • [AM: Java apps have lots of dynamic casts in them.]

    Yes, and it’s free.  Really 90% of all casts are removed by the JIT and the other 10% take a ‘load/compare/branch’ sequence which is entirely predicted by X86 hardware and runs in 1 clock cycle.

    [EXO: This is pretty off topic, but I’d like to know how you’ve reasoned that a load/compare/branch sequence can be done in 1-cycle on x86. Of course this varies by implementation, and I’m sure that there are x86’s that can handle the compare and branch in parallel despite having a dependency chain on the flags, probably due to careful misalignment of the relevant pipeline stages. But there’s no way you can cmp on a loaded value the same cycle it’s loaded in. The L1 dcache’s 2 or (more likely) 3 cycle latency is going to get in your way. Sure, the pipeline can be busy doing other independent things in the meantime, but that’s always the case and I don’t think it’s what you were getting at.]

    You’re close: indeed the cmp happens many cycles after the load.  Meanwhile the branch predicts – and correctly >>99.99% of the time – and X86 O-O-O dispatch carries on past the branch.  As long as the load isn’t a full miss to memory the whole ld/cmp/jne sequence will retire long before it causes the X86 pipeline to stall, and consume perhaps 1 clock of decode/exec/retire work.

  • [AM: Interface dispatch is slower than double dispatch through a vtable.]

    Yes, but the slow path is nearly never taken.  I’ve yet to see interface calls show up on any profile of any interface-heavy crazy Java programs.  Inline-caches replace 99+% of all interface calls.

  • Try to write “XXX” in Java with the same speed.  In this case: http://rapidxml.sourceforge.net/manual.html.  In general, these kinds of comments are useless, because who has the time to do it?  In this case, it looks to be about a month’s worth of work (you gonna pay my salary?) … and entirely doable… except I’d come up with an entirely parallel XML parser so I believe time to parse could be dropped from roughly ‘strlen()’ on a long string to ‘strlen() divided by #CPUs’.  The thing these kinds of comments totally miss is this: plain ‘olde Java-looks-like-C code translates to machine instructions same as C does.
  • [TNT: Where does C# stand compared to the mentioned languages?]

    Not that I track C# all that closely but… I believe the JIT produces substantially slower code than Java; Microsoft leans pretty heavily on static compilers (and has a better statically-compiled-code integration story than Java does).  Also, Java’s Memory Model is well tested & well used; the C# equivalent appears to be not as robust, in part because of the requirement to run all that old code.  A real C# expert should chime in here, I’m not able to give C# a fair treatment.

    [IK: C# would land in the same bucket as Java.  It might be slightly slower because more money and man-years were put in JVM, or it might be slightly faster because of unsafe and structs and stuff.

    “Slightly” means “a few benchmarks would be ruined”. Like, I’ve heard, CLR did NOT do interface-vs-implementation profiling, thus some code gained 10x boost by replacing all occurrences of “IList” with “List” (it couldn’t figure out how to dispatch calls to concrete List class really fast when the slot type was IList).]

    [PL: C# performance would fall into roughly the same range as Java. I have a RETE engine written in Java and C# and the java version is faster. One area where CLR takes a performance hit is autoboxing/unboxing. At least that’s from my experience with my rule engine. Aside from that, I would say the performance difference isn’t significant.]

    Last round of anecdotal evidence I gathered (now 2 yrs old) showed Java JITs well ahead of C# JITs.  Would love to see some hard numbers here.

  • [CC: I still don’t understand why they do not cache the validated and optimized memory image for next time the application is launched. .NET can do this.]

    It’s a really hard problem for multi-threaded programs – which typically do parallel class loading – and lazy compilation – which typically inlines the classes that are loaded *at the moment*.  Re-using previously JIT’d code will require, amongst other things, that your code loading order is the same, and the last code-load order you JIT’d from probably depended on stuff like network packet arrival order.  Given that startup time for big User GUI apps is typically NOT the JIT (it’s the DISK), I personally have not been very motivated to try and do cached code optimizations.

    On purpose, I’m being sloppy in the numbers I report here… because I don’t want to spend my entire life beating this tired horse.  But if somebody reports something widely different from what I’m seeing, I’m happy to dig in further – if the reporter is also.  I don’t have access to every kind of compiler & system on the planet so I can’t repro other peoples’ results easily.  Also in my experience, the number one reason for conflicting reports here is because the reporter has something really simple wrong on their end and a short email session will clear it up.

    [WW: Here’s a hint. Next time you write silly code for comparison, make them do something more useful than integer operations and basic maths… those will always be the same (or close) between a compiled language and an interpreted one with JIT.]

    Sorry, but I have a life outside endless blogging… and 100KLOC examples aren’t useful in a blog format anyways.

    String Hash Example:

    Complete C++ code:  http://pastebin.com/d280c1cd4
    Complete Java code: http://pastebin.com/m541c4655
    Here’s a bit of code computing string hashes that looks ideal for a static compiler (ala C/C++), yet Java is tied in performance.  I used fairly recent versions of ‘g++ -O2’ and ‘java -server’ on a new X86 (-server is the default for me, so no cmd-line argument needed).  The inner loop is:

     

      int h=0;
      for( int i=0; i<len; i++ )
        h = 31*h+str[i];
      return h;

    Here I ran it on a new X86 for 100 million loops:

    > a.out         100000000
    100000000 hashes in 5.636362 secs
    > java str_hash 100000000
    100000000 hashes in 5.745 secs

    Last year I ran the same code on an older version of both gcc & Java, and Java beat C++ by 15%.  Today C++ is winning by 2%… so essentially tied.  Back then the JVM did unrolling and “gcc -O2” did not, and this code pays off well when unrolled.
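
    Since pastebin links rot, here’s a condensed version of the Java side inline (it may differ in small details from the linked file, but the hot loop is the same):

      public class str_hash {
        static int hash(char[] str) {
          int h = 0;
          for (int i = 0; i < str.length; i++)
            h = 31*h + str[i];
          return h;
        }

        public static void main(String[] args) {
          int iters = Integer.parseInt(args[0]);
          char[] str = "the quick brown fox jumps over the lazy dog".toCharArray();
          long start = System.nanoTime();
          int sink = 0;
          for (int i = 0; i < iters; i++)
            sink += hash(str);                   // accumulate so the loop isn't dead code
          double secs = (System.nanoTime() - start) / 1e9;
          System.out.println(iters + " hashes in " + secs + " secs (" + sink + ")");
        }
      }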

    [TD: In the String Hash example, is the unrolling done by the JIT or javac?]

    Done by the JIT.

    Sieve of Eratosthenes:

    Complete C++ code: http://pastebin.com/m3784c090
    Complete Java code: http://pastebin.com/m4b414295
    Here’s a simple Sieve of Eratosthenes, again compiled with g++ -O2 and run with java -server.  Again this looks ideal for a static C/C++ compiler and again Java is tied in performance:

      bool *sieve = new bool[max];
      for (int i=0; i<max; i++) sieve[i] = true;
      sieve[0] = false;
      sieve[1] = false;
      int lim = (int)sqrt(max);
      for (int n=2; n<lim; n++) {
        if (sieve[n]) {
          for (int j=2*n; j<max; j+=n)
            sieve[j] = false;
        }
      }

    I computed the primes up to 100 million:

    > a.out      100000000
    100000000 primes in 1.568016 secs
    > java sieve 100000000
    100000000 primes in 1.548 secs

    So again essentially tied.

     

    [AJ: How do the sieve timings differ if you use an array of bits rather than of bytes?]

    Good question, but I bet they’re tied again.  The JIT does fine with bit-twiddley stuff.  Test it and let me know!
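
    If you want to try it, here’s a quick sketch of the bit-per-candidate variant using java.util.BitSet (untested against the bool[] version, so treat it as a starting point):

      import java.util.BitSet;

      public class sieve_bits {
        public static void main(String[] args) {
          int max = Integer.parseInt(args[0]);
          long start = System.nanoTime();
          BitSet composite = new BitSet(max);    // bits start false == "assume prime"
          int lim = (int)Math.sqrt(max);
          for (int n = 2; n <= lim; n++)
            if (!composite.get(n))
              for (int j = 2*n; j < max; j += n)
                composite.set(j);
          double secs = (System.nanoTime() - start) / 1e9;
          System.out.println(max + " sieved in " + secs + " secs");
        }
      }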

    [AL: Note that your sieve is not a true one: http://lambda-the-ultimate.org/node/3127]

    Cute!  Just swap ‘2*n’ for ‘n*n’… but this doesn’t change the C-vs-Java argument. 


     

    Profiling Enables Big Gains:

    Complete C++ code:
    vcall.cpp  http://pastebin.com/m70dbe7d6
    vcall.hpp  http://pastebin.com/m13055a8c
    A.cpp      http://pastebin.com/m5aa1b232
    B.cpp      http://pastebin.com/m2e46ec23

    Complete Java code:
    vcall.java http://pastebin.com/m149bbdf0
    A.java     http://pastebin.com/m2e33d6df
    B.java     http://pastebin.com/m2b1d75bb

    This bit of code makes a virtual-call in the inner loop of a simple vector-sum… and selects the v-call target based on a command-line argument.  The JVM profiles, decides there’s only 1 target and inlines … and unrolls the loop and discovers a bunch of simple math that collapses in the optimizer.  The C/C++ compiler can’t do it because there really are 2 possible targets at static compile time.  Delaying the compilation until after profiling can enable major optimizations downstream.  I compiled the C++ version with ‘g++ *cpp’ and the java version as ‘javac *java’.

     

        int sum=0;
        for (int i = 0; i < max; i++) 
          sum += val();  // virtual call
        return sum;

    Here I run it on the same X86:

    > a.out 1000000000 0
    1000000000 adds in 2.657645 secs
    > java vcall 1000000000 0
    1000000000 adds in 0.0 secs

    The Java code is “infinitely” quick: after JIT’ing the -server compiler essentially deduces a closed-form solution for the answer and can report the correct result with a few bits of math… no looping.  This example is ridiculous of course, but it points out the obvious place where dynamic compilation beats static compilation.  This “make a virtual call into a static call” optimization is a common and major optimization that JITs can do and static compilers cannot.
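
    Here’s the whole thing collapsed into one file for readability (the pastebins above have the originals, split across vcall.java, A.java and B.java; this condensed version may differ in small details):

      abstract class vcall {
        abstract int val();

        int sum(int max) {
          int sum = 0;
          for (int i = 0; i < max; i++)
            sum += val();                        // virtual call in the hot loop
          return sum;
        }

        public static void main(String[] args) {
          int max = Integer.parseInt(args[0]);
          vcall v = args[1].equals("0") ? new A() : new B();   // pick the one target at runtime
          long start = System.nanoTime();
          int sum = v.sum(max);
          double secs = (System.nanoTime() - start) / 1e9;
          System.out.println(max + " adds in " + secs + " secs (sum=" + sum + ")");
        }
      }
      class A extends vcall { int val() { return 1; } }
      class B extends vcall { int val() { return 2; } }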

     

     [QB: If your compiler can reason that the virtual function is pure, then the entire loop can be folded into single vcall and multiply. ]

     

    [ALJ: It’s worth noting that you can achieve the same effect – with the compiler turning multiple additions into a single multiply – by using C++ templates.]

    My example is perhaps too trivial; I can get the same performance benefits with a non-pure function (so no chance of using ‘pure’ in a static setting).  In practice, you get dozens to hundreds of Interfaces turned into concrete Classes, and hundreds to thousands of call sites optimized this way… using a code-cloning technique like C++ templates is going to blow you up with exponential code.

    [QB: Although initially expensive and not quite thread-safe, self-modifying code will do the trick here. Alternatively, you could use existing dynamic recompilation techniques to make it thread safe, which I understand is probably veering off into dynamic compilation land…]

    Exactly: inline caches are thread-safe self-modifying code… but they still look like a call (inline caches make vtable calls as cheap as plain calls).  In this case the big gain comes from removing the call altogether, which means knowing there’s only 1 implementation of the virtual.


     

    I hope this clears up where I stand on this issue…  and I’m (sorta) looking forward to the flamefest which is surely coming…

    Cliff