A Brief Conversation with David Moon

  What follows is an email conversation between myself, David Moon & Daniel Weinreb. For the younger readers: David Moon is one of the original architects of the Symbolics machines, which 20 years ago attempted more-or-less the same sorts of things Azul is doing now, i.e. language-specific hardware support. I lightly edited the emails for clarity and ordering (otherwise the nested interleavings get horrendous to follow). 

[Ed: 11/20/2008 – Some small follow-on conversation has been appended to the end]

Cliff Click wrote: Mostly I felt inspired by your Symbolics work. Azul Systems makes gear for running Java (but actually it runs a C/C++ program – we just happen to run a large C++ that implements Java). One of the biggest impact changes we made was a hardware read-barrier for GC – a simple instruction that tests invariants of freshly loaded pointers and takes a fast-trap if the test fails. Fixup is entirely in software. It’s a RISC-y approach to the GC hardware that Symbolics had 20 years ago. 

David A Moon wrote: I remember this now; this was one of the things that impressed me when I read about your machine a couple years ago. As you may or may not know, the Symbolics 3600 had fairly complex GC hardware and the Symbolics Ivory went 90% of the way towards riscifying it, eliminating all the large hardware tables. I think there was still a 16×1 bit (NOT 16Kx1 !) table to indicate which portions of address space are oldspace; with 64 address bits instead of 32 you were able to reduce that to a single bit, which is nice. 

Cliff Click wrote: We use very simple 3-address 32-register 64-bit RISC cores, with 54 cores per die. All chips are cache-coherent with uniform (mediocre) memory access time for all; the upper bound is 16 chips or 864 cpus. The “big” box is about the size of a dorm-room fridge, a 14U form-factor. The big box also has top-20 super-computer bandwidth (barely). Modest 16K I- & 16K D-caches for all cpus, plus groups of 9 share a 2-meg L2. Chips are fully interconnected. 

David A Moon wrote: 3-address 32-register 64-bit address, 32-bit instruction RISC does seem to be the “sweet spot” for instruction set architectures. [Cliff: Yes!] I hope you avoid needless complexity like condition codes, link registers, maybe even separate integer and floating point registers. [Cliff: Yes, yes & yes] I don’t think Java needs floating point rounding and trapping modes either. [Cliff: Yes, it does not] For your application you probably don’t even need separate user and supervisor modes since all executable code is generated by your just-in-time compiler from a safe language. 

Cliff Click wrote: That’s the theory but the practice is a little different: JVM’s crash and 1 crashing JVM should not bring down the whole box. So we in fact have a user/kernel split and it’s saved our butts any number of times. 

David A Moon wrote: Maybe you don’t need any kind of mode bits register at all? Being completely modeless would be cool. 

Cliff Click wrote: There’s also a speculative-memory mode: writes are buffered in the L1 but don’t become visible until a ‘COMMIT’ instruction. 

David A Moon wrote: That’s a pissload o’ cores. Are those 54 real cores per die? Or is it perhaps 6 real cores, each executing 9 interleaved instruction streams, in the style of Denelcor HEP? 

Cliff Click wrote: Nope, real cores all the way. 

David A Moon wrote: Does the precisely defined Java memory model give you any help in making a more simple implementation of cache coherency than is usually done e.g. for Intel x86 architecture? Or did that memory model come out after your hardware was designed? 

Cliff Click wrote: The Java Memory Model came first, and we designed the hardware around it. Actually the hardware implements something somewhat weaker than the JMM by default, and the JIT inserts memory FENCE ops as needed. More like Sparc RMO or Power’s memory model. 

David A Moon wrote: I assume uniform memory access doesn’t mean cached accesses aren’t any faster than cache misses, but means that when you do miss the L1 and L2 caches, memory access time is about the same no matter where the memory resides in the interconnection network. [Cliff: Yup] 

It must be a real challenge to get enough memory bandwidth to keep 864 instruction streams going at full speed, even more so with such tiny caches. 

Daniel Weinreb wrote: Moon says: “… but means that when you do miss the L1 and L2 caches, memory access time is about the same no matter where the memory resides in the interconnection network.” Actually, I heard a talk about a year ago, at the first New England Database Day, suggesting that the way information moves between the L1 caches of the cores is by a sequence of moves, each being up, down, right, or left, in the direction of the destination core.

Cliff Click wrote: No, we’re a full direct interconnect on the L2’s. 9 L1’s share an L2. One step to the L2, then one step to the correct remote L2 – across the box in any direction (including another L2 on the same chip; it was easier to run outside the chip and back in than to arrange a special faster path for within-chip communication). 15/16ths of your L2 misses are remote and the gain from speeding up the remaining 1/16th of the misses isn’t worth it. 

Daniel Weinreb wrote: If so, the times would not actually be uniform. Taking advantage of such knowledge in the software would be challenging except for special-purpose domains, I would think. I don’t know how the 54 L1 caches (or maybe fewer, if you’re doing the pipelined business that Moon described) are interconnected, but there is a limit to how many can participate in a full crossbar NxN connection. [Cliff: Key word: challenging] 

Cliff Click wrote: We have a lot of memory bandwidth – one of our bigger boxes is on the top-20 supercomputer bandwidth list. The chips are fully interconnected. 

David A Moon wrote: Meaning every chip has 15 ports directly connected to another chip? [Cliff: Yup] That’s a lot of pins. You might want to look at http://www.sicortex.com/media/files/the_kautz_digraph_as_a_cluster_interconnect for an interesting slimmer approach. 

Cliff Click wrote: I’m sure we looked at sicortex at one point. 

Daniel Weinreb wrote: I agree about the challenge of the memory bandwidth for all those cores. Do you have very wide paths between the chips and the main memory? 

Cliff Click wrote: There are 4 memory controllers per chip (and up to 16 chips so 64 controllers). All memory is striped across all controllers at a fine grained level so each core’s memory requests are spread across the box – mostly eliminating “hotspots”. 

David A Moon wrote: Also they are a message passing architecture and you have to be a shared memory architecture, but I think that’s a separate issue from the interconnect topology. I am no longer sure which I believe in. On the one hand, I am a shared memory guy from way, way back. On the other hand, it is becoming increasingly expensive to pretend that memory is uniform and random-access, and the illusion is becoming increasingly unconvincing. The von Neumann architecture may be reaching the end of its life. I like the way the Cell treats main memory as essentially an I/O device, but I have never attempted to program it. 

Cliff Click wrote: By all accounts, programming the Cell is a nightmare. Since we can do shared-memory with so many cores, other people can too. The more difficult issue is parallel programming that uses all those cores. Shared memory is one way to make programming easier. 

Daniel Weinreb wrote: What about cache coherency over multiple chips? 

These days, as Moon has pointed out to me, the importance of caches means that you gain a lot by not looking at your address space as a sequence of co-equal words, but rather as a sort of block-oriented device (a little like the way one looks at a disk), in which the block size is the size of the cache line in some relevant cache (probably the L2, but it depends a lot on the hardware, especially if the hardware is novel). (See the word “illusion” in his mail.) 

Does your Java implementation specifically take this into account, e.g. allocating objects aligned on cache line boundaries, implementing collections as B-trees with nodes being an integral number of cache lines, and that sort of thing? If not, you might consider benchmarking it; I’d love to know the results of such a test. 

Cliff Click wrote: Our hardware has short cache lines (32b) so there’s less to be gained by aligning objects to cache lines. For certain benchmarks (not to mention SpecJBB or anything) such alignment is probably worth a lot – but not for most use cases. Perhaps it makes sense to expose cache-line alignment to the Java/JDK programmer? 

Cliff Click wrote: Things we do specifically for Java:

  • GC read-barrier enables a fully parallel & concurrent GC; we can sustain 40G/sec allocation on a 400G heap indefinitely, with max-pause times on the order of 10-20msec. This uber-GC is partially made possible because of the read barrier (and partially possible because we ‘own’ the OS and can play major page-mapping tricks).

David A Moon wrote: That is indeed impressive! Even if I divide 40 GB/sec by 864 it’s still fast. Of course you need it, much Java code really likes to generate garbage all over the place. My pet peeve is no way to return multiple values from a method without generating garbage. 

Daniel Weinreb wrote: Dave, you say: “My pet peeve is no way to return multiple values from a method without generating garbage.” Rich Hickey, in his talk about Clojure (Cliff: a new Lisp dialect implemented on the JVM, mainly having been run on the Sun JVM, but he ran it on an Azul at least once), addressed this issue. [Cliff: I’m familiar with Clojure] He left out multiple value returns exactly on the grounds that the consing cost for a small returned value is so cheap these days, ephemeral GC’s being so damned good. I do not know to what extent he has done actual metering, though. 

Cliff Click wrote: Yes, annoying. And it’s embedded into the JVM bytecodes that way, so you can’t even escape the problem by changing languages while keeping the JVM. The only hope here is for a combo of JIT’ing & smart memory allocation: either inline & optimize away the junk object, or do some sort of stack-based allocation for the very short-lived object. 

David A Moon wrote: Can you say anything about what page-mapping tricks are good for? 

Actually I am a little surprised you bother with virtual memory at all. If memory gets full surely it is faster to do a garbage collection than to swap pages out to disk (I think you don’t even have a disk if I remember correctly). Maybe page mapping is just to avoid core shuffling if there are multiple heaps? Is the page size large to cut down on mapping overhead? 

Cliff Click wrote: Ok, a bunch of things here:

  • Page sizes are 1Meg.
  • No swap; swapping is death for GC. If you are out of memory, then you are OutOfMemory. To put it another way: you are going to blow your performance guarantees so badly, the JVM might as well have crashed. Better to make the user allocate enough resources up front to keep the performance guarantees.
  • The page mapping is integral to the GC; we free physical memory earlier in the GC cycle than virtual memory. This lets us recycle physical memory much sooner, and makes it much harder for an application to “surprise” the concurrent GC – i.e. an app whose allocation rate suddenly accelerates might otherwise run the concurrent GC out of physical memory and force an application stall.
  • The mappings are also used to deny user-mode threads access to a page where the objects are being concurrently relocated, but still allow GC-mode access to that page. GC-mode is a subset of user-mode (so we sorta are like a 3-privilege mode machine), except that user-mode can promote itself to GC-mode anytime it wants.
  • The read-barrier will take a fast-trap on failure, and by default get promoted to GC-mode in the trap handler. This lets the faulting CPU fixup the object reference and continue, without needing to wait for GC to catch up.
  • We also have a write barrier, but it’s merely a slight efficiency hack. The read-barrier enables the good GC.

David A Moon wrote: I would expect the write barrier to be more than a slight improvement, unless your memory scanning is incredibly fast. The main thing the write barrier accomplishes is to greatly decrease the number of pages (or cards or other chunks) that have to be scanned for pointers to the youngest generation. 

Cliff Click wrote: Ahh…. here’s the difference: a ‘card-mark write barrier’ can be fairly cheaply done with plain ol’ instructions, so the write-barrier instruction is only a delta above that. The biggest change is that our write-barrier op ‘filters’ card-marks that don’t need to happen, which in turn cuts down on useless card-mark writes. Writes are generally really cheap – unless there’s a ton of contention caused by rapid updating of an old-gen object holding a pointer into young-gen. In that case the filtering is worth a lot. 
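[Ed: For readers who haven’t seen one, here is a rough Java sketch of a filtered card-mark write barrier – purely illustrative, with invented card sizes and addresses modeled as plain longs; this is not Azul’s hardware or HotSpot’s actual code:]

```java
// Hypothetical sketch: a card table covering the heap, with the "filter"
// that skips the card-table store when the card is already dirty.
public class CardBarrier {
    static final int CARD_SHIFT = 9;          // 512-byte cards (illustrative size)
    static final byte DIRTY = 1;
    final byte[] cardTable;
    long cardMarkStores = 0;                  // count of actual card-table writes

    CardBarrier(long heapBytes) {
        cardTable = new byte[(int) (heapBytes >>> CARD_SHIFT) + 1];
    }

    // Called on every reference store: base.field = newVal.
    void writeBarrier(long baseAddr, boolean baseIsOldGen, boolean valIsYoungGen) {
        if (!baseIsOldGen || !valIsYoungGen)
            return;                           // no old->young pointer was created
        int card = (int) (baseAddr >>> CARD_SHIFT);
        if (cardTable[card] != DIRTY) {       // the filter: already-dirty cards skip
            cardTable[card] = DIRTY;          // the contended write entirely
            cardMarkStores++;
        }
    }
}
```

The point of the filter is visible in the counter: a hot old-gen object updated a million times costs one card-table store, not a million.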

David A Moon wrote: Since the GC is concurrent the time required to scan for pointers of interest to the GC shouldn’t slow down the application unless the application uses all the cores, except surely you run out of memory bandwidth eventually and then the GC’s memory scanning interferes with the application’s accesses to memory. So even with all that concurrency and massive processing power and memory bandwidth I would think you would still get a lot of benefit from doing less scanning because of the write barrier. But I haven’t measured it and you probably have. 

Cliff Click wrote: We’re definitely doing a generational collector, so we’d be using a software write-barrier if we didn’t have the hardware one handy. Or maybe: “We have a write-barrier. The fast-path is done in hardware. There is a slow-path done in software.” 

David A Moon wrote: I don’t know if this was published, but the write barrier on the Symbolics Ivory was really simple. Each page table word contains an encoding (in 4 bits?) of the age of the youngest generation pointed to by this page; the write barrier just causes a trap when those bits need to be updated. I don’t remember if we kept another index or just had the GC scan the page table. Since it’s an IBM-style inverted page table the page table is dense even when a lot of pages are swapped out so scanning it is not too slow. It would have worked better if the Ivory page size hadn’t been way too small. 

Did you ever look at Eliot Moss’s “train” garbage collection ideas to do ephemeral-GC-type optimization for long-lived data? If I remember correctly there is a bug: memory consumption can become arbitrarily large in the worst case, but the idea might still be good. I think Microsoft is using a form of it in .NET. 

Cliff Click wrote: HotSpot tried out the train collector years ago, and never got it to perform well in practice. There are a bunch of odd-ball issues (remembered sets with popular objects?) that made it not very performant. 

Cliff Click wrote: Things we do specifically for Java (continued):

  • Simple 3-adr RISC is easy to JIT for; we use HotSpot’s JIT and it makes quite good code – easily comparable to “gcc -O2”. [David: Yeah.]
  • Just-In-Time zero’ing of L1 cache for allocation – and we don’t issue a memory read for the cache-line getting zap’d to zeros. This saves about 1/3 of the bandwidth of the usual “store a bunch of zeros in a row” implementations.

David A Moon wrote: The dcbz instruction on POWER. Clearly a good idea. 

Cliff Click wrote: dcbz doesn’t make the grade:

  • it’s not a prefetch, so you can’t issue it far enough in advance without causing the stall you are trying to cover, and
  • any Java lock that happens will invoke a fence, which will also stall;
  • it does, however, avoid the memory read bandwidth.

Our version gets all of the above right. 

David A Moon wrote: Dedicated register for the allocation pointer?

Cliff Click wrote: No. The main issue with allocation is avoiding the mandatory cache-miss as you stream through memory. You can hide tons of dumb int-ops and cache-hitting loads underneath that cache-miss. So we pay a little bit of integer cycles to set up our J-I-T-zeroing, and getting the allocation pointer is one of those cycles. In short: it’s too small pickings to be worth dedicating a register to. 

Cliff Click wrote: Things we do specifically for Java (continued):

  • GC bits in each pointer (old-gen, young-gen, stack-allocated, normal-C-ptr) [David: I remember that now. Good.]
  • Java type-hierarchy bits in each pointer. This saves a trip to memory to fetch the Java type when making virtual calls.

David A Moon wrote: So you’re saying that since 64 bits is way too many, you can use a bunch of those bits to put a rather wide tag in each pointer? Maybe 20 bits wide? Seems like a great idea. No problem with having too many classes defined and running out of bits? 

Cliff Click wrote: Yes, wide tag per pointer. No problem (yet) with running out of classes. Big Java Apps these days seem to have about 2^15 classes. 
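[Ed: A purely illustrative sketch of packing a wide class-id tag into the spare high bits of a 64-bit pointer. The bit layout here is invented for the example and is not Azul’s actual pointer format:]

```java
// Model a tagged pointer as a long: low 40 bits of address, then a 20-bit
// class-id tag. (Real layouts also carry GC state bits, per the list above.)
public class TaggedPtr {
    static final int  ADDR_BITS  = 40;
    static final long ADDR_MASK  = (1L << ADDR_BITS) - 1;
    static final int  CLASS_BITS = 20;
    static final long CLASS_MASK = (1L << CLASS_BITS) - 1;

    static long encode(long addr, int classId) {
        return ((long) (classId & (int) CLASS_MASK) << ADDR_BITS)
             | (addr & ADDR_MASK);
    }
    static long addr(long ptr)    { return ptr & ADDR_MASK; }
    static int  classId(long ptr) { return (int) ((ptr >>> ADDR_BITS) & CLASS_MASK); }
}
```

With 2^20 tag values available and big apps using around 2^15 classes, there is comfortable headroom, which matches Cliff’s “no problem (yet)” above.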

Daniel Weinreb wrote: I am curious about how you do threads. On the Lisp machine, the stack had to be contiguous in virtual memory. So whenever we had to allocate one for a new thread, we had to trade off between wasting VM versus worrying about running out of stack. I don’t know if we ever considered making the stack such that it could be a linked list of (big) blocks, but I suppose anything that slows down the function call path is very dicey. Dave, did you ever think about this? It would have been nice to have cheap full coroutines (“stack groups”). 

Cliff Click wrote: Nope – we’re using the fairly classic stack layout for HotSpot. And yes this means you have to decide up front the stack-vs-heap memory split. 

Daniel Weinreb wrote: Cliff, have you heard anything about the possibility that the JVM will be extended to be able to do tail-calling? Rich Hickey says that there was recently a “JVM summit” (something like that) of people writing non-Java languages for the JVM, and there was a lot of interest in tail-calling. It would be great for the Lisp world, since Scheme depends on it so much; it would be nice to re-unite the Scheme and Lisp communities as much as possible. 

Cliff Click wrote: I’m well aware of the interest in tail-calling optimizations – but it has to compete for scarce engineering resources like everything else. 

Cliff Click wrote: Things we do specifically for Java (continued):

  • Really big interconnect & memory bandwidth. Java has lousy locality, so things miss in cache – a lot.

David A Moon wrote: Would that be true even with 4 times larger caches? 

Cliff Click wrote: I think so. We did a bunch of studies on cache-sizes, looking at fewer-cores-more-cache vs more-cores-less-cache. 

Cliff Click wrote: Things we do specifically for Java (continued):

  • Support for 2 outstanding memory misses per cpu, up to 24 per L2 cache counting prefetches – so 2304 outstanding memory ops across the whole box. [David: “Non-blocking caches are good.”]
  • Minor tweaks to help do Java-specific math: array addressing is sign-extend, THEN shift-and-add – which we do in 1 alu op. Also we have a Java range-check op, faster virtual-call support, the IEEE754 subset needed for Java FP, etc. [David: Good.]

David A Moon wrote: Do you have any kind of specialized cache to avoid memory references to find the address of a virtual method to call? I haven’t measured it but I get the sense that all those not very predictable indirections through vtbls really hurt C++ speed. 

Cliff Click wrote: Yes… specialized hardware, not specialized cache. 

We do the standard HotSpot “inline cache” trick that’s done on all CPUs. 

The cache is a 1-entry cache, inlined in the code. The Key is the expected class of the object, the Value is the target of the function call encoded as a plain “CALL” instruction. So you do a Key-compare inline then a plain static call. If the Key compare hits, the hardware will Do The Right Thing with the static call. In practice, 95% of all call-sites (that can’t be optimized by the JIT) never fail the 1-entry cache. The other 5% tend to fail for having lots and lots of targets. 

So there IS a “cache”, but it’s not specialized. It’s the normal HotSpot cache. Instead, we have an instruction to do the Key-compare in 1 cycle (and the call is another cycle). So we do cache-hitting v-calls in 2 clks & 2 ops. On a wide-issue O-O-O X86 that same sequence is maybe 4 dependent ops but also issues in 2 or 3 clks – except that loading the object header is typically a cache-miss; but the branch predicts right and the O-O-O hardware lets the X86 proceed despite the miss. 
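[Ed: A software model of the 1-entry inline cache just described – the inline Key compare, the direct call on a hit, and the slow-path re-fill on a miss. The class and method names are invented for illustration:]

```java
import java.util.function.Function;

// Each call-site carries one cached (expected class -> resolved target) pair.
public class InlineCacheDemo {
    static int hits = 0, misses = 0;          // instrumentation for the demo

    static class Site {
        Class<?> key;                         // Key: expected receiver class
        Function<Object, String> target;      // Value: resolved call target
    }

    static String dispatch(Site site, Object receiver) {
        if (receiver.getClass() == site.key) { // the 1-cycle Key compare
            hits++;
            return site.target.apply(receiver); // "plain static call"
        }
        misses++;                              // slow path: full virtual lookup,
        site.key = receiver.getClass();        // then re-fill the 1-entry cache
        site.target = r -> r.toString();       // stand-in for vtable resolution
        return site.target.apply(receiver);
    }
}
```

Monomorphic sites (the 95% case above) pay one miss to fill the cache and hit forever after; only megamorphic sites keep falling into the slow path.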

Cliff Click wrote: Things we do specifically for Java (continued):

  • Simple hardware transactional memory, used to speculate through Java locks.

David A Moon wrote: I’d be interested in reading more about that. 

Cliff Click wrote: We have a white-paper here: http://www.azulsystems.com/products/whitepaper/wp_otc.pdf 

Cliff Click wrote: Ok, I’m done boasting. 🙂 

David A Moon wrote: It’s plenty worth boasting about. 

–Dave Moon 

Cliff Click wrote: Thanks, Cliff 

11/20/2008 – A little more: 

David A Moon wrote: Then I guess your thread scheduler is not aware of the partitioning of the hardware and doesn’t try to put related threads onto cores in the same chip. 

Cliff Click wrote: The thread scheduler is well aware of the hardware layout. It’s no mean feat to schedule 800+ cpus with 1000’s of runnable threads. 

But nobody has a clue what counts as “related threads”. Certainly there’s no Java-level hint. You might imagine doing L2-to-L2 miss-rate profiling to try to decide which sets of threads are communicating through memory instead of through an L2. It’s a non-trivial task, for which there are some non-trivial academic papers out there showing modest gains for lots of hard work. 

David A Moon wrote: But you said your memory access time is mediocre. That’s the first symptom of the uniform memory illusion breaking down. Then you find that no matter how big you make your caches they aren’t big enough. Then you find that each new generation has slower memory, as a ratio of memory access time to processor speed. Then you find that programs are too slow unless they are written with detailed knowledge of the hardware memory structure. You probably know way better than me how far we are down that slope today. 

Cliff Click wrote: Split out the notions of ‘uniform’ from ‘fast’. Everybody pays attention to caches (or should): the mindset is “memory is slow“. But only some people pay attention to memory layout: the mindset is “memory is uniform“. 

For some parallel scientific programs with well-understood access patterns, people can build both the machine and the program such that most accesses are “near” the CPU. In such a case it makes sense to build a NUMA machine: “near” accesses are faster than “far” accesses. 

Azul builds its gear as ‘UMA’ because our programs do not have well-understood access patterns. Instead, the patterns are mostly random (after cache filtering), so it makes sense to have uniformly mediocre speeds instead of 1/16th of memory fast and 15/16ths slow. 


Some Source Forge Threads on NBHM

A couple of interesting threads running on high-scale-lib’s forums I thought I’d post to a wider audience.



Josh Dybnis (jdybnis) – 2008-11-10 11:46
I wrote an implementation of the non-blocking hash map in C. I put it up at http://code.google.com/p/nbds/

Cliff Click (cliffclick, Project Admin) – 2008-11-10 17:59
I never know what to make of such statements –


– if the table has “almost no synchronization” then it’s not non-blocking (but it might be highly concurrent). Non-blocking has a specific meaning to the academic community; if you say “non-blocking” you should also be able to say “no synch”. If you are based on NBHM then you need to provide e.g. a non-blocking malloc and a way to reclaim storage. Both are things NBHM uses GC for, and both are non-trivial to get non-blocking. NBHM does indeed do very little allocation – so the blocking embedded in “malloc” will probably never hurt you.


– No “C” program can ever implement any reasonable concurrent program, because “C” has no memory model (and the proposed model is a major punt and not much better than nothing). Instead you can implement NBHM using “C with gcc_ver_XXX on X86_model_YYY”. If you change the “XXX” or the “YYY”, then all bets are off, and the program can fail. In short, the “volatile” keyword in C does not have nearly the same semantics as it does in Java, and C compilers routinely do things with “volatile” variables that Java does not allow. Changing the C compiler version or the underlying hardware can break things.  


Despite my complaints, a C version of NBHM on X86 is surely a very useful and non-trivial thing.

Josh Dybnis (jdybnis) – 2008-11-11 01:12
Thanks for the feedback. I really appreciate you sharing the algorithm. It’s one of the most elegant pieces of work I’ve come across in a few years. 


– In theory, I agree 100% with you about writing concurrent programs in C. And people need to be clear about the limitations before attempting such a thing. In practice I think it is reasonable to do “porting” work for each platform/compiler. My implementation is targeted to gcc 4.x on 64-bit x86 (Intel does specify a memory model in their Software Developer’s Manual). I think it could be ported to most other architectures without major modifications.


– My implementation includes a non-blocking malloc() and a method to do safe memory reclamation. 


– My statement about “almost no synchronization” was badly stated. I was referring to my malloc implementation and not the NBHM itself. I was also using “synchronization” to refer to x86 operations having the “lock” prefix, not to any non-blocking property of the program.


– A couple more caveats about my malloc and the safe memory reclamation: they aren’t the most robust implementations, and need a bit of work. But they are non-blocking, and should be very low overhead on systems without too many cores (say, fewer than 32). The malloc does call mmap() from multiple threads, so it could block in there. So I guess technically it is not “non-blocking”, but that is an artifact of the implementation. In theory one could use an asynchronous method of invoking the kernel to make it truly non-blocking. 




By: Cliff Click (cliffclick, Project Admin) – 2008-11-11 07:41
How do you know when e.g. the NBHM Big Array is free?

By: Josh Dybnis (jdybnis) – 2008-11-12 17:06
>How do you know when e.g. the NBHM Big Array is free? 
Missed this earlier. 
After I promote a new array the old one gets queued up for a deferred free. (The same with keys in the old array that are tombstoned and were not copied to the new array.) The frees actually happen at a later point, when all the threads are guaranteed not to be holding references to the memory. I accomplish this using a technique from RCU. Each thread that might be holding a reference periodically announces that it is in a quiescent state, and I have a way of detecting when every thread has made such an announcement. At that point I can safely free the array. My implementation of all that is not particularly robust: if a thread blocks and never announces it is in a quiescent state, nothing more will ever get freed. This is straightforward to fix with a more complex implementation. I’m having an on-going discussion about it on comp.programming.threads. http://groups.google.com/group/comp.programming.threads/browse_thread/thread/702aedc141b34fa1#
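[Ed: A single-threaded toy model of the quiescent-state scheme Josh describes – each thread periodically announces quiescence, and a deferred free runs only after every thread has announced since the free was queued. This is a sketch of the idea only; a real version needs per-thread counters read with proper memory ordering, and is not Josh’s nbds code:]

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

public class QsbrDemo {
    final long[] qs;                                  // per-thread quiescence counters
    final ArrayDeque<Object[]> deferred = new ArrayDeque<>(); // {object, counter snapshot}
    final List<Object> freed = new ArrayList<>();     // stand-in for actually freeing

    QsbrDemo(int nThreads) { qs = new long[nThreads]; }

    // Queue an object for deferred free, snapshotting the current counters.
    void deferFree(Object o) { deferred.add(new Object[]{ o, qs.clone() }); }

    // Thread 'thread' announces it currently holds no references.
    void quiesce(int thread) {
        qs[thread]++;
        // Free queued objects once EVERY thread has quiesced since they were queued.
        while (!deferred.isEmpty()) {
            long[] snap = (long[]) deferred.peek()[1];
            for (int t = 0; t < qs.length; t++)
                if (qs[t] <= snap[t]) return;         // thread t may still hold a ref
            freed.add(deferred.poll()[0]);
        }
    }
}
```

The failure mode Josh mentions is visible here too: a thread that never calls quiesce() stalls the whole deferred-free queue forever.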


Also, one could use GC in a C program too. Or one of the safe memory reclamation techniques documented in the academic literature (e.g. hazard pointers, SMR, etc.). 

By: Cliff Click (cliffclick, Project Admin) – 2008-11-12 17:21
Sounds good. Just was wondering.
Lack-of-GC generally brings about various broken attempts to free memory in concurrent programs.



By: Cliff Click (cliffclick, Project Admin) – 2008-11-11 13:52
> In theory, I agree 100% with you about writing concurrent programs in C….. I think it could be ported to most other architectures without major modifications. 

Itanium. 🙂


By: Josh Dybnis (jdybnis) – 2008-11-11 19:47
Yeah, Itanium – and the Alpha would be even worse. It is your NBHM algorithm though; I don’t think there is anything in it dependent on having a strong memory model. You would just have to put memory barriers in the right places. I’m not saying that it would be easy to figure out the optimal solution, but it wouldn’t require any major rework of the code. The lazy way to do it would be to change all the CAS calls to include a full memory fence. Doing it that way would constrain performance of course, but it would be correct, and probably still more scalable than a traditional lock-based hash table.

By: Cliff Click (cliffclick, Project Admin) – 2008-11-11 21:25
The NBHM State Machine is “correct” independent of any memory model. I still need fences & a memory model around the promotion-logic & table-copy areas. It’s here where an IA64 would cause you grief. If you didn’t need table-copy (i.e., a pre-made fixed-size table), then it would be correct on any shared-memory machine. 

The NBHM as presented has stronger semantics than a minimalistic non-blocking hash table – Keys are treated with Java volatile semantics, so e.g. you can make & install a fresh Key on one thread, and have a 2nd thread see the Key from the table and also see the Key’s contents, so it can correctly run a Key compare. A naive C IA64 implementation can screw up there, although X86’s strong ordering saves it. The same issue happens with the returned Value (made in Thr1, returned in Thr2 – but does Thr2 see the initialized contents of the Value?). 

Here’s a sample IA64 table-copy screwup:

  • T1 copies the last K/V from the old table to the new table.
  • T1 incrs the count of copied values. Bug: no ordering between the two stores.
  • T2 sees the full count (a read of T1’s 2nd store),
  • T2 promotes (stores the new table as the default),
  • T3 sees the new table (a read of T2’s store),
  • T3 probes for K, but does not see T1’s store,
  • T3 reports K as not-in-table, when in fact it is.
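[Ed: In Java terms the fix is to publish the copied-slot count with release semantics, so the K/V store is ordered before it – e.g. via an AtomicInteger. A minimal sketch with invented names, not the actual NBHM code:]

```java
import java.util.concurrent.atomic.AtomicInteger;

// The plain store of the copied entry happens-before the AtomicInteger
// increment (a release), so any thread that observes the full count (an
// acquire) is guaranteed to also see every copied entry – closing the
// T1/T2/T3 window above.
public class CopyProgress {
    final Object[] newTable;
    final AtomicInteger copyDone = new AtomicInteger();

    public CopyProgress(int size) { newTable = new Object[size]; }

    public void copySlot(int i, Object kv) {
        newTable[i] = kv;             // step 1: plain store of the copied entry
        copyDone.incrementAndGet();   // step 2: publishes step 1 to other threads
    }

    public boolean allCopied() {      // safe to promote the new table once true
        return copyDone.get() == newTable.length;
    }
}
```

On X86 the plain stores already arrive in order, which is why the bug only bites on weakly-ordered machines like IA64.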




By: Josh Dybnis (jdybnis) – 2008-11-12 16:36
What happens if a thread copies a slot and then dies before updating _copyDone? Does the copy never get promoted?

By: Cliff Click (cliffclick, Project Admin) – 2008-11-12 16:39
The fall-back case is that some thread (all threads?) scans the entire array and discovers that all things are copied, despite any failed counts. At this point the “discovering” thread can promote. 


By: Josh Dybnis (jdybnis) – 2008-11-12 17:15
I’m missing where that is in the implementation. I see where a thread enters panic-mode and does the scan, but it looks like it always compares its workdone with _copyDone and the size of the array. So if every slot is copied, but a thread didn’t get around to updating _copyDone, none of the panicking threads will ever terminate their scans.

By: Cliff Click (cliffclick, Project Admin) – 2008-11-12 17:23
Welcome to the difference between algorithm and implementation. 🙂

Feel free to post a fix. It should be a 1-liner.

By: Josh Dybnis (jdybnis) – 2008-11-12 17:44
Heh. I have a similar bug in my implementation.




compareAndSwap memory fence (New)

Andrew Trick (andrewtrick) – 2008-11-12 11:16
My interpretation of your slide-set/blog comments is that the NBHM state machine–ignore table copying for now–does not require any memory fences or store ordering. CAS_key, CAS_val gets the job done as long as your machine has atomic compare-exchange. Correct?
In fact, in the NBHM source you bypass util.concurrent.atomic, calling Unsafe.compareAndSwapXXX presumably to avoid the memory fence. However, the Sun JDK and HotSpot assume that Unsafe.compareAndSwapXXX enforces a full memory fence. So would it be fair to say that you’ve made Azul-specific changes to the JDK/JVM to optimize NBHM? Should Sun get rid of the full-fence semantics of Unsafe.compareAndSwap before more Java/JDK code begins to rely on it?

By: Cliff Click (cliffclick, Project Admin) – 2008-11-12 11:20
Yes, I need no fencing (ignoring table copy) for some kind of hashtable semantics. You DO need very carefully placed fences to get CHM semantics – which amount to ‘volatile’ semantics on values used in put & get calls.
I bypassed the Atomic classes to get the no-fencing CAS. The Unsafe calls specifically have weak, no-fencing versions. On X86 & Sparc you are stuck with the fencing anyway, but on Azul the weak no-fence version actually does not fence.
So I did not do any Azul-specific changes to the JDK here (Azul’s JDK IS different from Sun’s – but not here).

Andrew Trick (andrewtrick) – 2008-11-12 19:05
Thanks for the reply and clarification.
The JDK does indeed have AtomicReference.weakCompareAndSet. Sorry to be pedantic about API details but… What I meant by “JDK assumes a fence” is that AtomicReference.compareAndSet for Sun is identical to weakCompareAndSet, so the underlying Unsafe.compareAndSwapObject must have a fence. What I meant by “HotSpot assumes a fence” is that the Unsafe.compareAndSwap intrinsic generates abstract memory barriers, which you would have to materialize as a real barrier on any machine that doesn’t have implicit CAS fences.
But really I was just trying to understand the intention behind your NBHM source, which is pretty clear now. It’s not important whether some careful JVM porting is needed to make it fast on certain h/w.



By: Cliff Click (cliffclick, Project Admin) – 2008-11-12 21:35
– sun.misc.Unsafe does not specify any memory ordering properties for compareAndSet.
– AtomicReference.compareAndSet is doc’d to have volatile semantics, and operates on a volatile variable.
– AtomicReference.weakCompareAndSet is doc’d to NOT have volatile semantics.
A legit JDK implementation could then
– not fence for sun.misc.Unsafe
– but yes fence around volatile ops, including Unsafe ops on volatiles.
This is exactly the intent of the docs & implementation, and it specifically allows Azul to not fence on sun.misc.Unsafe & still meet the spec – which we do (on both fronts: not-fence and meet-the-spec). We also typically do not suffer from other people’s bad assumptions here – where they expect Unsafe to fence and are surprised when it doesn’t. I suspect Unsafe.CAS users are a rare breed.
Removing the fence for Unsafe ops is a tidy win for Azul on some highly parallel contention-heavy codes.
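[Ed: Java 9’s VarHandle API, which postdates this thread, makes the fenced-vs-unfenced distinction explicit without resorting to sun.misc.Unsafe – compareAndSet has volatile semantics, while weakCompareAndSetPlain is atomic but carries no ordering guarantees (and may fail spuriously):]

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class CasFlavors {
    volatile Object slot;
    static final VarHandle SLOT;
    static {
        try {
            SLOT = MethodHandles.lookup()
                    .findVarHandle(CasFlavors.class, "slot", Object.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    boolean casFenced(Object expect, Object set) {
        return SLOT.compareAndSet(this, expect, set);          // volatile semantics
    }
    boolean casPlain(Object expect, Object set) {
        return SLOT.weakCompareAndSetPlain(this, expect, set); // atomic, unordered,
    }                                                          // may spuriously fail
}
```

On X86 both compile to the same lock cmpxchg; on machines where CAS and fencing are separable, the plain variant is the cheaper op Cliff describes.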




That’s all,