Interesting conversation(s) with Doug Lea

I had a set of long email exchanges with Doug Lea, reprinted here (with permission).  Lightly edited for clarity and to remove the rare NDA-only material.  I re-sorted several of the interleaved conversations to make them flow better – but the original emails are highly interwoven.

 


We are discussing the merits of a no-fence CAS instruction – and exposing this instruction at the Java level.  Since Intel does not have a no-fence CAS, this whole idea is a no-op for Intel.  Right now there is a weak spurious-fail-allowed CAS in the Unsafe to mimic load-linked/store-conditional ops, to get buy-in from IBM, whose chips do LL/SC instead of CAS.  Doug wants the fence-less CAS to allow spurious failure, and I do not, because it causes me to add retry loops in places.

Cliff: 

Our only ordering constraint is that data-dependent loads are ordered. For volatile stores we have to st/st fence (and ld/st fence – one instruction for both).  For volatile loads we have to ld/ld & ld/st fence.
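(ed: added during blogging – in later-JDK terms, the mapping above looks roughly like the sketch below.  The VarHandle fence methods (Java 9+) post-date this exchange and are only stand-ins; the sketch shows exactly the fences named above and nothing more.)

  import java.lang.invoke.VarHandle;

  class FenceMappingSketch {     // illustrative only; not Azul or HotSpot code
    int x;                       // pretend this is the volatile field; fences written out by hand

    void volatileStore(int v) {
      VarHandle.releaseFence();  // st/st + ld/st: prior loads/stores may not sink below the store
      x = v;
    }

    int volatileLoad() {
      int v = x;
      VarHandle.acquireFence();  // ld/ld + ld/st: later loads/stores may not float above the load
      return v;
    }
  }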

Doug:

Not sure I buy the completely fenceless concept.  Without at least a st/st, what keeps a compiler from pushing all the calls in a loop to after the loop? Or even before the loop?

Cliff:  
Are you asking: with a fence-less CAS, why cannot the compiler just push it around willy-nilly?

Example before optimizing:
  while( long_time ) {
    ...stuff...
    no_fence_CAS(_counter,_counter+1); // increment perf counter with no fence
  }

 

Example after optimizing:
  while( long_time ) {
    ...stuff...
  }

  no_fence_CAS(_counter,_counter+long_time); // increment perf counter with no fence

Which is a pretty reasonable hack for performance counter work.

 

My no-fence-CAS-for-perf counters can be done on Intel with a ‘locked:add’ – except that locked:add has more ordering behavior than perf counters need.

Doug

If you think this is OK, then I believe you.  But I was thinking that in 99% of usages you’d like the counters to be updated inside the loop.  (If only because you lose big if that one trailing CAS fails.)

Cliff

Perf counters, not ‘unique id generation’. Perf counters have the property that they need to be statistically ‘close’ to reality, but not spot-on.  If you just do a volatile-variable ‘vol++’ you drop lots of counts under high load.  Easily 99% of the counts.  So you need to do an atomic update, or intel locked::add.  But there’s no requirement for ordering or perfection even.  Dropping 1% of the adds is fine.  You just need to not drop 99% of the counts.  For setting one-shot monotonic flags, it’s really the case that I need ordering, but I’ll get ordering elsewhere.
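(ed: added during blogging – a sketch of such a “statistically close” counter in Java, using AtomicLong’s weak CAS (Java 9+) as the nearest stand-in for the no-fence, no-retry CAS being discussed.  A failed attempt simply drops that one count.)

  import java.util.concurrent.atomic.AtomicLong;

  class LeakyPerfCounter {                     // illustrative sketch, not the code under discussion
    private final AtomicLong count = new AtomicLong();

    void bump() {
      long c = count.get();
      count.weakCompareAndSetPlain(c, c + 1);  // one shot, no retry loop; may drop a count under contention
    }

    long approx() { return count.get(); }      // “close to reality”, not exact
  }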

Doug

OK. If using CAS, you need a loop, which is unmovable anyway so fences don’t buy you anything. So fenceless CAS is OK here. But so is fenceless weak CAS. Which still keeps me thinking that pairing fenceless with spurious-fail is fine? 

Cliff

Fencing  – Fences cost hugely more than a run-once or run-twice loop.  Any time I can say “no fencing please” I want to.

Spinning – Under high load, my perf counters will fail their CAS once – 2nd time it’s hot in cache.  So I *should* spin my perf-counters and not fence them, but I rarely do (*some* I do, depends on context).  In the miles and miles of JIT’d C1 code, all those perf counters I do not spin-loop, I just lose the counts.  Too much i-cache pressure, too much slow-down, not enough gain.  They are rarely *highly* contended (except in a context where I’ll shortly C2 JIT anyways). 

 

Hot-locks I always CAS-loop; the ‘loser’ of the lock has cycles to spare because he can’t get the lock anyways.  He might as well tough out a cache-miss and bump the counter proper before going to sleep.

Spurious Fail – Spurious fail costs me a loop anytime I need a guaranteed CAS result (so not perf counters) and a 1-shot no-fence CAS would have done the trick – which is mostly limited to 1-shot inits.  Turns out I have lots of those; every method that can be called from compiled code needs a “C2I” adapter – these are racily made in parallel from all threads hitting this path at once (can be hundreds if a whole thread pool switches to a new task flavor) and they all do a 1-shot no-fence CAS.   Losing C2I adapters are reclaimed by GC later.  After that CAS they do a volatile read of the same field and jump to the address loaded.

 

I do C2I adapters this way, plus MethodStubBlobs (replacing HotSpot’s ICHolders), plus I2C adapters, plus CodeProfiles, plus native wrappers, plus… in short, I got lots of 1-shot inits that I racily make in parallel, 1-shot CAS install, then re-read with a volatile read.  Since I’m generally only after the data-dependent results of the 1st read, my ‘volatile read’ fence can nearly always be a NOP (except perhaps on Alpha or IA64). 
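(ed: added during blogging – the racy 1-shot install pattern, sketched in Java with AtomicReference.  The thing being built is just a placeholder for C2I adapters, CodeProfiles and friends.)

  import java.util.concurrent.atomic.AtomicReference;
  import java.util.function.Supplier;

  class OneShotInstall<T> {                    // illustrative sketch of the pattern, not Azul VM code
    private final AtomicReference<T> slot = new AtomicReference<>();

    T getOrInstall(Supplier<T> build) {
      T cur = slot.get();                      // volatile read; usually already installed
      if (cur != null) return cur;
      T candidate = build.get();               // every racing thread may build one
      slot.compareAndSet(null, candidate);     // 1-shot CAS; losers' copies become garbage
      return slot.get();                       // volatile re-read: everyone uses whoever won
    }
  }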

 

Now, this is stuff that’s all in Azul’s HotSpot and not written in Java code so maybe it can be taken with a grain of salt…. although NBHM clearly had places where a 1-shot no-fence no-spurious-fail CAS would do the trick, but I was stuck with either looping or fencing from the Unsafe spec.  I chose looping, but I could never test the failure backedge because I never ran on a system where the hardware CAS would actually fail (i.e. I never ran on a Power system with LL/SC).

Doug

Then why not always JIT on CAS failure?

Cliff

’cause I’d have to use a CAS & test the result.  Right now I do a plain ld/add/st.  I’m talking C1 JIT’ed profiling here… once the counts hit 10000 I kick in C2.  If C1 is dropping counts for being highly racey & under load, I’ll still hit 10,000 real quick – I’ll just lose some sense of *relative* hotness between different leaky counters.

 

But for Java Atomic::addXXX flavors, yes, the Right Thing To Do is to re-JIT on CAS failure.  I’ll have to look and see what’s being emitted by C2.  I suspect the loop backedge is listed as being rare but not non-zero.  I probably should do a Heroic Optimization thingy and make the backedge zero until the CAS fails, then record the failure & re-JIT.  Not hard, but not zero work either.  Hummm…. it won’t really help the code either, because that re-try loop doesn’t hurt any other optimizations really.  It’s not like I’m really removing any interesting code from the processor’s hot-path, nor removing any optimization-blockers from the compiler.

Doug

*Spurious Fail* – I’m still skeptical about this, but I suppose it wouldn’t hurt to include compareAndSetInRelaxedOrder to calm you down about it  🙂

 


We roll into a discussion of Fork-Join behavior and leaky compile-trigger counters

Doug

“If C1 is dropping counts for being highly racey & under load, I’ll still hit 10,000 real quick – I’ll just lose some sense of *relative* hotness between different leaky counters. ”  I’m not so sure that these work out as well as you think.  I see surprising patterns in when various methods are compiled in FJ.  If you are interested in exploring some of the effects here, grab current jsr166y.jar

 

http://gee.cs.oswego.edu/dl/jsr166/dist/jsr166y.jar

And run either of the attached simple fork/join programs with +PrintCompilation or whatever set —

  java Integrate NCPUS  {d, f}

  java LeftSpineFib NCPUS 45 5

For Azul, probably 96 is a good test. As in “java Integrate 96 f” (the “d” is for dynamic partitioning, which is usually about 2X faster, but also often shows different compilation patterns. Dunno why).

Cliff

For Integrate: the #1 hot method at 85% of all ticks is “Integrate$FQuad.recEval(DDDDD)D”.  It has no inlining, nor would any inlining decision change help.  Here’s the profile with a 40Gig heap.  Note that you’re allocating 4Gig/sec here, so GC remains an interesting cost.  All other inlining decisions do not matter – they have essentially no impact on the ‘score’.  So I claim that dropping counts here makes no difference.  You have an extremely simple & dominant profile.  I’d get the same code for the hot method if I dropped 1% of the counts or 90% of the counts.  There’s a lot of FP math in there, almost no control flow, some allocations, some fences that run in 1 clk (so I suspect they are final-field fences).

  Percent    Ticks   Source
   86.1%   225,836   Integrate$FQuad.recEval
   12.8%    33,435   new_tlab (vm stub code)
    0.4%     1,083   SharedRuntime__new_heap (vm code)
    0.2%       603   VM_JavaCode (vm code)
    0.5%     1,187   Other

 

For LeftSpineFib the story is different.  There are 2 hot methods, LeftSpineFib.compute and .seqFib. I got them inlined in one run and not in another.  When not inlined, there’s a lot of allocation and performance is ~1.3 sec per iteration.  I only got them inlined that one time, and that run appeared to be faster and to allocate very little.

 

Nonetheless, there appears to be only 1 interesting inlining decision to be made here. All other decisions are “chump change” – I only need to get this one decision right.  I’m not clear what the issues are without a full-blown debugging session.  It would help if I could make LeftSpineFib run longer (say 10’s of minutes) – it’s all over in a minute or less so it’s hard to get a good profile, or look at the compiler’s inlining decisions.  It’s clear from the profiles that the compiler *ought* to do the right inlining: call site counts are in the 100K’s within a few seconds.

Doug:

Happy you could replicate at least with this one.  You can make this run longer with larger values of fib (2nd argument). Try a larger value than 45. It is 1.6X slower for each added value. So maybe 49 is good. The answers will be wrong because of int overflow past 46 though.

Cliff:

Switch to Long math – same speed on Azul & x86/64

Cliff

You are a tragic victim of profiling a recursive function.  I don’t get separate profiles for each layer of the recursive depth, and C2 hasn’t a (real) clue how many times to inline.  Recursive functions are not common In Real Life.  So some static heuristics kick in after a while.  If you are desperate to prove that non-leaky counters will change something here (they won’t, your profile is highly regular) you can rename the recursive functions seqFcn_1, seqFcn_2, etc. for as many layers as you get bored.  Then I’ll get separate profiles for each layer and I claim C2 will inline “enough”.
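(ed: added during blogging – the renaming trick on a toy fib.  Each “layer” becomes a distinct method, so the JIT collects a separate profile per layer and can decide per layer how far to inline.  Pure illustration; the real seqFib lives in LeftSpineFib.)

  class SeqFibPeeled {             // illustrative only
    static long fib_1(int n) { return n <= 1 ? n : fib_2(n - 1) + fib_2(n - 2); }
    static long fib_2(int n) { return n <= 1 ? n : fib_3(n - 1) + fib_3(n - 2); }
    static long fib_3(int n) { return n <= 1 ? n : fib_3(n - 1) + fib_3(n - 2); }  // deeper layers fold together
  }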

Doug

“Recursive functions are not common In Real Life”.  That could change if FJ gets used a lot to parallelize loops and aggregate operations (as Clojure already does). So you might want to think about this before it happens?  It’s reasonably far down the priorities though – much more important are things like the loop dispatch problems we discussed a few times.

 


A quick diversion into a bug which turns out to be mine.

Cliff:

fyi… when I run in various hotspot debug/slow modes I can trigger a situation where the FJ framework spawns threads endlessly.  I generally kill it after I get 10,000 or more threads…

Doug

For which of the programs do you see this? Any guesses about how I can try to replicate?

Cliff:

java Integrate 24 f – and a debug 64-bit HS build

Doug

I can’t replicate using the current fastdebug build at

http://download.java.net/jdk7/binaries/

on 4X2way AMD or 2Xi7 Intel. I guess I should try to build my own debug build. 

(Aside: they seem to have changed layout of fastdebug releases without changing instructions. I wasted an hour or so before just moving fastdebug .so’s into jre/lib/amd64/server for a fresh install, which seemed to work. Sigh.)

Doug:

I just did (current openjdk sources) and still can’t replicate on machines here?  When you did this, were the #threads lines printing large values, or did you just observe a lot of threads? Any other hints?

Cliff:

Lots of threads being created, and no progress on the app front.

Doug:

I take this seriously because there are some delicate invariants here wrt thread counts, which I’ve convinced myself are right, but informal proofs are no match for reality. But I’m currently thinking this is some debug-build hotspot oddity.

Cliff:

But this is with my build which itself has some interesting lock/wait/notify designs.  Maybe I can send you some stack traces of the thread that is busy making threads.

 

Cliff

[I later find the bug in Azul’s locking; related to an unfortunately timed lock deflation caused by rapid GC cycles.  Fortunately it’s an engineering-prototype only bug.]

 


We roll into a conversation about *when* to JIT.  Or really: Doug is asking for special JIT treatment of ‘master’ threads which control the launch of other threads, and I’m claiming he’s fallen victim to an unknown performance bug.

Doug:

While on the subject of compilation triggers, here’s a reminder of another old topic: That counters aren’t too useful in detecting need for compiling code that may not be invoked all that often, but when it is, enables a bunch of other threads to run. So speeding it up by compiling can have dramatic impact on aggregate throughput. The main examples are code with notify{All}, unpark, as well as many uses of simple CASes  (to release spinning waiters). It would be swell if code with such calls got some extra compilation points.

 

This hits especially hard in FJ, where ramp-up time to wake up all the workers can take longer than the actual computation. To improve things a little, I manually inline accessors, Unsafe calls, etc., in some of that code so that it runs enough faster even when interpreted to make a noticeable difference. If you are curious, the main points are ForkJoinPool signalWork and preJoin. For some explanation, see the internal documentation at

http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/jsr166y/ForkJoinPool.java?view=log

Cliff:

This doesn’t make any sense – unless the counters are leaking badly.  After 10,000 hits, it gets compiled.  Assuming non-leaky counters and 100 threads, a mere 100 trips through each and you get C2 compiled.

 

“I manually inline accessors, Unsafe calls, etc” – I saw all these methods get JIT’d fairly early on, and after JIT’ing they disappeared from the (TXU/Vega) profile.  

 

(ed: Cliff Notes – added during blogging and not in the original email exchange – I think it a colossal JIT failure that Doug Lea is hand inlining; and now I am personally determined to get to the bottom of this)

Doug

Azul hotspot is then apparently tracking counts better than Sun hotspot?

Cliff:

Perhaps.  Also, if 1000 threads are runnable on a 4-core X86 and the thread priorities are not properly set up and honored, then the JIT threads have to compete with the runnables and typically lose badly.  This was always a bad issue with VolanoMark until we finally got thread priorities set right.

Doug:

Hopefully people don’t do that — parallelism should normally be approx #hw threads.

Cliff:

Sorry to disappoint.  I’ve seen:

– Fixed.  Always much smaller than what an Azul box needs. 

– Small constant * num_procs (small=1 to 4).  Always much larger than what the code can scale to on an Azul box. 

– More threads added as external events happen (or not).  Frequently causes endless thread-create. 

– Rarely, threads scale till either the limit of CPUs or load or locking events dictate.

 

A couple of big popular app-servers either do it right or can be tuned. 

Many home-grown solutions fall over here.

Cliff:

Now, suppose instead we have leaky counters, and the nature of FJ is to make the leakiness much worse (because the threads all bunch up in the same code at the same time).  Suppose that of the 100 threads only 10% (or 10 ‘counts’) stick.  Then it takes 10 times longer to reach the point where we start JIT’ing.

Doug:

Well it does still make sense regardless of leaks  🙂 

It is possible for a single slow path that releases many threads to arbitrarily slow down throughput, and your counters would not even be able to tell. And sometimes not only possible but common. 

Cliff:

No?  It should not be possible.  After 10000 *total* trips through the code, across all threads, it should JIT.  So either you don’t hit 10000 trips through the code – which means the max possible slowdown is 10000*(interpreted cost per trip – JIT’d cost per trip) – which will be measured in milliseconds – or else you hit more than 10000 total trips and it gets JIT’d.

Doug:

But you have to apply Amdahl’s law here — the other threads don’t run at all until woken up. Which can be roughly approximated by multiplying your slowdown calculation by #worker threads, which is normally #hw threads.

Cliff:

But… the slow wake-up thread only has to do 10000 wake-up events, then it’s JIT’d.  So 10000 times, threads are slow to wake up – costing an extra few micros each – so total app time stalled is 0.01 to 1 sec.

 

If you’re still seeing slow wakeups then there’s something else going on.  I’m totally willing to believe in stampede issues or dropped count or slow/missing wakeups or etc… but not lack of JIT’ing…. unless Sun has a bug here that Azul does not.  But I claim it’s a straightforward bug and not anything that needs special treatment.

Doug:

The wakeups normally happen only a few times per pool submission (i.e., the root of a tree of recursive tasks). So waiting for 10000 is not very nice. Knowing this, I invoke some of the wakeup code in places that hardly ever actually need to wakeup threads but at least cause that code to compile sooner.

Cliff:

But again, why can this ever matter?  Total time the app is slowed down is <1sec.  The initial task is slow to get kicked off… but not that slow, and all the remaining threads run/launch at JIT’d speed.

Doug:

I suppose that conceptually you could counteract this by increasing count by #CPUS instead of by 1 when you see notify/ unpark/ CAS?
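(ed: added during blogging – Doug’s idea as a rough sketch.  None of this is real HotSpot code; the class and fields are made up purely to show the weighting.)

  class MethodProfileSketch {                  // hypothetical; real counters live inside the VM
    int invocationCount;
    final boolean releasesWaiters;             // true if the method contains notify/unpark/CAS
    MethodProfileSketch(boolean releasesWaiters) { this.releasesWaiters = releasesWaiters; }

    void bump(int ncpus) {
      invocationCount += releasesWaiters ? ncpus : 1;  // bump by #CPUs so “release” code compiles sooner
    }
  }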

 

More generally, slow-wakeups (not just because of not being compiled, but because unpark can be slow) is sometimes the biggest problem I see in FJ. It is easy to check that this is so by just commenting out park calls and letting them spin. Which of course I can’t do in production code. There are a lot of little things I did in a big refactoring pass a few months ago that each help this a little, so it is not currently a huge problem, but everything I can do will help a little more.

Cliff:

Yeah – here we’re talking Real Meat.  Park can be slow – or not – at the whim of a lot of factors utterly unrelated to the timing of JIT’ing.  The OS can screw up, the JVM’s lock-fast-path can screw up, the JIT can fail to inline, the notify/wait/wakeup path can screw up, hypervisors can screw up (you are *not* trying this under a hypervisor, right? – no chance to get it right under VMware until it’s Right without it), etc.  Also, spinning beats Parking if you’ve got CPUs to spare.

 

Really I suspect you’re seeing something not related to the lack-of-timely-JIT’ing.  I totally can believe the hand-inlining stuff, etc is paying off… but it’s not because the basic design needs tweaking.  There’s something else busted.

Doug:

One place to wonder about is interaction with GC. FJ spews garbage, but it is for the most part “nice” garbage. Typically 99% of tasks are not stolen, so die quickly, so there is a lot of early-generation/nursery level GC.  Relying on generational GC rather than expensively manually managing memory for tasks is the main reason we can outperform cilk and other C/C++ workstealing frameworks. But in exchange we sometimes do get GC stalls at very inconvenient times, that lead to postponing setting status fields, counts, or performing wakeups, all of which could cause other threads to continue.  Which suggests finding ways of discouraging GC at some of these points. Which I guess amounts to reducing non-inlined method calls or back-branches in a few critical places. Probably the manual inlining etc I already do helps here, but I should probably try to pinpoint other cases.

Cliff:

Interesting… also there are interactions with non-STW GCs.  In a STW GC, everything halts except time (so System.CTM & CTN roll on but no progress in work, so heuristics get wedged).  In a non-STW GC, threads halt more independently & randomly (but hopefully for shorter time periods).

Doug:

(Also, on thinking what I wrote about spins vs park yesterday, I’m led to try a few more adaptive-spin constructions.)

 


Now we’re looking at hacking algorithms to better fit the GC (Doug’s choice) or hacking the GC to better fit the algorithm (my choice).

Doug:

While I’m at it: One reason GC is a little slower than it ought to be is that I initially size the workstealing array spaces to 8192 (which is MUCH bigger than they typically need to be — 64 is always enough for plain strict FJ programs) to avoid cardmark contention (plus plain cache contention) for writes of tasks into the slots. On x86 especially, eliminating this contention much more than makes up for increased useless GC scanning costs, but probably hurts on Azul (it is on net about even on Niagara).

Cliff:

Nah, just for You, because You’re Special, we (Azul) filter cardmarks.  🙂

 

On Azul hardware, the hardware filters between generations – so only storing a young-gen ptr into an old-gen object goes into the card-mark slow-path.  In that path we do a read/test/store.  Repeatedly marking the same card only does (cache-hitting) loads, no stores.
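(ed: added during blogging – the read/test/store filter in Java-ish pseudocode.  In reality this is VM write-barrier code; the 512-byte card size and the “offset from heap base” addressing are assumed, typical values, not necessarily Azul’s.)

  class FilteredCardMarkSketch {               // illustrative only
    static final byte DIRTY = 1;
    static final int CARD_SHIFT = 9;           // 512-byte cards (assumed, typical value)
    final byte[] cardTable;

    FilteredCardMarkSketch(int numCards) { cardTable = new byte[numCards]; }

    void markCard(long heapOffset) {           // offset of the stored-into field from heap base (assumption)
      int card = (int) (heapOffset >>> CARD_SHIFT);
      if (cardTable[card] != DIRTY)            // repeat marks: a cache-hitting load + test, no store
        cardTable[card] = DIRTY;               // only the first mark dirties the card (and its cache line)
    }
  }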

Doug:

G1 supposedly reduces cardmark contention too, but it still is much too slow to use for FJ. 

It would be nice to have some sort of #ifdef for initial sizing.  (See that P-Ray mail for a prod of other kinds of config information I’d like to have available…)

Cliff:

I *claim* (without a shred of evidence) that our GC will blow away G1 and not need any configs.  At least, we’ve never needed any GC configs on Azul gear.  Really we claim it’s a bug if the GC doesn’t just “Do The Right Thing” right out of the box.

Doug:

But this case (sizing array of refs to 8K vs 64) is already a workaround of sorts — trading slower GC scan times for less contention. On VMs that don’t have these cardmark contention problems, I’d rather not have users pay the price of slower GC and bigger footprint for no benefit.

Cliff:

Ahhh, you are changing algorithms to help with different kinds of GC.  I *claim* (again without experimental evidence) that Azul’s GC is utterly immune here and doesn’t care if you pick 8K or 64.  Certainly an extra 8K vs 64 ref array isn’t going to matter.  And we filter cardmarks.

 


We get closer to the issue with Doug’s thread-launch-rate problem

Cliff:

Also… I see the high allocation rate on the FJ programs you’ve given me, but nothing at all about the various slow wakeup issues.  After the first 2 or 3 numbers report back (in 2 or 3 seconds), then “It’s All Fast”.  Everybody is working hard, etc.  This is with both a native Sun HotSpot “-d64 -server” and Azul.  Can you send me a program which shows off the bad behavior on a plain Sun HS?  What exact gear shows this, which rev of HS, etc?

Doug:

This was harder than it seemed — all those small improvements I did a few months ago have made some of the clear-cut cases I had in mind less clear-cut on x86. Some remain clear-cut on Niagara though, which I’m now guessing has more to do with GC/safepoint checks than I had thought (less inlining on sparc -> more safepoint checks -> more GC where I don’t want it?)

Cliff:

Known bug/issue w/Solaris – thread starvation.  Suppose you launch 32 threads on a 32-way Niagara.  Java threads of course.  The concurrent GC spawns a happy 20 more VM threads, plus some JIT threads.  Now your 32 workers take all the CPUs, and the GC “panics” because the initial heap is small and the initial allocation rate is high – so it both bumps the heap AND spawns emergency full-speed background GC work.  The poor JIT threads starve awhile until GC gets a bigger heap from the OS and stops panic’ing….  Happened to us on VolanoMark all the time, till we fixed both the JIT threads’ priority (they MUST make progress against the Java threads) AND the GC heuristics.  Could totally be a Sun/Solaris thread priority bug here.

Doug:

Playing around with this, I think it might play some part:  On Niagaras, I get faster compilation using a few less than #hw-threads workers in pool. Not a huge effect though.  I just saw a bigger effect with some exploratory hacks to more often avoid wait/notify mechanics in join. See the internal documentation in ForkJoinTask about why I need to use these rather than raw park —

http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/jsr166y/ForkJoinTask.java?view=log

So it may be the case that I wasn’t avoiding bias etc. effects as much as I thought I was. And that I’m seeing them more on Niagaras because these machines run with a lot more hw-threads -> parallelism level than my x86s (128 for Niagara2 vs 16 for my biggest x86). This deserves more experimentation.

 

All in all these are in keeping with my sense that there is not one single big reason for sometimes-slow ramp-up, but lots of smaller effects. Many of which I can actually do something about, which is always better than feeling like I’m at the mercy of you JVM-level folks 🙂  (Albeit with occasionally sleazy practices like the above array sizing.)

Doug:

Anyway, I’ll try exploring this a bit more before trying to conclude much about it. In the mean time, I attach a few more tests in case you feel like looking into this more.

Cliff:

In retrospect… [ed. and some days later] see if the issue is that the launching/controlling/master thread is going to sleep.  Common OS situation: you just fired up 17 threads; they all *start* on the same CPU before spreading out to their (idle CPU) neighbors.  Along the way the master thread gets kicked off for a while by a worker and doesn’t get a chance to keep launching threads.  Azul paid close attention to this issue: we get 100’s of threads running in a handful of millis – common enough in various HPC-style barrier-compute-barrier apps – but it’s hard work to get right.

 

Doug:

Been there. Pretty much under control. Here’s a paste of the main explanation:

     * 4. Managing idle workers waiting for tasks….

…The net effect is a tree-like diffusion of signals, where released threads (and possibly others) help with unparks.

 


Now we’re into biased-locking effects.

Cliff:

Naturally we’ve completely re-done locks so I suspect your hacks are all for naught. 

 

I think you’re looking at performance-unstable algorithms.  i.e., performance is very sensitive to conditions that you haven’t nailed down yet.  Things like whether or not some particular lock inflates early or late (more overhead after inflation if a thin-lock would have worked), or a badly timed GC stalls some threads while others get to run on, or your work load becomes unbalanced, or sometimes a shared queue gets hot in a shared L2 and CPUs get tasks fast – vs it starts cache-missing, so all CPUs complete their jobs & need the queue, so the queue ping-pongs and remains slow, etc.

 

I can totally arrange scenarios where things like queue/shared-cache placement makes a 10x performance difference.  Sometimes the queue is where it needs to be when it’s needed, and sometimes not.  I don’t have any easy fix for such algorithm sensitivities, except via careful discovery of the “performance potholes” then designing to avoid them.

 

All HS thin-locks auto-inflate as you add waiters; why do you want a special coding hack to avoid that?

Doug:

Because there is a higher-than-we’d-like probability that the signal will come before the wait.  The main unhappy case is where the null-signaller reserves the lock, so the waiter (completely uselessly!!) waits for GC or some unbias point.

Cliff:

Huh?  ‘signal’ meaning the ‘notify’?  as in: you are sending a ‘notify’ to a lock with no waiters, so the notify is dropped on the floor? I guess I’m not getting it.

 

BTW, I dropped the ‘displaced header’ mechanism, and also the ‘thin lock’ state.  Azul locks have these states:

 

– never locked

– biased to a thread

– inflated monitor

– – monitor is biased (cost is same as ‘biased+one indirection’)

– – monitor is unlocked & will count recursive locks

– – monitor is locked & counting recursive locks (cost is same as ‘thin-lock+one indirection’)

– – monitor is biased & the bias is mid-revoke (cost involves safepointing 1 thread & crawling 1 thread stack)

 

Wait’ing inflates a lock & then lines up on the wait’ers list for the monitor. 

Notify Does The Right Thing, which for a non-inflated lock is a no-op.

 

I could make Notifies to biased-locks auto-inflate under the assumption that the notify is racing with a Wait’er.

Doug:

Yes, autoinflating on any notify would help here, and strikes me as rarely being a bad idea in general.

Cliff:

Noted.

Doug:

When waiting for pending join a worker does:

 

1. Decide whether to wake up or create spare thread

2. CAS in task state bits requesting signal

3. Do some other bookkeeping and counter updates

4. get task synch lock and Object.wait for task state to change
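(ed: added during blogging – a stripped-down sketch of the window Doug is describing.  The field and method names are made up; this is not the real ForkJoinTask code.  The gap between step 2’s CAS and step 4’s wait() is where the completer’s notifyAll can arrive before anyone is actually waiting.)

  import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

  class JoinSketch {                             // illustrative only
    static final int SIGNAL = 1, DONE = 2;
    volatile int status;                         // 0 = running
    static final AtomicIntegerFieldUpdater<JoinSketch> STATUS =
        AtomicIntegerFieldUpdater.newUpdater(JoinSketch.class, "status");

    // waiter side (steps 2 and 4 above)
    void awaitDone() throws InterruptedException {
      int s = status;
      if ((s & DONE) == 0) STATUS.compareAndSet(this, s, s | SIGNAL);  // step 2: request a signal
      synchronized (this) {                                            // step 4
        while ((status & DONE) == 0)
          wait();                                // re-check loop survives a notify that raced ahead
      }
    }

    // completer side: publish DONE, then notifyAll only if a signal was requested
    void complete() {
      int s;
      do { s = status; } while (!STATUS.compareAndSet(this, s, s | DONE));
      if ((s & SIGNAL) != 0)
        synchronized (this) { notifyAll(); }     // may run before the waiter ever calls wait()
    }
  }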

Cliff:

So it’s a missed-notify by design? Perhaps you could flip the task state bits under the task sync lock?

Doug:

I used to do something like this. It flips bias to the soon-to-be waiting thread, which is even worse.

Cliff:

why is it worse? For Azul, the call to wait() forces inflation.  And now it sounds to me like you want wait’rs to spin a bit, to see if a quick notify is coming along.  Also, inflation is a 1-shot cost per-lock. Are you constantly locking new objects (hence the cost to inflate is added to the original allocation cost)?

Doug:

Yes, each join is waited on using different task object, and 99+% of time, only one join per task.

Cliff:

Blah.  Unlike park(), this will mean making a Monitor object per join, and reclaiming them later.

Cliff:

Also, if you are using missed-notifies by design, do you want to do timed waits?  That way if you miss, your stuck waiter will eventually get to try again.
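(ed: added during blogging – the two suggestions combined in one sketch: spin briefly before blocking, then use a timed wait so a missed notify costs one timeout instead of a hang.  The ‘done’ flag and the spin budget are placeholders, not anything in the real code.)

  class SpinThenTimedWait {                      // illustrative only
    volatile boolean done;

    void awaitDone() throws InterruptedException {
      int spins = 64;                            // arbitrary spin budget (assumption)
      while (!done && spins-- > 0)
        Thread.onSpinWait();                     // Java 9+ hint; cheap if the notify is imminent
      synchronized (this) {
        while (!done)
          wait(1L);                              // 1 ms timed wait: a missed notify costs a timeout, not forever
      }
    }

    void signalDone() {
      done = true;
      synchronized (this) { notifyAll(); }
    }
  }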

Doug:

I don’t know why I hadn’t checked this lately (for many months) but comparing -XX:-UseBiasedLocking shows that I definitely am getting hurt by this in at least some applications — as much as 20% aggregate throughput differences on a couple tests depending on thresholding and spinning details in preJoin.

Doug:

On the other side (a thread completing a task being joined), it is possible and not all that uncommon to see the signal request from (2) above, and thus do notifyAll in a sync block before the waiting thread even waits. When this happens, it can cause lock bias to kick in in favor of the signaller, so the waiter cannot even get the lock to do the wait. Annoying.

Cliff

This sounds like a Sun-ism.  Azul BiasedLocks can flip owners without too much overhead and we do it on a lock-by-lock basis. For Sun, revoking the bias on an object is quite expensive.  It’s cheaper on Azul, but I dunno if it’s cheap enough. Definitely if a wait OR notify hits an object I should immediately inflate and skip the bias/revoke-bias steps.

 

Doug:

I’ve considered several times avoiding all these lock anti-optimizations by making my own monitor-cache-like mechanism (and then using plain park/unpark), but have never actually tried it. I probably ought to, to see if the lookup etc. overhead can be made small enough that I can stop relying on lock mechanics that can vary so much.  Seems unlikely (among other issues, I’d need weak-ref based lookup, which amounts to trading one set of uncontrollable weirdnesses for another  🙂)

 

It is probably better to continue pushing on smarter adaptive spinning (as in notes a few days ago) so that unwanted lock effects become even less frequent.

 

All in all, it is probably time to stop trying to cope with interactions with VM locking heuristics and to explore some schemes avoiding sync locks. I just thought of one that also avoids weak-refs (which is what I was most worried about) at the expense of a bunch of code bulk, which might get me in trouble wrt inlining and compile thresholds etc again, but worth a shot.

 

HPC folks, even cilk, usually don’t even bother suspending threads in these cases. They own the machines, so are content to allow spins for even many seconds. We can’t do that. FJ does though support helpJoin, which does non-suspending helping joins that process other tasks while waiting for the target to complete. This works great so long as you have “strict” computations (pure trees, no nested parallelism). That LeftSpineFib test uses this, and has much lower variance across runs of top-level tasks. Too bad that layered APIs like ParallelArray or Clojure/Scala/X10/java-with-closures frameworks can’t usually use it, because they don’t know the shape of the computations.

 


Topic change again…

Doug:

When using FJ for parallel forms of apply, map, reduce, etc on arrays and collections, we’d like to have some idea of how much computation per element there is. As in coll.apply(addOneToEachElement) would only spawn subtasks if a lot of elements (and cores), but coll.apply(findNextMoveInGame) might always do so. It seems that performance counters could provide some help here if they were exposed in some way to library code. Without this I can still do OK in many cases by using queue-sensing etc to determine dynamic thresholds. The only big losses come in the extreme cases where you don’t want to partition at all (like add one to 10 elements), which would be nice to avoid.

 

(In essence, counters are an approximation of WCETs used for similar purposes for RT stuff.)

 

Any thoughts on plausibility? If you think this might go anywhere, I’d probably want to drag in some Sun hotspot and IBM j9 folks because we’d want to at least semi-standardize it.

Cliff:

You could sample via nanoTime execution.  You could trend the last 10 (100, 1000?) nanoTime execs pretty cheaply.

 

You could also waffle the chunk size up & down, profiling as you go.  Average-time-per-chunk seems low?  Then increase chunk-size.  Seems high (and CPUs are available)?  Then make more, smaller chunks.  Etc.
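(ed: added during blogging – a toy version of the “waffle the chunk size” idea.  processChunk(), the target time and the starting size are all placeholders, not anything from FJ or ParallelArray.)

  class AdaptiveChunkerSketch {                  // illustrative only
    static final long TARGET_NANOS = 100_000;    // aim for ~100 microseconds of work per chunk (assumed)
    int chunkSize = 1024;                        // arbitrary starting point

    void processChunk(int[] data, int from, int to) { /* per-element work goes here */ }

    void run(int[] data) {
      int i = 0;
      while (i < data.length) {
        int end = Math.min(i + chunkSize, data.length);
        long t0 = System.nanoTime();
        processChunk(data, i, end);
        long elapsed = System.nanoTime() - t0;
        if (elapsed < TARGET_NANOS / 2)       chunkSize = Math.min(chunkSize * 2, data.length);  // too cheap: bigger chunks
        else if (elapsed > TARGET_NANOS * 2)  chunkSize = Math.max(chunkSize / 2, 1);            // too costly: more, smaller chunks
        i = end;
      }
    }
  }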

 

(this conversation is ongoing, and currently Doug is “it”.)

 

Cliff