What is a wait-free algorithm? Wikipedia has a nice article on it. There’s the obvious definition: all threads complete the job in a finite number of steps – where a “step” is some unit of work granted a thread by having the OS give the thread some time-slice on a real CPU.
Wait-free isn’t interesting for loop-free (and not recursive) programs – there’s only a finite number of steps in a finite piece of straight-line code. However, wait-free is often used to describe multi-threaded programs – which often communicate between each other with some sort of atomic-update machine instruction (frequently a CAS instruction but sometimes it’s LL/SC and sometimes something else).
I.e. – most (all?) interesting wait-free algorithms rely on special atomic instructions (you always can write wait-free programs without any kind of atomic-update instruction but these tend to be really slow). Both CAS & LL/SC will fail if their initial conditions change and the “fix” for failing is usually to loop and try again.
To re-cap: wait-free algorithms in the real world rely on fragile failable machine instructions – and the failure is often fixed by looping while retrying. To remain a wait-free algorithm, the algorithm must *somehow* know the all loops are of finite length – i.e., failure is finite. This is often a reasonable assumption – under low load most machines’ atomic-update instructions mostly work. But now let us posit an adversarial OS.
The definition of wait-free doesn’t mention an Operating System – and yet real programs running on real hardware have to deal with a real OS. Programs often more threads than CPUs and one of the OS’s jobs is to schedule the threads – grant them “steps” on the CPU to make forward progress. The OS chooses when a thread gets CPU cycles (steps) and for how long. Typically OS’s attempt to be fair and efficient, granting all threads an equal number of steps in a round-robin fashion. However, the OS isn’t required to be completely fair and typically they are not (in the name of efficiency). A wait-free program can make forward progress in the face of an OS that is not being nice, i.e. adversarial OS.
Let’s assume the adversarial OS has to be at least slightly fair: it can’t deny cycles to a thread forever. Indeed, if the program runs forever the the OS must hand out an infinite number of cycles to all threads (the unfair adversary isn’t very interesting: the OS can simply choose to not hand out cycles some threads, or even to any thread – ala a hard-crash). What kind of damage can this fair adversarial OS do to our algorithms?
For starters, the OS can guarantee that no contended LL/SC ever succeeds! Its easy: the OS schedules threads right up until they execute the LL; then that thread isn’t allowed cycles again until another thread also executes an LL to the same address. For all known implementations of LL, this will cause the 1st thread to break the “link” part of load-linked. The first thread is then allowed cycles until it tries another LL. The 2nd thread is preempted at once of course, since it also just executed an LL. Hence the OS ends up allowing all threads an infinite number of cycles while guaranteeing no LL/SC ever succeeds.
What does this all mean? It means in theory (since the adversarial OS is a theoretic notion) that IBM, Alpha & MIPS are screwed: they cannot use their primary atomic-update operation to write any algorithm without dragging in the OS. i.e., no one can write even plain locking algorithms on their hardware unless you also can show that the OS is at least slightly not-adversarial. Truly random time-slices would work – but fixed-cycle time slicing will be adversarial for algorithms which are “beating” in tune with the time slice.
It’s a little worse than that: the “load” part of LL is a … well, a load. A contended load. It will miss in cache and thus be very slow, with a larger-than-normal cycle count attributed to it. The random-slice OS is slightly more likely to context switch immediately after the missing load than at other points in the program. Of course, the context switch gives other CPUs a huge chance to request the same address – which means that the poor loser thread is both going to lose this LL/SC but also take another cache miss when it retries (and thus have another greater chance of a context switch).
This is all very theoretical: OS writers try hard to make their OS’s not adversarial and exponential backoff/retry loops work great (or worked great on the 4-cpu PowerPC machines I used years ago). But you had to do it when you used LL/SC or you would get CPUs live-locked spinning on LL/SC.
How does CAS fair in the face of the fair adversarial OS? All hardware docs I could find claimed that a CAS does succeed if the initial conditions are correct. According to the docs, assuming that multiple threads attempt to update the same address via CAS using the same initial load then exactly one should succeed. Here then the adversary cannot stop at least one thread from making forward progress. This means that CAS can at least be used to make lock-free algorithms. The OS can arrange for some loser threads to always lose CAS races, by preempting them out after they load the initial CAS conditions and not granting them cycles until after the winner thread does a successful CAS update to the same location. In other words, you can’t use CAS directly to write wait-free algorithms (you always write wait-free algorithms without either CAS or LL/SC, although they tend to be brutally slow).
Now here’s another interesting bit: I distinctly recall using machines where CAS would sometimes fail for all threads even when everybody used the same correct initial conditions. Indeed, I want to claim it was true on early multi-socket Intel machines – but I cannot find any written documentation to support my memory! My fuzzy recollection was that since Intel didn’t build the motherboards, Intel would not guarantee forward progress on contended CAS’s – which had to do an external R/M/W cycle and the correctness of which all depended on the motherboard manufacturer. Indeed it seems to me that unless a CPU designer also designs the details of how CAS interacts across chip boundaries, then that CPU cannot make forward progress guarantees about CAS. I.e., like IBM has to claim the OS running on their gear is not a fair adversarial OS in order to make LL/SC useful, so also Intel has to claim that motherboards running their chips implement a strong version of the R/M/W cycle to make CAS useful.
Ok, this is all stretching the point a great deal. In practice these things work well- at least on smaller CPU count machines. But CPU counts are on the rise, and the issues mentioned here become less and less academic with each passing year. On an Azul 864-way machine, if all threads are attempting to CAS the same location in a tight loop Azul promises you exactly one succeeds. But also I can promise you that at least one CPU will perpetually fail unless you invoke some sort of fair-lock (which we do in the OS; Java threads can’t quit write CAS loops that are so tight as to prevent all forward progress to one CPU whereas hand-written asm in the OS did). I wonder how well a 1000-cpu LL/SC machine will fair on such loops and how much exponential backoff will be required (and how much inefficiency will that cause)?
Or maybe the answer is: on some future 1million-cpu machine, if you have to CAS all CPUs on the same cache line then you are screwed: performance is so bad it’s indistinguishable from a crash; you’ll hit ^C and try again. Hence we all end up learning how to write programs which don’t do that… I’ll let you know if I get it figured out… 🙂