|
- .TL
- Process Sleep and Wakeup on a Shared-memory Multiprocessor
- .AU
- Rob Pike
- Dave Presotto
- Ken Thompson
- Gerard Holzmann
- .sp
- rob,presotto,ken,gerard@plan9.bell-labs.com
- .AB
- .FS
- Appeared in a slightly different form in
- .I
- Proceedings of the Spring 1991 EurOpen Conference,
- .R
- Tromsø, Norway, 1991, pp. 161-166.
- .FE
- The problem of enabling a `sleeping' process on a shared-memory multiprocessor
- is a difficult one, especially if the process is to be awakened by an interrupt-time
- event. We present here the code
- for sleep and wakeup primitives that we use in our multiprocessor system.
- The code has been exercised by years of active use and by a verification
- system.
- .AE
- .LP
- Our problem is to synchronise processes on a symmetric shared-memory multiprocessor.
- Processes suspend execution, or
- .I sleep,
- while awaiting an enabling event such as an I/O interrupt.
- When the event occurs, the process is issued a
- .I wakeup
- to resume its execution.
- During these events, other processes may be running and other interrupts
- occurring on other processors.
- .LP
- More specifically, we wish to implement subroutines called
- .CW sleep ,
- callable by a process to relinquish control of its current processor,
- and
- .CW wakeup ,
- callable by another process or an interrupt to resume the execution
- of a suspended process.
- The calling conventions of these subroutines will remain unspecified
- for the moment.
- .LP
- We assume the processors have an atomic test-and-set or equivalent
- operation but no other synchronisation method. Also, we assume interrupts
- can occur on any processor at any time, except on a processor that has
- locally inhibited them.
- .LP
- The problem is the generalisation to a multiprocessor of a familiar
- and well-understood uniprocessor problem. It may be reduced to a
- uniprocessor problem by using a global test-and-set to serialise the
- sleeps and wakeups,
- which is equivalent to synchronising through a monitor.
- For performance and cleanliness, however,
- we prefer to allow the interrupt handling and process control to be multiprocessed.
- .LP
- Our attempts to solve the sleep/wakeup problem in Plan 9
- [Pik90]
- prompted this paper.
- We implemented solutions several times over several months and each
- time convinced ourselves \(em wrongly \(em they were correct.
- Multiprocessor algorithms can be
- difficult to prove correct by inspection, and formal reasoning about them
- is impractical. We finally developed an algorithm we trust by
- verifying our code using an
- empirical testing tool.
- We present that code here, along with some comments about the process by
- which it was designed.
- .SH
- History
- .LP
- Since processes in Plan 9 and the UNIX
- system have similar structure and properties, one might ask if
- UNIX
- .CW sleep
- and
- .CW wakeup
- [Bac86]
- could not easily be adapted from their standard uniprocessor implementation
- to our multiprocessor needs.
- The short answer is no.
- .LP
- The
- UNIX
- routines
- take as argument a single global address
- that serves as a unique
- identifier to connect the wakeup with the appropriate process or processes.
- This has several inherent disadvantages.
- From the point of view of
- .CW sleep
- and
- .CW wakeup ,
- it is difficult to associate a data structure with an arbitrary address;
- the routines are unable to maintain a state variable recording the
- status of the event and processes.
- (The reverse is of course easy \(em we could
- require the address to point to a special data structure \(em
- but we are investigating
- UNIX
- .CW sleep
- and
- .CW wakeup ,
- not the code that calls them.)
- Also, multiple processes sleep `on' a given address, so
- .CW wakeup
- must enable them all, and let process scheduling determine which process
- actually benefits from the event.
- This is inefficient;
- a queueing mechanism would be preferable
- but, again, it is difficult to associate a queue with a general address.
- Moreover, the lack of state means that
- .CW sleep
- and
- .CW wakeup
- cannot know what the corresponding process (or interrupt) is doing;
- .CW sleep
- and
- .CW wakeup
- must be executed atomically.
- On a uniprocessor it suffices to disable interrupts during their
- execution.
- On a multiprocessor, however,
- most processors
- can inhibit interrupts only on the current processor,
- so while a process is executing
- .CW sleep
- the desired interrupt can come and go on another processor.
- If the wakeup is to be issued by another process, the problem is even harder.
- Some inter-process mutual exclusion mechanism must be used,
- which, yet again, is difficult to do without a way to communicate state.
- .LP
- In summary, to be useful on a multiprocessor,
- UNIX
- .CW sleep
- and
- .CW wakeup
- must either be made to run atomically on a single
- processor (such as by using a monitor)
- or they need a richer model for their communication.
- .SH
- The design
- .LP
- Consider the case of an interrupt waking up a sleeping process.
- (The other case, a process awakening a second process, is easier because
- atomicity can be achieved using an interlock.)
- The sleeping process is waiting for some event to occur, which may be
- modelled by a condition coming true.
- The condition could be just that the event has happened, or something
- more subtle such as a queue draining below some low-water mark.
- We represent the condition by a function of one
- argument of type
- .CW void* ;
- the code supporting the device generating the interrupts
- provides such a function to be used by
- .CW sleep
- and
- .CW wakeup
- to synchronise. The function returns
- .CW false
- if the event has not occurred, and
- .CW true
- some time after the event has occurred.
- The
- .CW sleep
- and
- .CW wakeup
- routines must, of course, work correctly if the
- event occurs while the process is executing
- .CW sleep .
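- .LP
- For instance, the low-water-mark condition mentioned above might be written as follows.
- This is only a sketch: the
- .CW Queue
- type, its fields, and the name
- .CW qdrained
- are hypothetical and are not taken from the Plan 9 source.

```c
#include <assert.h>

/* Hypothetical output queue; the names are illustrative only. */
typedef struct {
	int len;	/* bytes currently queued */
	int lowat;	/* low-water mark */
} Queue;

/*
 * Condition function in the form described above:
 * one void* argument, nonzero once the event has occurred.
 */
int
qdrained(void *arg)
{
	Queue *q = arg;

	return q->len < q->lowat;
}
```

- In a real driver the interrupt code would update
- .CW len
- under its own interlock; the sketch shows only the shape of the function.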
- .LP
- We assume that a particular call to
- .CW sleep
- corresponds to a particular call to
- .CW wakeup ,
- that is,
- at most one process is asleep waiting for a particular event.
- This can be guaranteed in the code that calls
- .CW sleep
- and
- .CW wakeup
- by appropriate interlocks.
- We also assume for the moment that there will be only one interrupt
- and that it may occur at any time, even before
- .CW sleep
- has been called.
- .LP
- For performance,
- we desire that multiple instances of
- .CW sleep
- and
- .CW wakeup
- may be running simultaneously on our multiprocessor.
- For example, a process calling
- .CW sleep
- to await a character from an input channel need not
- wait for another process to finish executing
- .CW sleep
- to await a disk block.
- At a finer level, we would like a process reading from one input channel
- to be able to execute
- .CW sleep
- in parallel with a process reading from another input channel.
- A standard approach to synchronisation is to interlock the channel `driver'
- so that only one process may be executing in the channel code at once.
- This method is clearly inadequate for our purposes; we need
- fine-grained synchronisation, and in particular to apply
- interlocks at the level of individual channels rather than at the level
- of the channel driver.
- .LP
- Our approach is to use an object called a
- .I rendezvous ,
- which is a data structure through which
- .CW sleep
- and
- .CW wakeup
- synchronise.
- (The similarly named construct in Ada is a control structure;
- ours is an unrelated data structure.)
- A rendezvous
- is allocated for each active source of events:
- one for each I/O channel,
- one for each end of a pipe, and so on.
- The rendezvous serves as an interlockable structure in which to record
- the state of the sleeping process, so that
- .CW sleep
- and
- .CW wakeup
- can communicate if the event happens before or while
- .CW sleep
- is executing.
- .LP
- Our design for
- .CW sleep
- is therefore a function
- .P1
- void sleep(Rendezvous *r, int (*condition)(void*), void *arg)
- .P2
- called by the sleeping process.
- The argument
- .CW r
- connects the call to
- .CW sleep
- with the call to
- .CW wakeup ,
- and is part of the data structure for the (say) device.
- The function
- .CW condition
- is described above;
- called with argument
- .CW arg ,
- it is used by
- .CW sleep
- to decide whether the event has occurred.
- .CW Wakeup
- has a simpler specification:
- .P1
- void wakeup(Rendezvous *r)
- .P2
- .CW Wakeup
- must be called after the condition has become true.
- .SH
- An implementation
- .LP
- The
- .CW Rendezvous
- data type is defined as
- .P1
- typedef struct{
- 	Lock l;
- 	Proc *p;
- }Rendezvous;
- .P2
- Our
- .CW Locks
- are test-and-set spin locks.
- The routine
- .CW lock(Lock\ *l)
- returns when the current process holds that lock;
- .CW unlock(Lock\ *l)
- releases the lock.
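- .LP
- As an illustration only, a test-and-set spin lock of this kind might be sketched in C11 as follows.
- The use of
- .CW <stdatomic.h> ,
- the field name
- .CW held ,
- and the helpers
- .CW trylock
- and
- .CW locktest
- are assumptions of the sketch, not the kernel's own code.

```c
#include <assert.h>
#include <stdatomic.h>

/* Sketch of a test-and-set spin lock; the field name is an assumption. */
typedef struct {
	atomic_flag held;	/* clear = free, set = held */
} Lock;

/* Hypothetical helper: one test-and-set attempt; returns 1 on success. */
int
trylock(Lock *l)
{
	return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Returns once the caller holds the lock. */
void
lock(Lock *l)
{
	while(!trylock(l))
		;	/* spin until the holder releases it */
}

void
unlock(Lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Single-threaded check of the lock's visible behaviour; returns 1. */
int
locktest(void)
{
	Lock l = { ATOMIC_FLAG_INIT };

	lock(&l);
	assert(!trylock(&l));	/* already held: the attempt must fail */
	unlock(&l);
	assert(trylock(&l));	/* free again: the attempt must succeed */
	unlock(&l);
	return 1;
}
```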
- .LP
- Here is our implementation of
- .CW sleep .
- Its details are discussed below.
- .CW Thisp
- is a pointer to the current process on the current processor.
- (Its value differs on each processor.)
- .P1
- void
- sleep(Rendezvous *r, int (*condition)(void*), void *arg)
- {
- 	int s;
- 	s = inhibit(); /* interrupts */
- 	lock(&r->l);
- 	/*
- 	 * if condition happened, never mind
- 	 */
- 	if((*condition)(arg)){
- 		unlock(&r->l);
- 		if(s)
- 			allow(); /* interrupts */
- 		return;
- 	}
- 	/*
- 	 * now we are committed to
- 	 * change state and call scheduler
- 	 */
- 	if(r->p)
- 		error("double sleep %d %d", r->p->pid, thisp->pid);
- 	thisp->state = Wakeme;
- 	r->p = thisp;
- 	unlock(&r->l);
- 	if(s)
- 		allow(); /* interrupts */
- 	sched(); /* relinquish CPU */
- }
- .P2
- .ne 3i
- Here is
- .CW wakeup .
- .P1
- void
- wakeup(Rendezvous *r)
- {
- 	Proc *p;
- 	int s;
- 	s = inhibit(); /* interrupts; return old state */
- 	lock(&r->l);
- 	p = r->p;
- 	if(p){
- 		r->p = 0;
- 		if(p->state != Wakeme)
- 			panic("wakeup: not Wakeme");
- 		ready(p);
- 	}
- 	unlock(&r->l);
- 	if(s)
- 		allow();
- }
- .P2
- .CW Sleep
- and
- .CW wakeup
- both begin by disabling interrupts
- and then locking the rendezvous structure.
- Because
- .CW wakeup
- may be called in an interrupt routine, the lock must be set only
- with interrupts disabled on the current processor,
- so that if the interrupt comes during
- .CW sleep
- it will occur only on a different processor;
- if it occurred on the processor executing
- .CW sleep ,
- the spin lock in
- .CW wakeup
- would hang forever.
- At the end of each routine, the lock is released and processor priority
- returned to its previous value.
- .CW Wakeup "" (
- needs to inhibit interrupts in case
- it is being called by a process;
- this is a no-op if called by an interrupt.)
- .LP
- .CW Sleep
- checks to see if the condition has become true, and returns if so.
- Otherwise the process posts its name in the rendezvous structure where
- .CW wakeup
- may find it, marks its state as waiting to be awakened
- (this is for error checking only) and goes to sleep by calling
- .CW sched() .
- The manipulation of the rendezvous structure is all done under the lock,
- and
- .CW wakeup
- only examines it under lock, so atomicity and mutual exclusion
- are guaranteed.
- .LP
- .CW Wakeup
- has a simpler job. When it is called, the condition has implicitly become true,
- so it locks the rendezvous, sees if a process is waiting, and readies it to run.
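- .LP
- The two orderings described above (event before sleep, and sleep before event) can be traced with a small single-threaded model.
- This is a sketch under stated assumptions, not the kernel code: the primitives are replaced by stubs,
- .CW sched
- merely counts its calls instead of switching processes, the routines are renamed
- .CW rsleep
- and
- .CW rwakeup
- to avoid clashing with the C library, and the states
- .CW Running
- and
- .CW Ready
- are invented for the model.

```c
#include <assert.h>
#include <stddef.h>

enum { Running, Wakeme, Ready };	/* only Wakeme appears in the text */

typedef struct { int held; } Lock;
typedef struct { int state; int nsched; } Proc;
typedef struct { Lock l; Proc *p; } Rendezvous;

static Proc proc0;		/* the single modelled process */
static Proc *thisp = &proc0;
static int intson = 1;		/* modelled interrupt-enable bit */
static int event;		/* nonzero once the modelled event has occurred */

/* Stubs standing in for the kernel primitives. */
static int inhibit(void) { int s = intson; intson = 0; return s; }
static void allow(void) { intson = 1; }
static void lock(Lock *l) { assert(!intson && !l->held); l->held = 1; }
static void unlock(Lock *l) { assert(l->held); l->held = 0; }
static void sched(void) { thisp->nsched++; }	/* a real sched() switches processes */
static void ready(Proc *p) { p->state = Ready; }
static int happened(void *arg) { return *(int*)arg; }

void
rsleep(Rendezvous *r, int (*condition)(void*), void *arg)
{
	int s = inhibit();

	lock(&r->l);
	if((*condition)(arg)){	/* event already happened: do not sleep */
		unlock(&r->l);
		if(s)
			allow();
		return;
	}
	assert(r->p == NULL);	/* at most one sleeper per rendezvous */
	thisp->state = Wakeme;
	r->p = thisp;
	unlock(&r->l);
	if(s)
		allow();
	sched();
}

void
rwakeup(Rendezvous *r)
{
	Proc *p;
	int s = inhibit();

	lock(&r->l);
	p = r->p;
	if(p){
		r->p = NULL;
		assert(p->state == Wakeme);
		ready(p);
	}
	unlock(&r->l);
	if(s)
		allow();
}

/* Event first, then sleep: returns the number of sched() calls. */
int
eventfirst(void)
{
	Rendezvous r = { {0}, NULL };

	proc0 = (Proc){ Running, 0 };
	event = 1;
	rwakeup(&r);			/* nobody is asleep yet: nothing to do */
	rsleep(&r, happened, &event);	/* condition true: returns at once */
	return proc0.nsched;
}

/* Sleep first, then event: returns the sleeper's final state. */
int
sleepfirst(void)
{
	Rendezvous r = { {0}, NULL };

	proc0 = (Proc){ Running, 0 };
	event = 0;
	rsleep(&r, happened, &event);	/* posts itself and calls sched() */
	event = 1;
	rwakeup(&r);			/* finds the sleeper and readies it */
	return proc0.state;
}
```

- In the first ordering the process never reaches
- .CW sched ;
- in the second it is left in the hypothetical
- .CW Ready
- state by the wakeup.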
- .SH
- Discussion
- .LP
- The synchronisation technique used here
- is similar to known methods, even as far back as Saltzer's thesis
- [Sal66].
- The code looks trivially correct in retrospect: all access to data structures is done
- under lock, and there is no place that things may get out of order.
- Nonetheless, it took us several iterations to arrive at the above
- implementation, because the things that
- .I can
- go wrong are often hard to see. We had four earlier implementations
- that were examined at great length and only found faulty when a new,
- different style of device or activity was added to the system.
- .LP
- .ne 3i
- Here, for example, is an incorrect implementation of
- .CW wakeup ,
- closely related to one of our versions.
- .P1
- void
- wakeup(Rendezvous *r)
- {
- 	Proc *p;
- 	int s;
- 	p = r->p;
- 	if(p){
- 		s = inhibit();
- 		lock(&r->l);
- 		r->p = 0;
- 		if(p->state != Wakeme)
- 			panic("wakeup: not Wakeme");
- 		ready(p);
- 		unlock(&r->l);
- 		if(s)
- 			allow();
- 	}
- }
- .P2
- The mistake is that the reading of
- .CW r->p
- may occur just as the other process calls
- .CW sleep ,
- so when the interrupt examines the structure it sees no one to wake up,
- and the sleeping process misses its wakeup.
- We wrote the code this way because we reasoned that the fetch
- .CW p
- .CW =
- .CW r->p
- was inherently atomic and need not be interlocked.
- The bug was found by examination when a new, very fast device
- was added to the system and sleeps and interrupts were closely overlapped.
- However, it was in the system for a couple of months without causing an error.
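- .LP
- The failing interleaving can be replayed step by step in a single thread.
- The sketch below is a model, not kernel code: the two processors are represented by statements executed in the order in which the race would occur, the lock is omitted because the faulty fetch ignores it, and the states
- .CW Running
- and
- .CW Ready
- and the name
- .CW lostwakeup
- are assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { Running, Wakeme, Ready };	/* only Wakeme appears in the text */

typedef struct { int state; } Proc;
typedef struct { Proc *p; } Rendezvous;	/* lock omitted in this model */

/*
 * Replay one interleaving of sleep against the faulty wakeup.
 * Returns the sleeper's final state.
 */
int
lostwakeup(void)
{
	Proc sleeper = { Running };
	Rendezvous r = { NULL };
	int event = 0;
	Proc *p;

	/* sleeper, holding the lock: the condition is still false */
	assert(!event);

	/* interrupt on another processor: the event occurs ... */
	event = 1;
	/* ... and the faulty wakeup fetches r->p without taking the lock */
	p = r.p;

	/* sleeper, still holding the lock: commits to sleeping */
	sleeper.state = Wakeme;
	r.p = &sleeper;

	/* interrupt: the stale fetch says nobody is waiting */
	if(p){
		r.p = NULL;
		p->state = Ready;
	}

	/* the event has come and gone, yet the sleeper is still waiting */
	return sleeper.state;
}
```

- The function returns
- .CW Wakeme :
- the process is recorded in the rendezvous, but the only wakeup has already passed it by.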
- .LP
- How many errors lurk in our supposedly correct implementation above?
- We would like a way to guarantee correctness; formal proofs are beyond
- our abilities when the subtleties of interrupts and multiprocessors are
- involved.
- With that in mind, the first three authors approached the last to see
- if his automated tool for checking protocols
- [Hol91]
- could be
- used to verify our new
- .CW sleep
- and
- .CW wakeup
- for correctness.
- The code was translated into the language for that system
- (with, unfortunately, no way of proving that the translation is itself correct)
- and validated by exhaustive simulation.
- .LP
- The validator found a bug.
- Under our assumption that there is only one interrupt, the bug cannot
- occur, but in the more general case of multiple interrupts synchronising
- through the same condition function and rendezvous,
- the process and interrupt can enter a peculiar state.
- A process may return from
- .CW sleep
- with the condition function false
- if there is a delay between
- the condition coming true and
- .CW wakeup
- being called,
- with the delay occurring
- just as the receiving process calls
- .CW sleep .
- The condition is now true, so that process returns immediately,
- does whatever is appropriate, and then (say) decides to call
- .CW sleep
- again. This time the condition is false, so it goes to sleep.
- The wakeup process then finds a sleeping process,
- and wakes it up, but the condition is now false.
- .LP
- There is an easy (and verified) solution: at the end of
- .CW sleep
- or after
- .CW sleep
- returns,
- if the condition is false, execute
- .CW sleep
- again. This re-execution cannot repeat; the second synchronisation is guaranteed
- to function under the external conditions we are supposing.
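- .LP
- One way to express the re-execution is a wrapper that calls
- .CW sleep
- again whenever it returns with the condition false.
- The following is a sketch only: the wrapper name
- .CW sleepuntil
- is hypothetical,
- .CW rsleep
- is a stub that imitates one stale wakeup rather than a real
- .CW sleep ,
- and the loop is more conservative than the single re-execution that the argument above requires.

```c
#include <assert.h>

typedef struct Rendezvous Rendezvous;	/* opaque in this sketch */

static int nsleeps;	/* how many times the stub was entered */

/*
 * Stub for sleep, for this sketch only: the first call models a stale
 * wakeup (it returns with the condition still false); the second models
 * the real event by making the condition true before returning.
 */
static void
rsleep(Rendezvous *r, int (*condition)(void*), void *arg)
{
	(void)r;
	(void)condition;
	if(++nsleeps == 2)
		*(int*)arg = 1;
}

/* Hypothetical wrapper: sleep again whenever the condition is found false. */
void
sleepuntil(Rendezvous *r, int (*condition)(void*), void *arg)
{
	do
		rsleep(r, condition, arg);
	while(!(*condition)(arg));
}

static int flagset(void *arg) { return *(int*)arg; }

/* Returns the number of sleeps needed to ride out one stale wakeup. */
int
stalewakeup(void)
{
	int flag = 0;

	nsleeps = 0;
	sleepuntil(0, flagset, &flag);
	return nsleeps;
}
```

- With the stub above the wrapper sleeps twice: once for the stale wakeup and once for the real event.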
- .LP
- Even though the original code is completely
- protected by interlocks and had been examined carefully by all of us
- and believed correct, it still had problems.
- It seems to us that some exhaustive automated analysis is
- required of multiprocessor algorithms to guarantee their safety.
- Our experience has confirmed that it is almost impossible to
- guarantee by inspection or simple testing the correctness
- of a multiprocessor algorithm. Testing can demonstrate the presence
- of bugs but not their absence
- [Dij72].
- .LP
- We close by claiming that the code above with
- the suggested modification passes all tests we have for correctness
- under the assumptions used in the validation.
- We would not, however, go so far as to claim that it is universally correct.
- .SH
- References
- .LP
- [Bac86] Maurice J. Bach,
- .I "The Design of the UNIX Operating System,
- Prentice-Hall,
- Englewood Cliffs,
- 1986.
- .LP
- [Dij72] Edsger W. Dijkstra,
- ``The Humble Programmer \- 1972 Turing Award Lecture'',
- .I "Comm. ACM,
- 15(10), pp. 859-866,
- October 1972.
- .LP
- [Hol91] Gerard J. Holzmann,
- .I "Design and Validation of Computer Protocols,
- Prentice-Hall,
- Englewood Cliffs,
- 1991.
- .LP
- [Pik90]
- Rob Pike,
- Dave Presotto,
- Ken Thompson,
- Howard Trickey,
- ``Plan 9 from Bell Labs'',
- .I "Proceedings of the Summer 1990 UKUUG Conference,
- pp. 1-9,
- London,
- July, 1990.
- .LP
- [Sal66] Jerome H. Saltzer,
- .I "Traffic Control in a Multiplexed Computer System,
- MIT,
- Cambridge, Mass.,
- 1966.
|