
<html>
<title>
Process Sleep and Wakeup on a Shared-memory Multiprocessor
</title>
<body BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#330088" ALINK="#FF0044">
<H1>Process Sleep and Wakeup on a Shared-memory Multiprocessor
</H1>
<DL><DD><I>Rob Pike<br>
Dave Presotto<br>
Ken Thompson<br>
Gerard Holzmann<br>
<br>&#32;<br>
rob,presotto,ken,gerard@plan9.bell-labs.com<br>
</I></DL>
<DL><DD><H4>ABSTRACT</H4>
<DL>
<DT><DT>&#32;<DD>
NOTE:<I> Appeared in a slightly different form in
Proceedings of the Spring 1991 EurOpen Conference,
Troms&oslash;, Norway, 1991, pp. 161-166.
</I><DT>&#32;<DD></DL>
<br>
The problem of enabling a `sleeping' process on a shared-memory multiprocessor
is a difficult one, especially if the process is to be awakened by an interrupt-time
event. We present here the code
for sleep and wakeup primitives that we use in our multiprocessor system.
The code has been exercised by years of active use and by a verification
system.
</DL>
<br>&#32;<br>
Our problem is to synchronise processes on a symmetric shared-memory multiprocessor.
Processes suspend execution, or
<I>sleep,</I>
while awaiting an enabling event such as an I/O interrupt.
When the event occurs, the process is issued a
<I>wakeup</I>
to resume its execution.
During these events, other processes may be running and other interrupts
occurring on other processors.
<br>&#32;<br>
More specifically, we wish to implement subroutines called
<TT>sleep</TT>,
callable by a process to relinquish control of its current processor,
and
<TT>wakeup</TT>,
callable by another process or an interrupt to resume the execution
of a suspended process.
The calling conventions of these subroutines will remain unspecified
for the moment.
<br>&#32;<br>
We assume the processors have an atomic test-and-set or equivalent
operation but no other synchronisation method. Also, we assume interrupts
can occur on any processor at any time, except on a processor that has
locally inhibited them.
<br>&#32;<br>
The problem is the generalisation to a multiprocessor of a familiar
and well-understood uniprocessor problem. It may be reduced to a
uniprocessor problem by using a global test-and-set to serialise the
sleeps and wakeups,
which is equivalent to synchronising through a monitor.
For performance and cleanliness, however,
we prefer to allow the interrupt handling and process control to be multiprocessed.
<br>&#32;<br>
Our attempts to solve the sleep/wakeup problem in Plan 9
[Pik90]
prompted this paper.
We implemented solutions several times over several months and each
time convinced ourselves - wrongly - they were correct.
Multiprocessor algorithms can be
difficult to prove correct by inspection and formal reasoning about them
is impractical. We finally developed an algorithm we trust by
verifying our code using an
empirical testing tool.
We present that code here, along with some comments about the process by
which it was designed.
<H4>History
</H4>
<br>&#32;<br>
Since processes in Plan 9 and the UNIX
system have similar structure and properties, one might ask if
UNIX
<TT>sleep</TT>
and
<TT>wakeup</TT>
[Bac86]
could not easily be adapted from their standard uniprocessor implementation
to our multiprocessor needs.
The short answer is, no.
<br>&#32;<br>
The
UNIX
routines
take as argument a single global address
that serves as a unique
identifier to connect the wakeup with the appropriate process or processes.
This has several inherent disadvantages.
From the point of view of
<TT>sleep</TT>
and
<TT>wakeup</TT>,
it is difficult to associate a data structure with an arbitrary address;
the routines are unable to maintain a state variable recording the
status of the event and processes.
(The reverse is of course easy - we could
require the address to point to a special data structure -
but we are investigating
UNIX
<TT>sleep</TT>
and
<TT>wakeup</TT>,
not the code that calls them.)
Also, multiple processes sleep `on' a given address, so
<TT>wakeup</TT>
must enable them all, and let process scheduling determine which process
actually benefits from the event.
This is inefficient;
a queueing mechanism would be preferable
but, again, it is difficult to associate a queue with a general address.
Moreover, the lack of state means that
<TT>sleep</TT>
and
<TT>wakeup</TT>
cannot know what the corresponding process (or interrupt) is doing;
<TT>sleep</TT>
and
<TT>wakeup</TT>
must be executed atomically.
On a uniprocessor it suffices to disable interrupts during their
execution.
On a multiprocessor, however,
most processors
can inhibit interrupts only on the current processor,
so while a process is executing
<TT>sleep</TT>
the desired interrupt can come and go on another processor.
If the wakeup is to be issued by another process, the problem is even harder.
Some inter-process mutual exclusion mechanism must be used,
which, yet again, is difficult to do without a way to communicate state.
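<br>&#32;<br>
To make the contrast concrete, here is a minimal sketch (not taken from the paper
or from any particular kernel) of the uniprocessor approach described above:
atomicity comes simply from inhibiting interrupts, and a wakeup on an address
enables every process sleeping on it. The names
<TT>splhi</TT>, <TT>splx</TT>, <TT>curproc</TT>, <TT>proctab</TT>,
<TT>setrun</TT> and <TT>swtch</TT> are hypothetical stand-ins.
<DL><DT><DD><TT><PRE>
typedef struct Proc Proc;
struct Proc {
    int   state;
    void *chan;                 /* the address being slept on */
};
enum { SSLEEP = 1, NPROC = 128 };
extern Proc proctab[NPROC], *curproc;
extern int  splhi(void);        /* inhibit interrupts; return old state */
extern void splx(int), setrun(Proc*), swtch(void);

void
uni_sleep(void *chan)
{
    int s;

    s = splhi();                /* enough for atomicity on one processor */
    curproc-&#62;chan = chan;
    curproc-&#62;state = SSLEEP;
    swtch();                    /* relinquish the processor */
    splx(s);
}

void
uni_wakeup(void *chan)
{
    Proc *p;
    int s;

    s = splhi();
    for(p = proctab; p &lt; proctab+NPROC; p++)
        if(p-&#62;state == SSLEEP &amp;&amp; p-&#62;chan == chan){
            p-&#62;chan = 0;
            setrun(p);          /* every sleeper on this address is enabled */
        }
    splx(s);
}
</PRE></TT></DL>
On a multiprocessor neither property survives: inhibiting interrupts on one
processor does not prevent the interrupt, or another processor's wakeup,
from running elsewhere at the same time.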
<br>&#32;<br>
In summary, to be useful on a multiprocessor,
UNIX
<TT>sleep</TT>
and
<TT>wakeup</TT>
must either be made to run atomically on a single
processor (such as by using a monitor)
or they need a richer model for their communication.
<H4>The design
</H4>
<br>&#32;<br>
Consider the case of an interrupt waking up a sleeping process.
(The other case, a process awakening a second process, is easier because
atomicity can be achieved using an interlock.)
The sleeping process is waiting for some event to occur, which may be
modeled by a condition coming true.
The condition could be just that the event has happened, or something
more subtle such as a queue draining below some low-water mark.
We represent the condition by a function of one
argument of type
<TT>void*</TT>;
the code supporting the device generating the interrupts
provides such a function to be used by
<TT>sleep</TT>
and
<TT>wakeup</TT>
to synchronise. The function returns
<TT>false</TT>
if the event has not occurred, and
<TT>true</TT>
some time after the event has occurred.
The
<TT>sleep</TT>
and
<TT>wakeup</TT>
routines must, of course, work correctly if the
event occurs while the process is executing
<TT>sleep</TT>.
<br>&#32;<br>
We assume that a particular call to
<TT>sleep</TT>
corresponds to a particular call to
<TT>wakeup</TT>,
that is,
at most one process is asleep waiting for a particular event.
This can be guaranteed in the code that calls
<TT>sleep</TT>
and
<TT>wakeup</TT>
by appropriate interlocks.
We also assume for the moment that there will be only one interrupt
and that it may occur at any time, even before
<TT>sleep</TT>
has been called.
<br>&#32;<br>
For performance,
we desire that multiple instances of
<TT>sleep</TT>
and
<TT>wakeup</TT>
may be running simultaneously on our multiprocessor.
For example, a process calling
<TT>sleep</TT>
to await a character from an input channel need not
wait for another process to finish executing
<TT>sleep</TT>
to await a disk block.
At a finer level, we would like a process reading from one input channel
to be able to execute
<TT>sleep</TT>
in parallel with a process reading from another input channel.
A standard approach to synchronisation is to interlock the channel `driver'
so that only one process may be executing in the channel code at once.
This method is clearly inadequate for our purposes; we need
fine-grained synchronisation, and in particular to apply
interlocks at the level of individual channels rather than at the level
of the channel driver.
<br>&#32;<br>
Our approach is to use an object called a
<I>rendezvous</I>,
which is a data structure through which
<TT>sleep</TT>
and
<TT>wakeup</TT>
synchronise.
(The similarly named construct in Ada is a control structure;
ours is an unrelated data structure.)
A rendezvous
is allocated for each active source of events:
one for each I/O channel,
one for each end of a pipe, and so on.
The rendezvous serves as an interlockable structure in which to record
the state of the sleeping process, so that
<TT>sleep</TT>
and
<TT>wakeup</TT>
can communicate if the event happens before or while
<TT>sleep</TT>
is executing.
<br>&#32;<br>
Our design for
<TT>sleep</TT>
is therefore a function
<DL><DT><DD><TT><PRE>
void sleep(Rendezvous *r, int (*condition)(void*), void *arg)
</PRE></TT></DL>
called by the sleeping process.
The argument
<TT>r</TT>
connects the call to
<TT>sleep</TT>
with the call to
<TT>wakeup</TT>,
and is part of the data structure for the (say) device.
The function
<TT>condition</TT>
is described above;
called with argument
<TT>arg</TT>,
it is used by
<TT>sleep</TT>
to decide whether the event has occurred.
<TT>Wakeup</TT>
has a simpler specification:
<DL><DT><DD><TT><PRE>
void wakeup(Rendezvous *r).
</PRE></TT></DL>
<TT>Wakeup</TT>
must be called after the condition has become true.
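<br>&#32;<br>
As an illustration of how a driver might use this interface, here is a
hypothetical character-input device, not taken from the paper or from Plan 9.
The type <TT>Kbd</TT> and the routines <TT>haschar</TT>, <TT>kbdintr</TT>,
<TT>kbdgetchar</TT> and <TT>readdatareg</TT> are invented names; only the
<TT>Rendezvous</TT> type and the <TT>sleep</TT> and <TT>wakeup</TT> calls are
those described here. Note that the interrupt handler makes the condition true
before calling <TT>wakeup</TT>, as required.
<DL><DT><DD><TT><PRE>
extern int readdatareg(void);   /* hypothetical: fetch a character from the device */

typedef struct {
    Rendezvous r;               /* one rendezvous per input channel */
    int        nchar;           /* count of buffered characters */
    char       buf[256];
} Kbd;
Kbd kbd;

static int
haschar(void *a)                /* the condition function */
{
    return ((Kbd*)a)-&#62;nchar &#62; 0;
}

void
kbdintr(void)                   /* interrupt side */
{
    kbd.buf[kbd.nchar++] = readdatareg();   /* condition becomes true ... */
    wakeup(&amp;kbd.r);                         /* ... and only then wakeup */
}

int
kbdgetchar(void)                /* process side */
{
    sleep(&amp;kbd.r, haschar, &amp;kbd);
    return kbd.buf[--kbd.nchar];    /* LIFO for brevity; a real driver would queue */
}
</PRE></TT></DL>
In a real driver the buffer and count would themselves be protected by the
device's own interlock; the sketch omits that to keep the rendezvous usage clear.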
<H4>An implementation
</H4>
<br>&#32;<br>
The
<TT>Rendezvous</TT>
data type is defined as
<DL><DT><DD><TT><PRE>
typedef struct{
    Lock l;
    Proc *p;
}Rendezvous;
</PRE></TT></DL>
Our
<TT>Locks</TT>
are test-and-set spin locks.
The routine
<TT>lock(Lock *l)</TT>
returns when the current process holds that lock;
<TT>unlock(Lock *l)</TT>
releases the lock.
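<br>&#32;<br>
One plausible shape for these primitives is sketched below; this is an
assumption about how such a lock could look, not the Plan 9 source. It relies
on a hypothetical routine <TT>tas</TT> that atomically sets a word and returns
its previous value.
<DL><DT><DD><TT><PRE>
typedef struct{
    int key;
}Lock;

extern int tas(int *key);   /* atomic test-and-set, assumed provided by the hardware */

void
lock(Lock *l)
{
    while(tas(&amp;l-&#62;key))
        ;                   /* spin until the old value was clear */
}

void
unlock(Lock *l)
{
    l-&#62;key = 0;
}
</PRE></TT></DL>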
<br>&#32;<br>
Here is our implementation of
<TT>sleep</TT>.
Its details are discussed below.
<TT>Thisp</TT>
is a pointer to the current process on the current processor.
(Its value differs on each processor.)
<DL><DT><DD><TT><PRE>
void
sleep(Rendezvous *r, int (*condition)(void*), void *arg)
{
    int s;

    s = inhibit();  /* interrupts */
    lock(&amp;r-&#62;l);
    /*
     * if condition happened, never mind
     */
    if((*condition)(arg)){
        unlock(&amp;r-&#62;l);
        allow();  /* interrupts */
        return;
    }
    /*
     * now we are committed to
     * change state and call scheduler
     */
    if(r-&#62;p)
        error("double sleep %d %d", r-&#62;p-&#62;pid, thisp-&#62;pid);
    thisp-&#62;state = Wakeme;
    r-&#62;p = thisp;
    unlock(&amp;r-&#62;l);
    allow(s);  /* interrupts */
    sched();  /* relinquish CPU */
}
</PRE></TT></DL>
Here is
<TT>wakeup</TT>.
<DL><DT><DD><TT><PRE>
void
wakeup(Rendezvous *r)
{
    Proc *p;
    int s;

    s = inhibit();  /* interrupts; return old state */
    lock(&amp;r-&#62;l);
    p = r-&#62;p;
    if(p){
        r-&#62;p = 0;
        if(p-&#62;state != Wakeme)
            panic("wakeup: not Wakeme");
        ready(p);
    }
    unlock(&amp;r-&#62;l);
    if(s)
        allow();
}
</PRE></TT></DL>
<TT>Sleep</TT>
and
<TT>wakeup</TT>
both begin by disabling interrupts
and then locking the rendezvous structure.
Because
<TT>wakeup</TT>
may be called in an interrupt routine, the lock must be set only
with interrupts disabled on the current processor,
so that if the interrupt comes during
<TT>sleep</TT>
it will occur only on a different processor;
if it occurred on the processor executing
<TT>sleep</TT>,
the spin lock in
<TT>wakeup</TT>
would hang forever.
At the end of each routine, the lock is released and processor priority
returned to its previous value.
(<TT>Wakeup</TT>
needs to inhibit interrupts in case
it is being called by a process;
this is a no-op if called by an interrupt.)
<br>&#32;<br>
<TT>Sleep</TT>
checks to see if the condition has become true, and returns if so.
Otherwise the process posts its name in the rendezvous structure where
<TT>wakeup</TT>
may find it, marks its state as waiting to be awakened
(this is for error checking only) and goes to sleep by calling
<TT>sched()</TT>.
The manipulation of the rendezvous structure is all done under the lock,
and
<TT>wakeup</TT>
only examines it under lock, so atomicity and mutual exclusion
are guaranteed.
<br>&#32;<br>
<TT>Wakeup</TT>
has a simpler job. When it is called, the condition has implicitly become true,
so it locks the rendezvous, sees if a process is waiting, and readies it to run.
<H4>Discussion
</H4>
<br>&#32;<br>
The synchronisation technique used here
is similar to known methods, even as far back as Saltzer's thesis
[Sal66].
The code looks trivially correct in retrospect: all access to data structures is done
under lock, and there is no place that things may get out of order.
Nonetheless, it took us several iterations to arrive at the above
implementation, because the things that
<I>can</I>
go wrong are often hard to see. We had four earlier implementations
that were examined at great length and only found faulty when a new,
different style of device or activity was added to the system.
<br>&#32;<br>
Here, for example, is an incorrect implementation of wakeup,
closely related to one of our versions.
<DL><DT><DD><TT><PRE>
void
wakeup(Rendezvous *r)
{
    Proc *p;
    int s;

    p = r-&#62;p;
    if(p){
        s = inhibit();
        lock(&amp;r-&#62;l);
        r-&#62;p = 0;
        if(p-&#62;state != Wakeme)
            panic("wakeup: not Wakeme");
        ready(p);
        unlock(&amp;r-&#62;l);
        if(s)
            allow();
    }
}
</PRE></TT></DL>
The mistake is that the reading of
<TT>r->p</TT>
may occur just as the other process calls
<TT>sleep</TT>,
so when the interrupt examines the structure it sees no one to wake up,
and the sleeping process misses its wakeup.
We wrote the code this way because we reasoned that the fetch
<TT>p</TT>
<TT>=</TT>
<TT>r->p</TT>
was inherently atomic and need not be interlocked.
The bug was found by examination when a new, very fast device
was added to the system and sleeps and interrupts were closely overlapped.
However, it was in the system for a couple of months without causing an error.
<br>&#32;<br>
How many errors lurk in our supposedly correct implementation above?
We would like a way to guarantee correctness; formal proofs are beyond
our abilities when the subtleties of interrupts and multiprocessors are
involved.
With that in mind, the first three authors approached the last to see
if his automated tool for checking protocols
[Hol91]
could be
used to verify our new
<TT>sleep</TT>
and
<TT>wakeup</TT>
for correctness.
The code was translated into the language for that system
(with, unfortunately, no way of proving that the translation is itself correct)
and validated by exhaustive simulation.
<br>&#32;<br>
The validator found a bug.
Under our assumption that there is only one interrupt, the bug cannot
occur, but in the more general case of multiple interrupts synchronising
through the same condition function and rendezvous,
the process and interrupt can enter a peculiar state.
A process may return from
<TT>sleep</TT>
with the condition function false
if there is a delay between
the condition coming true and
<TT>wakeup</TT>
being called,
with the delay occurring
just as the receiving process calls
<TT>sleep</TT>.
The condition is now true, so that process returns immediately,
does whatever is appropriate, and then (say) decides to call
<TT>sleep</TT>
again. This time the condition is false, so it goes to sleep.
The wakeup process then finds a sleeping process,
and wakes it up, but the condition is now false.
<br>&#32;<br>
There is an easy (and verified) solution: at the end of
<TT>sleep</TT>
or after
<TT>sleep</TT>
returns,
if the condition is false, execute
<TT>sleep</TT>
again. This re-execution cannot repeat; the second synchronisation is guaranteed
to function under the external conditions we are supposing.
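<br>&#32;<br>
Expressed at the calling side, the modification might look like the following
sketch; the wrapper name <TT>sleepuntil</TT> is invented here, and whether the
retry lives in the caller or at the end of
<TT>sleep</TT>
itself is a matter of taste.
<DL><DT><DD><TT><PRE>
void
sleepuntil(Rendezvous *r, int (*condition)(void*), void *arg)
{
    do
        sleep(r, condition, arg);
    while(!(*condition)(arg));  /* re-execute if awakened with the condition false */
}
</PRE></TT></DL>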
<br>&#32;<br>
Even though the original code is completely
protected by interlocks and had been examined carefully by all of us
and believed correct, it still had problems.
It seems to us that some exhaustive automated analysis is
required of multiprocessor algorithms to guarantee their safety.
Our experience has confirmed that it is almost impossible to
guarantee by inspection or simple testing the correctness
of a multiprocessor algorithm. Testing can demonstrate the presence
of bugs but not their absence
[Dij72].
<br>&#32;<br>
We close by claiming that the code above with
the suggested modification passes all tests we have for correctness
under the assumptions used in the validation.
We would not, however, go so far as to claim that it is universally correct.
<H4>References
</H4>
<br>&#32;<br>
[Bac86] Maurice J. Bach,
<I>The Design of the UNIX Operating System,</I>
Prentice-Hall,
Englewood Cliffs,
1986.
<br>&#32;<br>
[Dij72] Edsger W. Dijkstra,
``The Humble Programmer - 1972 Turing Award Lecture'',
<I>Comm. ACM,</I>
15(10), pp. 859-866,
October 1972.
<br>&#32;<br>
[Hol91] Gerard J. Holzmann,
<I>Design and Validation of Computer Protocols,</I>
Prentice-Hall,
Englewood Cliffs,
1991.
<br>&#32;<br>
[Pik90] Rob Pike,
Dave Presotto,
Ken Thompson,
Howard Trickey,
``Plan 9 from Bell Labs'',
<I>Proceedings of the Summer 1990 UKUUG Conference,</I>
pp. 1-9,
London,
July, 1990.
<br>&#32;<br>
[Sal66] Jerome H. Saltzer,
<I>Traffic Control in a Multiplexed Computer System,</I>
MIT,
Cambridge, Mass.,
1966.
<br>&#32;<br>
<A href=http://www.lucent.com/copyright.html>
Copyright</A> &#169; 2000 Lucent Technologies Inc. All rights reserved.
</body></html>