- <html>
- <title>
- data
- </title>
- <body BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#330088" ALINK="#FF0044">
- <H1>The Use of Name Spaces in Plan 9
- </H1>
- <DL><DD><I>Rob Pike<br>
- Dave Presotto<br>
- Ken Thompson<br>
- Howard Trickey<br>
- Phil Winterbottom<br>
- Bell Laboratories, Murray Hill, NJ, 07974
- USA<br>
- </I></DL>
- <DL><DD><H4>ABSTRACT</H4>
- <DL>
- <DT><DT> <DD>
- NOTE:<I> Appeared in
- Operating Systems Review,
- Vol. 27, #2, April 1993, pp. 72-76
- (reprinted from
- Proceedings of the 5th ACM SIGOPS European Workshop,
- Mont Saint-Michel, 1992, Paper nº 34).
- </I><DT> <DD></dl>
- <br>
- Plan 9 is a distributed system built at the Computing Sciences Research
- Center of AT&amp;T Bell Laboratories (now Lucent Technologies, Bell Labs) over the last few years.
- Its goal is to provide a production-quality system for software
- development and general computation using heterogeneous hardware
- and minimal software. A Plan 9 system comprises CPU and file
- servers in a central location connected together by fast networks.
- Slower networks fan out to workstation-class machines that serve as
- user terminals. Plan 9 argues that given a few carefully
- implemented abstractions
- it is possible to
- produce a small operating system that provides support for the largest systems
- on a variety of architectures and networks. The foundations of the system are
- built on two ideas: a per-process name space and a simple message-oriented
- file system protocol.
- </DL>
- <P>
- The operating system for the CPU servers and terminals is
- structured as a traditional kernel: a single compiled image
- containing code for resource management, process control,
- user processes,
- virtual memory, and I/O. Because the file server is a separate
- machine, the file system is not compiled in, although the management
- of the name space, a per-process attribute, is.
- The entire kernel for the multiprocessor SGI Power Series machine
- is 25000 lines of C,
- the largest part of which is code for four networks including the
- Ethernet with the Internet protocol suite.
- Fewer than 1500 lines are machine-specific, and a
- functional kernel with minimal I/O can be put together from
- source files totaling 6000 lines. [Pike90]
- </P>
- <P>
- The system is relatively small for several reasons.
- First, it is all new: it has not had time to accrete as many fixes
- and features as other systems.
- Also, other than the network protocol, it adheres to no
- external interface; in particular, it is not Unix-compatible.
- Economy stems from careful selection of services and interfaces.
- Finally, wherever possible the system is built around
- two simple ideas:
- every resource in the system, either local or remote,
- is represented by a hierarchical file system; and
- a user or process
- assembles a private view of the system by constructing a file
- name space
- that connects these resources. [Needham]
- </P>
- <H4>File Protocol
- </H4>
- <P>
- All resources in Plan 9 look like file systems.
- That does not mean that they are repositories for
- permanent files on disk, but that the interface to them
- is file-oriented: finding files (resources) in a hierarchical
- name tree, attaching to them by name, and accessing their contents
- by read and write calls.
- There are dozens of file system types in Plan 9, but only a few
- represent traditional files.
- At this level of abstraction, files in Plan 9 are similar
- to objects, except that files are already provided with naming,
- access, and protection methods that must be created afresh for
- objects. Object-oriented readers may approach the rest of this
- paper as a study in how to make objects look like files.
- </P>
- <P>
- The interface to file systems is defined by a protocol, called 9P,
- analogous but not very similar to the NFS protocol.
- The protocol talks about files, not blocks; given a connection to the root
- directory of a file server,
- the 9P messages navigate the file hierarchy, open files for I/O,
- and read or write arbitrary bytes in the files.
- 9P contains 17 message types: three for
- initializing and
- authenticating a connection and fourteen for manipulating objects.
- The messages are generated by the kernel in response to user- or
- kernel-level I/O requests.
- Here is a quick tour of the major message types.
- The
- <TT>auth</TT>
- and
- <TT>attach</TT>
- messages authenticate a connection, established by means outside 9P,
- and validate its user.
- The result is an authenticated
- <I>channel</I>
- that points to the root of the
- server.
- The
- <TT>clone</TT>
- message makes a new channel identical to an existing channel,
- which may be moved to a file on the server using a
- <TT>walk</TT>
- message to descend each level in the hierarchy.
- The
- <TT>stat</TT>
- and
- <TT>wstat</TT>
- messages read and write the attributes of the file pointed to by a channel.
- The
- <TT>open</TT>
- message prepares a channel for subsequent
- <TT>read</TT>
- and
- <TT>write</TT>
- messages to access the contents of the file, while
- <TT>create</TT>
- and
- <TT>remove</TT>
- perform, on the files, the actions implied by their names.
- The
- <TT>clunk</TT>
- message discards a channel without affecting the file.
- None of the 9P messages consider caching; file caches are provided,
- when needed, either within the server (centralized caching)
- or by implementing the cache as a transparent file system between the
- client and the 9P connection to the server (client caching).
- </P>
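- <P>
- The tour of message types above can be made concrete with a toy, in-memory model.
- This is only a sketch: it is not the 9P wire protocol, it omits authentication,
- message framing, and error replies, and the class name
- <TT>ToyServer</TT>,
- the integer channel ids, and the sample file tree are all hypothetical.
- </P>

```python
# Toy model of the 9P operations named in the text. Nested dicts stand in
# for directories and bytes objects for file contents.

class ToyServer:
    def __init__(self, tree):
        self.tree = tree
        self.channels = {}  # channel id -> {"path": [names], "open": bool}

    def attach(self, chan):
        # After auth/attach the client holds a channel to the server root.
        self.channels[chan] = {"path": [], "open": False}

    def clone(self, chan, newchan):
        # Make a new channel identical to an existing one.
        path = list(self.channels[chan]["path"])
        self.channels[newchan] = {"path": path, "open": False}

    def _node(self, path):
        node = self.tree
        for name in path:
            node = node[name]
        return node

    def walk(self, chan, name):
        # Move the channel down one level of the hierarchy.
        c = self.channels[chan]
        parent = self._node(c["path"])
        if not isinstance(parent, dict) or name not in parent:
            raise FileNotFoundError(name)
        c["path"].append(name)

    def open(self, chan):
        # Prepare the channel for read and write.
        self.channels[chan]["open"] = True

    def read(self, chan, offset, count):
        c = self.channels[chan]
        if not c["open"]:
            raise PermissionError("channel not open")
        data = self._node(c["path"])
        if isinstance(data, dict):
            raise IsADirectoryError("/".join(c["path"]))
        return data[offset:offset + count]

    def clunk(self, chan):
        # Discard the channel; the file itself is unaffected.
        del self.channels[chan]


server = ToyServer({"sys": {"motd": b"hello from the file server\n"}})
server.attach(0)                 # channel 0 points at the root
server.clone(0, 1)               # channel 1 starts as a copy of channel 0
for name in ("sys", "motd"):
    server.walk(1, name)         # one walk per level
server.open(1)
first = server.read(1, 0, 5)
server.clunk(1)
```

- <P>
- Cloning before walking leaves the root channel untouched, so it can be reused for later lookups.
- </P>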
- <P>
- For efficiency, the connection to local
- kernel-resident file systems, misleadingly called
- <I>devices,</I>
- is by regular rather than remote procedure calls.
- The procedures map one-to-one with 9P message types.
- Locally each channel has an associated data structure
- that holds a type field used to index
- a table of procedure calls, one set per file system type,
- analogous to selecting the method set for an object.
- One kernel-resident file system, the
- mount device,
- translates the local 9P procedure calls into RPC messages to
- remote services over a separately provided transport protocol
- such as TCP or IL, a new reliable datagram protocol, or over a pipe to
- a user process.
- Write and read calls transmit the messages over the transport layer.
- The mount device is the sole bridge between the procedural
- interface seen by user programs and remote and user-level services.
- It does all associated marshaling, buffer
- management, and multiplexing and is
- the only integral RPC mechanism in Plan 9.
- The mount device is in effect a proxy object.
- There is no RPC stub compiler; instead the mount driver and
- all servers just share a library that packs and unpacks 9P messages.
- </P>
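- <P>
- The dispatch through a per-channel type field, and the role of the mount device as a proxy,
- might be sketched as follows.
- The sketch rests on several assumptions: the names
- <TT>DEVTAB</TT>,
- <TT>Channel</TT>,
- and
- <TT>chan_read</TT>
- are hypothetical, only a single
- <TT>read</TT>
- procedure is shown, JSON stands in for the library that packs and unpacks 9P messages,
- and a plain function call stands in for the transport.
- </P>

```python
import json


class ConsoleDevice:
    """A kernel-resident file system, reached by an ordinary procedure call."""

    def read(self, path, count):
        return b"typed text"[:count]


class MountDevice:
    """Proxy that turns the same procedure call into a message to a server."""

    def __init__(self, transport):
        # transport: any callable taking request bytes and returning reply bytes
        self.transport = transport

    def read(self, path, count):
        # Marshal the call into a message (JSON is a stand-in for 9P packing).
        request = json.dumps({"type": "read", "path": path, "count": count})
        return self.transport(request.encode())


def fake_remote_server(request):
    # Stand-in for a remote or user-level server answering one message.
    msg = json.loads(request.decode())
    if msg["type"] != "read":
        raise ValueError("unsupported message")
    return b"remote data"[:msg["count"]]


# One set of procedures per file system type, selected by the type field.
DEVTAB = {
    "cons": ConsoleDevice(),
    "mnt": MountDevice(fake_remote_server),
}


class Channel:
    def __init__(self, devtype, path):
        self.type = devtype  # index into DEVTAB
        self.path = path


def chan_read(chan, count):
    # The caller cannot tell whether the call stays local or becomes an RPC.
    return DEVTAB[chan.type].read(chan.path, count)


local = chan_read(Channel("cons", "/dev/cons"), 5)
remote = chan_read(Channel("mnt", "/n/remote/file"), 6)
```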
- <H4>Examples
- </H4>
- <P>
- One file system type serves
- permanent files from the main file server,
- a stand-alone multiprocessor system with a
- 350-gigabyte
- optical WORM jukebox that holds the data, fronted by a two-level
- block cache comprising 7 gigabytes of
- magnetic disk and 128 megabytes of RAM.
- Clients connect to the file server using any of a variety of
- networks and protocols and access files using 9P.
- The file server runs a distinct operating system and has no
- support for user processes; other than a restricted set of commands
- available on the console, all it does is answer 9P messages from clients.
- </P>
- <P>
- Once a day, at 5:00 AM,
- the file server sweeps through the cache blocks and marks dirty blocks
- copy-on-write.
- It creates a copy of the root directory
- and labels it with the current date, for example
- <TT>1995/0314</TT>.
- It then starts a background process to copy the dirty blocks to the WORM.
- The result is that the server retains an image of the file system as it was
- early each morning.
- The set of old root directories is accessible using 9P, so a client
- may examine backup files using ordinary commands.
- Several advantages stem from having the backup service implemented
- as a plain file system.
- Most obviously, ordinary commands can access the backed-up files.
- For example, to see when a bug was fixed
- <DL><DT><DD><TT><PRE>
- grep 'mouse bug fix' 1995/*/sys/src/cmd/8½/file.c
- </PRE></TT></DL>
- The owner, access times, permissions, and other properties of the
- files are also backed up.
- Because it is a file system, the backup
- still has protections;
- it is not possible to subvert security by looking at the backup.
- </P>
- <P>
- The file server is only one type of file system.
- A number of unusual services are provided within the kernel as
- local file systems.
- These services are not limited to I/O devices such
- as disks. They include network devices and their associated protocols,
- the bitmap display and mouse,
- a representation of processes similar to
- <TT>/proc</TT>
- [Killian], the name/value pairs that form the `environment'
- passed to a new process, profiling services,
- and other resources.
- Each of these is represented as a file system
- (directories containing sets of files),
- but the constituent files do not represent permanent storage on disk.
- Instead, they are closer in properties to UNIX device files.
- </P>
- <P>
- For example, the
- <I>console</I>
- device contains the file
- <TT>/dev/cons</TT>,
- similar to the UNIX file
- <TT>/dev/console</TT>:
- when written,
- <TT>/dev/cons</TT>
- appends to the console typescript; when read,
- it returns characters typed on the keyboard.
- Other files in the console device include
- <TT>/dev/time</TT>,
- the number of seconds since the epoch,
- <TT>/dev/cputime</TT>,
- the computation time used by the process reading the device,
- <TT>/dev/pid</TT>,
- the process id of the process reading the device, and
- <TT>/dev/user</TT>,
- the login name of the user accessing the device.
- All these files contain text, not binary numbers,
- so their use is free of byte-order problems.
- Their contents are synthesized on demand when read; when written,
- they cause modifications to kernel data structures.
- </P>
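- <P>
- The idea of files whose textual contents are synthesized on demand can be illustrated with a small sketch.
- The table
- <TT>SYNTHETIC</TT>
- and the function
- <TT>read_file</TT>
- are hypothetical names, and the host's clock and process id stand in for kernel data structures.
- </P>

```python
import os
import time


def read_time():
    # Seconds since the epoch, as decimal text rather than a binary integer.
    return str(int(time.time()))


def read_pid():
    # Process id of the reader, again as text.
    return str(os.getpid())


# File name -> function that produces the contents at the moment of the read.
SYNTHETIC = {
    "/dev/time": read_time,
    "/dev/pid": read_pid,
}


def read_file(name):
    return SYNTHETIC[name]()


# A reader on any architecture recovers the number with a text conversion,
# so no byte-order agreement is needed.
pid = int(read_file("/dev/pid"))
```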
- <P>
- The
- <I>process</I>
- device contains one directory per live local process, named by its numeric
- process id:
- <TT>/proc/1</TT>,
- <TT>/proc/2</TT>,
- etc.
- Each directory contains a set of files that access the process.
- For example, in each directory the file
- <TT>mem</TT>
- is an image of the virtual memory of the process that may be read or
- written for debugging.
- The
- <TT>text</TT>
- file is a sort of link to the file from which the process was executed;
- it may be opened to read the symbol tables for the process.
- Textual messages such as
- <TT>stop</TT>
- or
- <TT>kill</TT>
- may be written to the
- <TT>ctl</TT>
- file to control the execution of the process.
- The
- <TT>status</TT>
- file contains a fixed-format line of text containing information about
- the process: its name, owner, state, and so on.
- Text strings written to the
- <TT>note</TT>
- file are delivered to the process as
- <I>notes,</I>
- analogous to UNIX signals.
- By providing these services as textual I/O on files rather
- than as system calls (such as
- <TT>kill</TT>)
- or special-purpose operations (such as
- <TT>ptrace</TT>),
- the Plan 9 process device simplifies the implementation of
- debuggers and related programs.
- For example, the command
- <DL><DT><DD><TT><PRE>
- cat /proc/*/status
- </PRE></TT></DL>
- is a crude form of the
- <TT>ps</TT>
- command; the actual
- <TT>ps</TT>
- merely reformats the data so obtained.
- </P>
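- <P>
- A
- <TT>ps</TT>-like
- reformatter over status files can be sketched in a few lines.
- The exact layout of a status line is not given here, so the sketch assumes that the first three
- whitespace-separated fields are name, owner, and state.
- The function name
- <TT>crude_ps</TT>
- and the sample processes are hypothetical, and the demonstration runs against ordinary files in a
- temporary directory rather than a real process device.
- </P>

```python
import glob
import os
import tempfile


def crude_ps(proc_root):
    """Return one formatted line per process directory under proc_root."""
    rows = []
    for path in sorted(glob.glob(os.path.join(proc_root, "*", "status"))):
        pid = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            fields = f.readline().split()
        # Assumed field order: name, owner, state.
        name, owner, state = fields[:3]
        rows.append(f"{owner:<8} {pid:>6} {state:<8} {name}")
    return rows


# Build a stand-in process tree out of ordinary files.
demo_root = tempfile.mkdtemp()
for pid, line in (("1", "init alice Ready\n"), ("27", "db bob Stopped\n")):
    os.mkdir(os.path.join(demo_root, pid))
    with open(os.path.join(demo_root, pid, "status"), "w") as f:
        f.write(line)

rows = crude_ps(demo_root)
```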
- <P>
- The
- <I>bitmap</I>
- device contains three files,
- <TT>/dev/mouse</TT>,
- <TT>/dev/screen</TT>,
- and
- <TT>/dev/bitblt</TT>,
- that provide an interface to the local bitmap display (if any) and pointing device.
- The
- <TT>mouse</TT>
- file returns a fixed-format record containing
- 1 byte of button state and 4 bytes each of
- <I>x</I>
- and
- <I>y</I>
- position of the mouse.
- If the mouse has not moved since the file was last read, a subsequent read will
- block.
- The
- <TT>screen</TT>
- file contains a memory image of the contents of the display;
- the
- <TT>bitblt</TT>
- file provides a procedural interface.
- Calls to the graphics library are translated into messages that are written
- to the
- <TT>bitblt</TT>
- file to perform bitmap graphics operations. (This is essentially a nested
- RPC protocol.)
- </P>
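- <P>
- A reader of the
- <TT>mouse</TT>
- file has to unpack the fixed-format record described above.
- The following sketch decodes such a record, assuming little-endian byte order and signed coordinates;
- neither is stated in the text, and the name
- <TT>decode_mouse</TT>
- is hypothetical.
- </P>

```python
import struct

# 1 byte of button state, then 4 bytes each of x and y.
# "<" selects little-endian with no padding (byte order is an assumption).
MOUSE_RECORD = struct.Struct("<Bii")


def decode_mouse(record):
    """Unpack one 9-byte mouse record into (buttons, x, y)."""
    buttons, x, y = MOUSE_RECORD.unpack(record)
    return buttons, x, y


# A fabricated record standing in for bytes read from the mouse file.
sample = MOUSE_RECORD.pack(1, 640, 480)
buttons, x, y = decode_mouse(sample)
```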
- <P>
- The various services being used by a process are gathered together into the
- process's
- name space,
- a single rooted hierarchy of file names.
- When a process forks, the child process shares the name space with the parent.
- Several system calls manipulate name spaces.
- Given a file descriptor
- <TT>fd</TT>
- that holds an open communications channel to a service,
- the call
- <DL><DT><DD><TT><PRE>
- mount(int fd, char *old, int flags)
- </PRE></TT></DL>
- authenticates the user and attaches the file tree of the service to
- the directory named by
- <TT>old</TT>.
- The
- <TT>flags</TT>
- specify how the tree is to be attached to
- <TT>old</TT>:
- replacing the current contents or appearing before or after the
- current contents of the directory.
- A directory with several services mounted is called a
- <I>union</I>
- directory and is searched in the specified order.
- The call
- <DL><DT><DD><TT><PRE>
- bind(char *new, char *old, int flags)
- </PRE></TT></DL>
- takes the portion of the existing name space visible at
- <TT>new</TT>,
- either a file or a directory, and makes it also visible at
- <TT>old</TT>.
- For example,
- <DL><DT><DD><TT><PRE>
- bind("1995/0301/sys/include", "/sys/include", REPLACE)
- </PRE></TT></DL>
- causes the directory of include files to be overlaid with its
- contents from the dump on March first.
- </P>
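- <P>
- The effect of the flags on a union directory can be modeled with an ordered list of trees per mount point.
- This is a sketch of the lookup rule only, not of the kernel's data structures.
- The class
- <TT>NameSpace</TT>
- is hypothetical, the constants
- <TT>BEFORE</TT>
- and
- <TT>AFTER</TT>
- are assumed names for the two ordering flags, and dictionaries stand in for file trees.
- </P>

```python
REPLACE, BEFORE, AFTER = range(3)  # stand-ins for the real flag values


class NameSpace:
    def __init__(self):
        # mount point -> ordered list of file trees (dicts of name -> contents)
        self.mounts = {}

    def bind(self, tree, old, flag):
        """Attach tree at the directory named old, as the flag directs."""
        current = self.mounts.get(old, [])
        if flag == REPLACE:
            self.mounts[old] = [tree]
        elif flag == BEFORE:
            self.mounts[old] = [tree] + current
        elif flag == AFTER:
            self.mounts[old] = current + [tree]
        else:
            raise ValueError("unknown flag")

    def lookup(self, directory, name):
        """Search a union directory in the order its trees were arranged."""
        for tree in self.mounts.get(directory, []):
            if name in tree:
                return tree[name]
        raise FileNotFoundError(f"{directory}/{name}")


ns = NameSpace()
ns.bind({"ls": "system ls", "cat": "system cat"}, "/bin", REPLACE)
ns.bind({"ls": "private ls"}, "/bin", BEFORE)

found_ls = ns.lookup("/bin", "ls")    # the tree bound BEFORE is searched first
found_cat = ns.lookup("/bin", "cat")  # falls through to the earlier tree
```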
- <P>
- A process is created by the
- <TT>rfork</TT>
- system call, which takes as argument a bit vector defining which
- attributes of the process are to be shared between parent
- and child instead of copied.
- One of the attributes is the name space: when shared, changes
- made by either process are visible in the other; when copied,
- changes are independent.
- </P>
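- <P>
- The difference between a shared and a copied name space can be shown with a very small model.
- The real
- <TT>rfork</TT>
- takes a bit vector covering several attributes; the boolean argument below is a stand-in for the
- name space bit alone, and the function name and sample paths are hypothetical.
- </P>

```python
import copy


def rfork_namespace(parent_ns, share):
    """Return the child's name space: the same object, or an independent copy."""
    return parent_ns if share else copy.deepcopy(parent_ns)


parent = {"/bin": ["/mips/bin"]}
shared = rfork_namespace(parent, share=True)
copied = rfork_namespace(parent, share=False)

# A change made through the shared name space is visible to the parent;
# the copied name space is unaffected.
shared["/bin"].insert(0, "/usr/alice/bin")
```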
- <P>
- Although there is no global name space,
- for a process to function sensibly the local name spaces must adhere
- to global conventions.
- Nonetheless, the use of local name spaces is critical to the system.
- Both these ideas are illustrated by the use of the name space to
- handle heterogeneity.
- The binaries for a given architecture are contained in a directory
- named by the architecture, for example
- <TT>/mips/bin</TT>;
- in use, that directory is bound to the conventional location
- <TT>/bin</TT>.
- Programs such as shell scripts need not know the CPU type they are
- executing on to find binaries to run.
- A directory of private binaries
- is usually unioned with
- <TT>/bin</TT>.
- (Compare this to the
- ad hoc
- and special-purpose idea of the
- <TT>PATH</TT>
- variable, which is not used in the Plan 9 shell.)
- Local bindings are also helpful for debugging, for example by binding
- an old library to the standard place and linking a program to see
- if recent changes to the library are responsible for a bug in the program.
- </P>
- <P>
- The window system,
- <TT>8½</TT>
- [Pike91], is a server for files such as
- <TT>/dev/cons</TT>
- and
- <TT>/dev/bitblt</TT>.
- Each client sees a distinct copy of these files in its local
- name space: there are many instances of
- <TT>/dev/cons</TT>,
- each served by
- <TT>8½</TT>
- to the local name space of a window.
- Again,
- <TT>8½</TT>
- implements services using
- local name spaces plus the use
- of I/O to conventionally named files.
- Each client just connects its standard input, output, and error files
- to
- <TT>/dev/cons</TT>,
- with analogous operations to access bitmap graphics.
- Compare this to the implementation of
- <TT>/dev/tty</TT>
- on UNIX, which is done by special code in the kernel
- that overloads the file, when opened,
- with the standard input or output of the process.
- Special arrangement must be made by a UNIX window system for
- <TT>/dev/tty</TT>
- to behave as expected;
- <TT>8½</TT>
- instead makes the provision of the corresponding file its
- central idea, an approach whose success depends critically on local name spaces.
- </P>
- <P>
- The environment
- <TT>8½</TT>
- provides its clients is exactly the environment under which it is implemented:
- a conventional set of files in
- <TT>/dev</TT>.
- This permits the window system to be run recursively in one of its own
- windows, which is handy for debugging.
- It also means that if the files are exported to another machine,
- as described below, the window system or client applications may be
- run transparently on remote machines, even ones without graphics hardware.
- This mechanism is used for Plan 9's implementation of the X window
- system: X is run as a client of
- <TT>8½</TT>,
- often on a remote machine with lots of memory.
- In this configuration, using Ethernet to connect
- MIPS machines, we measure only a 10% degradation in graphics
- performance relative to running X on
- a bare Plan 9 machine.
- </P>
- <P>
- An unusual application of these ideas is a statistics-gathering
- file system implemented by a command called
- <TT>iostats</TT>.
- The command encapsulates a process in a local name space, monitoring 9P
- requests from the process to the outside world, that is, the name space in which
- <TT>iostats</TT>
- is itself running. When the command completes,
- <TT>iostats</TT>
- reports usage and performance figures for file activity.
- For example
- <DL><DT><DD><TT><PRE>
- iostats 8½
- </PRE></TT></DL>
- can be used to discover how much I/O the window system
- does to the bitmap device, font files, and so on.
- </P>
- <P>
- The
- <TT>import</TT>
- command connects a piece of name space from a remote system
- to the local name space.
- Its implementation is to dial the remote machine and start
- a process there that serves the remote name space using 9P.
- It then calls
- <TT>mount</TT>
- to attach the connection to the name space and finally dies;
- the remote process continues to serve the files.
- One use is to access devices not available
- locally. For example, to write a floppy one may say
- <DL><DT><DD><TT><PRE>
- import lab.pc /a: /n/dos
- cp foo /n/dos/bar
- </PRE></TT></DL>
- The call to
- <TT>import</TT>
- connects the file tree from
- <TT>/a:</TT>
- on the machine
- <TT>lab.pc</TT>
- (which must support 9P) to the local directory
- <TT>/n/dos</TT>.
- Then the file
- <TT>foo</TT>
- can be written to the floppy just by copying it across.
- </P>
- <P>
- Another application is remote debugging:
- <DL><DT><DD><TT><PRE>
- import helix /proc
- </PRE></TT></DL>
- makes the process file system on machine
- <TT>helix</TT>
- available locally; commands such as
- <TT>ps</TT>
- then see
- <TT>helix</TT>'s
- processes instead of the local ones.
- The debugger may then look at a remote process:
- <DL><DT><DD><TT><PRE>
- db /proc/27/text /proc/27/mem
- </PRE></TT></DL>
- allows breakpoint debugging of the remote process.
- Since
- <TT>db</TT>
- infers the CPU type of the process from the executable header on
- the text file, it supports
- cross-architecture debugging, too.
- Care is taken within
- <TT>db</TT>
- to handle issues of byte order and floating point; it is possible to
- breakpoint debug a big-endian MIPS process from a little-endian i386.
- </P>
- <P>
- Network interfaces are also implemented as file systems [Presotto].
- For example,
- <TT>/net/tcp</TT>
- is a directory somewhat like
- <TT>/proc</TT>:
- it contains a set of numbered directories, one per connection,
- each of which contains files to control and communicate on the connection.
- A process allocates a new connection by accessing
- <TT>/net/tcp/clone</TT>,
- which evaluates to the directory of an unused connection.
- To make a call, the process writes a textual message such as
- <TT>'connect 135.104.53.2!512'</TT>
- to the
- <TT>ctl</TT>
- file and then reads and writes the
- <TT>data</TT>
- file.
- An
- <TT>rlogin</TT>
- service can be implemented in a few lines of shell code.
- </P>
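- <P>
- The sequence for making a call might look like the following sketch.
- It assumes that reading
- <TT>clone</TT>
- yields the number of the unused connection, which the text does not spell out,
- and it closes
- <TT>ctl</TT>
- after writing, a simplification that a real client may not be able to afford.
- The function name
- <TT>dial</TT>
- is hypothetical, and the demonstration uses ordinary files in a temporary directory in place of a network device.
- </P>

```python
import os
import tempfile


def dial(net_dir, address):
    """Allocate a connection, ask it to connect, and return its data file path."""
    # Assumption: reading clone returns the number of an unused connection.
    with open(os.path.join(net_dir, "clone")) as f:
        conn = f.read().strip()
    conn_dir = os.path.join(net_dir, conn)
    # The request is plain text written to the connection's ctl file.
    with open(os.path.join(conn_dir, "ctl"), "w") as ctl:
        ctl.write(f"connect {address}")
    # The caller then reads and writes this file to communicate.
    return os.path.join(conn_dir, "data")


# Stand-in for a protocol directory, built from ordinary files.
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "0"))
with open(os.path.join(root, "clone"), "w") as f:
    f.write("0\n")
for name in ("ctl", "data"):
    open(os.path.join(root, "0", name), "w").close()

data_path = dial(root, "135.104.53.2!512")
with open(os.path.join(root, "0", "ctl")) as f:
    ctl_text = f.read()
```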
- <P>
- This structure makes network gatewaying easy to provide.
- We have machines with Datakit interfaces but no Internet interface.
- On such a machine one may type
- <DL><DT><DD><TT><PRE>
- import helix /net
- telnet tcp!ai.mit.edu
- </PRE></TT></DL>
- The
- <TT>import</TT>
- uses Datakit to pull in the TCP interface from
- <TT>helix</TT>,
- which can then be used directly; the
- <TT>tcp!</TT>
- notation is necessary because we routinely use multiple networks
- and protocols on Plan 9; it identifies the network in which
- <TT>ai.mit.edu</TT>
- is a valid name.
- </P>
- <P>
- In practice we do not use
- <TT>rlogin</TT>
- or
- <TT>telnet</TT>
- between Plan 9 machines. Instead a command called
- <TT>cpu</TT>
- in effect replaces the CPU in a window with that
- on another machine, typically a fast multiprocessor CPU server.
- The implementation is to recreate the
- name space on the remote machine, using the equivalent of
- <TT>import</TT>
- to connect pieces of the terminal's name space to that of
- the process (shell) on the CPU server, making the terminal
- a file server for the CPU.
- CPU-local devices such as fast file system connections
- are still local; only terminal-resident devices are
- imported.
- The result is unlike UNIX
- <TT>rlogin</TT>,
- which moves into a distinct name space on the remote machine,
- or file sharing with
- <TT>NFS</TT>,
- which keeps the name space the same but forces processes to execute
- locally.
- Bindings in
- <TT>/bin</TT>
- may change because of a change in CPU architecture, and
- the networks involved may be different because of differing hardware,
- but the effect feels like simply speeding up the processor in the
- current name space.
- </P>
- <H4>Position
- </H4>
- <P>
- These examples illustrate how the ideas of representing resources
- as file systems and per-process name spaces can be used to solve
- problems often left to more exotic mechanisms.
- Nonetheless there are some operations in Plan 9 that are not
- mapped into file I/O.
- An example is process creation.
- We could imagine a message to a control file in
- <TT>/proc</TT>
- that creates a process, but the details of
- constructing the environment of the new process (its open files,
- name space, memory image, etc.) are too intricate to
- be described easily in a simple I/O operation.
- Therefore new processes on Plan 9 are created by fairly conventional
- <TT>rfork</TT>
- and
- <TT>exec</TT>
- system calls;
- <TT>/proc</TT>
- is used only to represent and control existing processes.
- </P>
- <P>
- Plan 9 does not attempt to map network name spaces into the file
- system name space, for several reasons.
- The different addressing rules for various networks and protocols
- cannot be mapped uniformly into a hierarchical file name space.
- Even if they could be,
- the various mechanisms to authenticate,
- select a service,
- and control the connection would not map consistently into
- operations on a file.
- </P>
- <P>
- Shared memory is another resource not adequately represented by a
- file name space.
- Plan 9 takes care to provide mechanisms
- to allow groups of local processes to share and map memory.
- Memory is controlled
- by system calls rather than special files, however,
- since a representation in the file system would imply that memory could
- be imported from remote machines.
- </P>
- <P>
- Despite these limitations, file systems and name spaces offer an effective
- model around which to build a distributed system.
- Used well, they can provide a uniform, familiar, transparent
- interface to a diverse set of distributed resources.
- They carry well-understood properties of access, protection,
- and naming.
- The integration of devices into the hierarchical file system
- was the best idea in UNIX.
- Plan 9 pushes the concepts much further and shows that
- file systems, when used inventively, have plenty of scope
- for productive research.
- </P>
- <H4>References
- </H4>
- <br> <br>
- [Killian] T. Killian, ``Processes as Files'', USENIX Summer Conf. Proc., Salt Lake City, 1984
- <br>
- [Needham] R. Needham, ``Names'', in
- Distributed systems,
- S. Mullender, ed.,
- Addison Wesley, 1989
- <br>
- [Pike90] R. Pike, D. Presotto, K. Thompson, H. Trickey,
- ``Plan 9 from Bell Labs'',
- UKUUG Proc. of the Summer 1990 Conf.,
- London, England,
- 1990
- <br>
- [Presotto] D. Presotto, ``Multiprocessor Streams for Plan 9'',
- UKUUG Proc. of the Summer 1990 Conf.,
- London, England,
- 1990
- <br>
- [Pike91] R. Pike, ``8½, the Plan 9 Window System'', USENIX Summer
- Conf. Proc., Nashville, 1991
- <br> <br>
- <A href=http://www.lucent.com/copyright.html>
- Copyright</A> © 2004 Lucent Technologies Inc. All rights reserved.
- </body></html>