123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539540541542543544545546547548549550551552553554555556557558559560561562563564565566567568569570571572573574575576577578579580581582583584585586587588589590591592593594595596597598599600601602603604605606607608609610611612613614615616617618619620621622623624625626627628629630631632633634635636637638639640641642643644645646647648649650651652653654655656657658659660661662663664665666667668669670671672673674675676677678679680681682683684685686687688689690691692693694695696697698699700701702703704705706707708709710711712713714715716717718719720721722723724725726727728729730731732733734735736737 |
- .ds TM \u\s-2TM\s+2\d
- .nr dT 6
- .nr XT 6
- .TL
- The Styx Architecture for Distributed Systems
- .AU
- Rob Pike
- Dennis M. Ritchie
- .AI
- Computing Science Research Center
- Lucent Technologies, Bell Labs
- Murray Hill, New Jersey
- USA
- .FS
- .FA
- Originally appeared in
- .I "Bell Labs Technical Journal" ,
- Vol. 4,
- No. 2,
- April-June 1999,
- pp. 146-152.
- .br
- Copyright © 1999 Lucent Technologies Inc. All rights reserved.
- .FE
- .AB
- A distributed system is constructed from a set of relatively
- independent components that form a unified, but geographically and
- functionally diverse entity. Examples include networked operating
- systems, Internet services, the national telephone
- switching system, and in general
- all the technology using today's diverse digital
- networks. Nevertheless, distributed systems remain difficult
- to design, build, and maintain, primarily because of the lack
- of a clean, perspicuous interconnection model for the
- components.
- .LP
- Our experience with two distributed operating systems,
- Plan 9 and Inferno, encourages us to propose such a model.
- These systems depend on, advocate, and generally push to the
- limit a fruitful idea: to present their
- resources as files in a hierarchical name space.
- The objects appearing as files may represent stored data, but may
- also be devices, dynamic information sources, interfaces to services,
- and control points. The approach unifies and provides basic naming,
- structuring, and access control mechanisms for all system resources.
- A simple underlying network protocol, Styx, forms
- the core of the architecture by presenting a common
- language for communication within the system.
- .LP
- Even within non-distributed systems, the presentation of services
- as files advantageously extends a familiar scheme for naming, classifying,
- and connecting to system resources.
- More important, the approach provides a natural way to build
- distributed systems, by using well-known technology for attaching
- remote file systems.
- If resources are represented as files,
- and there are remote file systems, one has
- a distributed system: resources available in one place
- are usable from another.
- .AE
- .SH
- Introduction
- .LP
- The Styx protocol is a variant of a protocol called
- .I 9P
- that
- was developed for the Plan 9 operating system[9man].
- For simplicity, we will use the name
- Styx throughout this paper; the difference concerns only the initialization of
- a connection.
- .LP
- The original idea behind Styx was to encode file operations between
- client programs and the file system,
- to be translated into messages for transmission on a computer network.
- Using this technology,
- Plan 9 separates the file server\(ema central repository for
- permanent file storage\(emboth from the CPU server\(ema large
- shared-memory multiprocessor\(emand from the user terminals.
- This physical separation of function was central to the original
- design of the system;
- what was unexpected was how well the model could be used to
- solve a wide variety of problems not usually thought of as
- file system issues.
- .LP
- The breakthrough was to realize that by representing
- a computing resource as a form of file system,
- many of the difficulties of making that resource available
- across the network would disappear naturally, because
- Styx could export the resource transparently.
- For example,
- the Plan 9 window system,
- .CW 8½
- [Pike91],
- is implemented as a dynamic file server that publishes
- files with names like
- .CW /dev/mouse
- and
- .CW /dev/screen
- to provide access to the local hardware.
- The
- .CW /dev/mouse
- file, for instance,
- may be opened and read like a regular file, in the manner of UNIX\*(TM device
- files, but under
- .CW 8½
- it is multiplexed: each client program has a private
- .CW /dev/mouse
- file that returns mouse events only when the client's window
- is the active one on the display.
- This design provides a clean, simple mechanism for controlling
- access to the mouse.
- Its real strength, though, is that the representation of the window system's
- resources as files allows Styx to make those resources available across the
- network.
- For example, an interactive graphics program may be run on a CPU server
- simply by having
- .CW 8½
- serve the appropriate files to that machine.
- .LP
- Note that although the resources published by Styx behave like files\(emthey
- have file names, file permissions, and file access methods\(emthey do not
- need to exist as standard files on disk.
- The
- .CW /dev/mouse
- file is accessed by standard file I/O mechanisms but is nonetheless a
- transient object fabricated dynamically by a running program;
- it has no permanent existence.
- .LP
- By following this approach throughout the system, Plan 9 achieves
- a remarkable degree of transparency in the distribution of resources[PPTTW93].
- Besides interactive graphics, services such as debugging, maintenance,
- file backup, and even access to the underlying network hardware
- can be made available across the network using Styx, permitting
- the construction of distributed applications and services
- using nothing more sophisticated than file I/O.
- .SH
- The Styx protocol
- .LP
- Styx's place in the world is analogous to
- Sun NFS[RFC][NFS] or Microsoft CIFS[CIFS], although it is simpler and easier to implement
- [Welc94].
- Furthermore, NFS and CIFS are designed for sharing regular disk files; NFS in particular
- is intimately tied to the implementation and caching strategy
- of the underlying UNIX file system.
- Unlike Styx, NFS and CIFS are clumsier at exporting dynamic device-like
- files such as
- .CW /dev/mouse .
- .LP
- Styx provides a view of a hierarchical, tree-shaped
- file system name space[Nee89], together with access information about
- the files (permissions, sizes, dates) and the means to read and write
- the files.
- Its users (that is, the people who write application programs),
- don't see the protocol itself; instead they see files that they
- read and write, and that provide information or change information.
- .LP
- In use, a Styx
- .I client
- is an entity on one machine that establishes communication with
- another entity, the
- .I server ,
- on the same or another machine.
- The client mechanisms may be built into the operating system, as they
- are in Plan 9 or Inferno[INF1][INF2], or into application libraries;
- a server may be part of the operating system, or just as often
- may be application code on a separate server machine. In any case, the
- client and server entities
- communicate by exchanging messages, and the effect is that the client
- sees a hierarchical file system that exists on the server.
- The Styx protocol is the specification of the messages that are exchanged.
- .LP
- At one level, Styx consists of messages of 13 types for
- .RS
- .IP \(bu
- Starting communication (attaching to a file system);
- .IP \(bu
- Navigating the file system (that is, specifying and
- gaining a handle for a named file);
- .IP \(bu
- Reading and writing a file; and
- .IP \(bu
- Performing file status inquiries and changes
- .RE
- .LP
- However, application writers simply code requests to open, read, or write
- files; a library or the operating system translates the requests
- into the necessary byte sequences transmitted over a communication
- channel. The Styx protocol proper specifies the interpretation of these
- byte sequences. It fits, approximately, at the OSI Session Layer level
- of the ISO standard classification.
- Its specification is independent of most details of machine architecture
- and it has been successfully used among machines of varying instruction
- sets and data layout.
- The protocol is summarized in Table 1.
- .KF
- .TS
- center box;
- l l
- --
- lfCW l.
- Name Description
- attach Authenticate user of connection; return FID
- clone Duplicate FID
- walk Advance FID one level of name hierarchy
- open Check permissions for file I/O
- create Create new file
- read Read contents of file
- write Write contents of file
- close Discard FID
- remove Remove file
- stat Report file state: permissions, etc.
- wstat Modify file state
- error Return error condition for failed operation
- flush Disregard outstanding I/O requests
- .TE
- .ce 100
- .ps -1
- Table 1. Summary of Styx messages.
- .ps
- .ce 0
- .KE
- .LP
- In use, an operation such as
- .P1
- open("/usr/rob/.profile", O_READ);
- .P2
- is translated by the underlying system into a sequence of Styx messages.
- After establishing the initial connection to the
- file server, an
- .CW attach
- message authenticates the user (the person or agent accessing the files) and
- returns an object called a
- .CW FID
- (file ID) that represents the root of the hierarchy on the server.
- When the
- .CW open()
- operation is executed, it proceeds as follows.
- .RS
- .IP \(bu
- A
- .CW clone
- message duplicates the root
- .CW FID ,
- returning a new
- .CW FID
- that can navigate the hierarchy without losing the connection to the root.
- .IP \(bu
- The new
- .CW FID
- is then moved to the file
- .CW /usr/rob/.profile
- by a sequence of
- .CW walk
- messages that step along, one path component at a time
- .CW usr , (
- .CW rob ,
- .CW .profile ).
- .IP \(bu
- Finally, an
- .CW open
- message checks that the user has permission to read the file,
- permitting subsequent
- .CW read
- and
- .CW write
- operations (messages) on the
- .CW FID .
- .IP \(bu
- Once I/O is completed, the
- .CW close
- message will release the
- .CW FID .
- .RE
- .LP
- At a lower level, implementations of Styx depend only on a reliable,
- byte-stream Transport communications layer. For example, it runs over either
- TCP/IP, the standard transmission control protocol
- and Internet protocol,
- or Internet link (IL), which is a sequenced, reliable datagram protocol
- using IP packets.
- It is worth emphasizing, though, that the model does not require the
- existence of a network to join the components; Styx runs fine
- over a Unix pipe or even using shared memory.
- The strength of the approach is not so much how it works over a network
- as that its behavior over a network is identical to its behavior locally.
- .SH
- Architectural approach
- .LP
- Styx, as a file system protocol, is merely a component in a
- more encompassing approach
- to system design: the presentation of resources as files.
- This approach will be discussed using a sequence of examples.
- .SH
- .I "Example: networking
- .LP
- As an example, access to a TCP/IP network in Inferno and Plan 9 systems
- appears as a piece of a file system, with (abbreviated) structure
- as follows[PrWi93]:
- .P1
- /net/
- dns/
- tcp/
- clone
- stats
- 0/
- ctl
- status
- data
- listen
- 1/
- ...
- ...
- ether0/
- 0/
- ctl
- status
- ...
- 1/
- ...
- ...
- .P2
- This represents a file system structure in which one can name, read, and write `files' with
- names like
- .CW /net/dns ,
- .CW /net/tcp/clone ,
- .CW /net/tcp/0/ctl
- and so on;
- there are directories of files
- .CW /net/tcp
- and
- .CW /net/ether0 .
- On the machine that actually has the network interface, all of these
- things that look like files are constructed by the kernel drivers that maintain
- the TCP/IP stack; they are not real files on a disk.
- Operations on the `files' turn into operations sent to the device drivers.
- .LP
- Suppose an application wishes to establish a connection over TCP/IP to
- .CW www.bell-labs.com .
- The first task is to translate the domain name
- .CW www.bell-labs.com
- to a numerical internet address; this is a complicated process, generally
- involving communicating with local and remote Domain Name Servers.
- In the Styx model, this is done by opening the file
- .CW /dev/dns
- and writing the literal string
- .CW www.bell-labs.com
- on the file; then the same file is read.
- It will return the string
- .CW 204.178.16.5
- as a sequence of 12 characters.
- .LP
- Once the numerical Internet address is acquired, the connection must be established;
- this is done by opening
- .CW /net/tcp/clone
- and reading from it a string that specifies a directory like
- .CW /net/tcp/43 ,
- which represents a new, unique TCP/IP channel.
- To establish the connection,
- write a message like
- .CW "connect 204.178.16.5
- on the control file for that connection,
- .CW /net/tcp/43/ctl .
- Subsequently, communication with
- .CW www.bell-labs.com
- is done by reading and
- writing on the file
- .CW /net/tcp/43/data .
- .LP
- There are several things to note about this approach.
- .RS
- .IP \(bu
- All the interface points look like files, and are
- accessed by the same I/O mechanisms already available in
- programming languages like C, C++, or Java. However, they do not
- correspond to ordinary data files on disk, but instead are creations
- of a middleware code layer.
- .IP \(bu
- Communication across the interface, by convention, uses printable character strings where
- feasible instead of binary information. This means that the syntax
- of communication does not depend on CPU architecture or language details.
- .IP \(bu
- Because the interface, as in this example with
- .CW /net
- as the interface with networking facilities, looks like a piece of a
- hierarchical file system, it can easily and nearly automatically
- be exported to a remote machine and used from afar.
- .RE
- .LP
- In particular, the Styx implementation encourages a natural way of providing
- controlled access to networks.
- Lucent, like many organizations, has an internal network not
- accessible to the international Internet, and has a few
- gateways between the inside and outside networks.
- Only the gateway machines are connected to both, and they implement
- the administrative controls for safety and security.
- The advantage of the Styx model is the ease with which
- the outside Internet can be used from inside.
- If the
- .CW /net
- file tree described above is provided on a gateway machine,
- it can be used as a remote file system from machines on the
- inside. This is safe, because this connection is one-way:
- inside machines can see the external network interfaces,
- but outside machines cannot see the inside.
- .SH
- .I "Example: debugging
- .LP
- A similar approach, borrowed and generalized from the UNIX
- system [Kill], is useful for controlling and discovering the status
- of the running processes in the operating system.
- Here a directory
- .CW /proc
- contains a subdirectory for each process running on the
- system; the names of the subdirectories correspond to
- process IDs:
- .P1
- /proc/
- 1/
- status
- ctl
- fd
- text
- mem
- ...
- 2/
- status
- ctl
- ...
- ...
- .P2
- The file names in the process directories refer to various aspects
- of the corresponding process:
- .CW status
- contains information about the state of the process;
- .CW ctl ,
- when written, performs operations like pausing, restarting,
- or killing the process;
- .CW fd
- names and describes the files open in the process;
- .CW text
- and
- .CW mem
- represent the program code and the data respectively.
- .LP
- Where possible, the information and control are again
- represented as text strings. For example, one line
- from the
- .CW status
- file of a typical process might be
- .DS
- .CW "samterm dmr Read 0 20 2478910 0 0 ...
- .DE
- which shows the name of the program, the owner, its state, and several numbers
- representing CPU time in various categories.
- .LP
- Once again, the approach provides several payoffs.
- Because process information is represented in file form,
- remote debugging (debugging programs on another machine)
- is possible immediately by remote-mounting the
- .CW /proc
- tree on another machine.
- The machine-independent representation of information means
- that most operations work properly even if the remote machine
- uses a different CPU architecture from the one doing the
- debugging.
- Most of the programs that deal
- with status and control contain no machine-dependent parts
- and are completely portable.
- (A few are not, however: no attempt is made to render the
- memory data or instructions in machine-independent form.)
- .SH
- .I "Example: PathStar\*(TM Access Server
- .LP
- The data shelf of Lucent's PathStar Access Server[PATH] uses Styx to connect
- the line cards and other devices on the shelf to the control computer.
- In fact, Styx is the protocol for high-level communication on the backplane.
- .LP
- The file system hierarchy served by the control computer includes a structure
- like this:
- .P1
- /trip/
- config
- admin/
- ospfctl
- ...
- boot/
- 0/
- ctl
- eeprom
- memory
- msg
- pack
- alarm
- ...
- 1/
- ...
- /net/
- ...
- .P2
- The directories under
- .CW /net
- are similar to those in Plan 9 or Inferno; they form the interface to the
- external IP network.
- The
- .CW /trip
- hierarchy represents the control structure of the shelf.
- .LP
- The subdirectories under
- .CW /trip/boot
- each provide access to one of the line cards or other devices in the shelf.
- For example, to initialize a card one writes the text string
- .CW reset
- to the
- .CW ctl
- file of the card, while bootstrapping is done by copying the control
- software for the card into the
- .CW memory
- file and writing a
- .CW reset
- message to
- .CW ctl .
- Once the line card is running,
- the other files present an interface to the higher-level structure of the device:
- .CW pack
- is the port through which IP packets are transferred to and from the card,
- .CW alarm
- may be read to discover outstanding conditions on the card, and so on.
- .LP
- All this structure is exported from the shelf using Styx.
- The external element management software (EMS) controls and monitors the
- shelf using Styx operations.
- For example, the EMS may read
- .CW /trip/boot/7/alarm
- and discover a diagnostic condition.
- By reading and writing the other files under
- .CW /trip/boot/7/ ,
- the card may be taken off line, diagnosed, and perhaps reset or substituted,
- all from the system running the EMS, which may be elsewhere in the network.
- .LP
- Another example is the implementation of SNMP in the PathStar Access Server.
- The functionality of SNMP is usually distributed through the various components
- of a network, but here it is a straightforward adaption process,
- running anywhere in the network, that translates SNMP requests to Styx
- operations in the network element.
- Besides dramatically simplifying the implementation, the natural
- ability for aggregation permits
- a single process to provide SNMP access to an arbitrarily complex network subsystem.
- Yet the structure is secure: the file-oriented nature of the operations make it
- easy to establish standard authentication and security controls to guarantee
- that only trusted parties have access to the SNMP operations.
- .LP
- There are local benefits to this architecture, as well.
- Styx provides a single point in the design where control can be separated
- from the details of the underlying fabric, isolating both from changes in the
- other. Components become more adaptable: software can be upgraded
- without worrying about hidden dependencies on the hardware,
- and new hardware may be installed without updating the control
- software above.
- .SH
- Security issues
- .LP
- Styx provides several security mechanisms for
- discouraging hostile or accidental actions that injure the integrity
- of a system.
- .LP
- The underlying file-communication protocol includes
- user and group identifiers that a server may check against
- other authentication.
- For example, a server may check, on a request to open a file,
- that the user ID associated with the request is permitted to
- perform the operation.
- This mechanism is familiar from general-purpose operating
- systems, and its use is well-known.
- It depends on passwords or stronger mechanisms for authenticating
- the identity of clients.
- .LP
- The Styx approach of providing remote resources
- as file systems over a network encourages genuinely secure access
- to the resources in a way transparent to applications, so that
- authentication transactions need not be provided as part of each.
- For example, in Inferno, the negotiation of an initial connection
- between client and server may include installation of any of
- several encrypting or message-digesting protocols that
- supervise the channel.
- All application use of the resources provided by the server
- is then protected against interference, and the server
- has strong assurance that its facilities are being used in
- an authorized way.
- This is relevant both for general-purpose file servers,
- and, in the telephony field, is especially useful for safe
- remote administration.
- .SH
- Summary
- .LP
- Presentation of resources as a piece of a possibly remote file system
- is an attractive way of creating distributed systems that treads a
- path between two extremes:
- .IP 1
- All communication with other parts of the system is by
- explicit messages sent between components.
- This communication differs in style from applications' use
- of local resources.
- .IP 2
- All communication is by means of
- closely shared resources: the CPU-addressable memory in
- various parts is made directly available across a big network;
- applications can read and write far-away objects exactly as
- they do those on the same motherboard as their own CPU.
- .LP
- Something like the first of these extremes is usually more evident
- in today's systems, although either the operating system or software
- layered upon it usually paper over some of the rough spots.
- The second remains more difficult to approach, because
- networks (especially big ones like the Internet) are not very
- reliable, and because
- the machines on them are diverse in processor architecture
- and in installed software.
- .LP
- The design plan described and advocated in this paper
- lies between the two extremes.
- It has these advantages:
- .IP \(bu
- .I "A simple, familiar programming model for reading and writing named files" .
- File systems have well-defined naming, access, and permissions structures.
- .IP \(bu
- .I "Platform and language independence" .
- Underlying access to resources is
- at the file level, which is provided nearly everywhere, instead
- of depending on facilities available only with particular languages
- or operating systems.
- C++ or Java classes, and C libraries can be constructed
- to access the facilities.
- .IP \(bu
- .I "A hierarchical naming and access control structure" .
- This encourages clean
- and well-structured design of resource naming and access.
- .IP \(bu
- .I "Easy testing and debugging" .
- By using well-specified, narrow interfaces
- at the file level, it is straightforward to observe the communication
- between distributed entities.
- .IP \(bu
- .I "Low cost" .
- Support software, at both client and server,
- can be written in a few thousand lines
- of code, and will occupy only small space in products.
- .LP
- This approach to building systems is successful in the general-purpose
- systems Plan 9 and Inferno;
- it has also been used to construct systems specialized for telephony, such
- as Mantra[MAN] and the PathStar Access Server.
- It supplies a coherent, extensible structure both to the internal communications
- within a single system and external communication between heterogeneous
- components of a large digital network.
- .LP
- .SH
- References
- .nr PS -1
- .nr VS -1
- .IP [NFS] 11
- R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, and
- B. Lyon,
- ``Design and Implementation of the Sun Network File System'',
- .I "Proc. Summer 1985 USENIX Conf." ,
- Portland, Oregon, June 1985,
- pp. 119-130.
- .IP [RFC] 11
- Internet RFC 1094.
- .IP [9man] 11
- .I "Plan 9 Programmer's Manual" ,
- Second Edition,
- Vol. 1 and 2,
- Bell Laboratories,
- Murray Hill, N.J.,
- 1995.
- .IP [Kill84] 11
- T. J. Killian,
- ``Processes as Files'',
- .I "Proc. Summer 1984 USENIX Conf." ,
- June 1984, Salt Lake City, Utah, June 1984, pp. 203-207.
- .IP [Pike91] 11
- R. Pike,
- ``8½, the Plan 9 Window System'',
- .I "Proc. Summer 1991 USENIX Conf." ,
- Nashville TN, June 1991, pp. 257-265.
- .IP "[PPTTW93] " 11
- R. Pike, D.L. Presotto, K. Thompson, H. Trickey, and P. Winterbottom, ``The Use of Name Spaces in Plan 9'',
- .I "Op. Sys. Rev." ,
- Vol. 27, No. 2, April 1993, pp. 72-76.
- .IP [PrWi93] 11
- D. L. Presotto and P. Winterbottom,
- ``The Organization of Networks in Plan 9'',
- .I "Proc. Winter 1993 USENIX Conf." ,
- San Diego, Calif., Jan. 1993, pp. 43-50.
- .IP [Nee89] 11
- R. Needham, ``Names'', in
- .I "Distributed systems" ,
- edited by S. Mullender,
- Addison-Wesley,
- Reading, Mass., 1989, pp. 89-101.
- .IP [CIFS]
- Paul Leach and Dan Perry, ``CIFS: A Common Internet File System'', Nov. 1996,
- .I "http://www.microsoft.com/mind/1196/cifs.htm" .
- .IP [INF1]
- .I "Inferno Programmer's Manual",
- Third Edition,
- Vol. 1 and 2, Vita Nuova Holdings Limited, York, England, 2000.
- .IP [INF2]
- S.M. Dorward, R. Pike, D. L. Presotto, D. M. Ritchie, H. Trickey,
- and P. Winterbottom, ``The Inferno Operating System'',
- .I "Bell Labs Technical Journal"
- Vol. 2,
- No. 1,
- Winter 1997.
- .IP [MAN]
- R. A. Lakshmi-Ratan,
- ``The Lucent Technologies Softswitch\-Realizing the Promise of Convergence'',
- .I "Bell Labs Technical Journal" ,
- Vol. 4,
- No. 2,
- April-June 1999,
- pp. 174-196.
- .IP [PATH]
- J. M. Fossaceca, J. D. Sandoz, and P. Winterbottom,
- ``The PathStar Access Server: Facilitating Carrier-Scale Packet Telephony'',
- .I "Bell Labs Technical Journal" ,
- Vol. 3,
- No. 4,
- October-December 1998,
- pp. 86-102.
- .IP [Welc94]
- B. Welch,
- ``A Comparison of Three Distributed File System Architectures: Vnode, Sprite, and Plan 9'',
- .I "Computing Systems" ,
- Vol. 7, No. 2, pp. 175-199 (1994).
- .nr PS +1
- .nr VS +1
|