- <html>
- <title>
- The Organization of Networks in Plan 9
- </title>
- <body BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#330088" ALINK="#FF0044">
- <H1>The Organization of Networks in Plan 9
- </H1>
- <DL><DD><I>Dave Presotto<br>
- Phil Winterbottom<br>
- <br> <br>
- presotto,philw@plan9.bell-labs.com<br>
- </I></DL>
- <DL><DD><H4>ABSTRACT</H4>
- <DL>
- <DT><DT> <DD>
- NOTE:<I> Originally appeared in
- Proc. of the Winter 1993 USENIX Conf.,
- pp. 271-280,
- San Diego, CA
- </I><DT> <DD></dl>
- <br>
- In a distributed system networks are of paramount importance. This
- paper describes the implementation, design philosophy, and organization
- of network support in Plan 9. Topics include network requirements
- for distributed systems, our kernel implementation, network naming, user interfaces,
- and performance. We also observe that much of this organization is relevant to
- current systems.
- </DL>
- <H4>1 Introduction
- </H4>
- <P>
- Plan 9 [Pike90] is a general-purpose, multi-user, portable distributed system
- implemented on a variety of computers and networks.
- What distinguishes Plan 9 is its organization.
- The goals of this organization were to
- reduce administration
- and to promote resource sharing. One of the keys to its success as a distributed
- system is the organization and management of its networks.
- </P>
- <P>
- A Plan 9 system comprises file servers, CPU servers and terminals.
- The file servers and CPU servers are typically centrally
- located multiprocessor machines with large memories and
- high speed interconnects.
- A variety of workstation-class machines
- serve as terminals
- connected to the central servers using several networks and protocols.
- The architecture of the system demands a hierarchy of network
- speeds matching the needs of the components.
- Connections between file servers and CPU servers are high-bandwidth point-to-point
- fiber links.
- Connections from the servers fan out to local terminals
- using medium speed networks
- such as Ethernet [Met80] and Datakit [Fra80].
- Low speed connections via the Internet and
- the AT&T backbone serve users in Oregon and Illinois.
- Basic Rate ISDN data service and 9600 baud serial lines provide slow
- links to users at home.
- </P>
- <P>
- Since CPU servers and terminals use the same kernel,
- users may choose to run programs locally on
- their terminals or remotely on CPU servers.
- The organization of Plan 9 hides the details of system connectivity
- allowing both users and administrators to configure their environment
- to be as distributed or centralized as they wish.
- Simple commands support the
- construction of a locally represented name space
- spanning many machines and networks.
- At work, users tend to use their terminals like workstations,
- running interactive programs locally and
- reserving the CPU servers for data or compute intensive jobs
- such as compiling and computing chess endgames.
- At home or when connected over
- a slow network, users tend to do most work on the CPU server to minimize
- traffic on the slow links.
- The goal of the network organization is to provide the same
- environment to the user wherever resources are used.
- </P>
- <H4>2 Kernel Network Support
- </H4>
- <P>
- Networks play a central role in any distributed system. This is particularly
- true in Plan 9 where most resources are provided by servers external to the kernel.
- The importance of the networking code within the kernel
- is reflected by its size;
- of 25,000 lines of kernel code, 12,500 are network and protocol related.
- Networks are continually being added and the fraction of code
- devoted to communications
- is growing.
- Moreover, the network code is complex.
- Protocol implementations consist almost entirely of
- synchronization and dynamic memory management, areas demanding
- subtle error recovery
- strategies.
- The kernel currently supports Datakit, point-to-point fiber links,
- an Internet (IP) protocol suite and ISDN data service.
- The variety of networks and machines
- has raised issues not addressed by other systems running on commercial
- hardware supporting only Ethernet or FDDI.
- </P>
- <H4>2.1 The File System protocol
- </H4>
- <P>
- A central idea in Plan 9 is the representation of a resource as a hierarchical
- file system.
- Each process assembles a view of the system by building a
- <I>name space</I>
- [Needham] connecting its resources.
- File systems need not represent disc files; in fact, most Plan 9 file systems have no
- permanent storage.
- A typical file system dynamically represents
- some resource like a set of network connections or the process table.
- Communication between the kernel, device drivers, and local or remote file servers uses a
- protocol called 9P. The protocol consists of 17 messages
- describing operations on files and directories.
- Kernel resident device and protocol drivers use a procedural version
- of the protocol while external file servers use an RPC form.
- Nearly all traffic between Plan 9 systems consists
- of 9P messages.
- 9P relies on several properties of the underlying transport protocol.
- It assumes messages arrive reliably and in sequence and
- that delimiters between messages
- are preserved.
- When a protocol does not meet these
- requirements (for example, TCP does not preserve delimiters)
- we provide mechanisms to marshal messages before handing them
- to the system.
- </P>
- <P>
- A kernel data structure, the
- <I>channel</I>,
- is a handle to a file server.
- Operations on a channel generate the following 9P messages.
- The
- <TT>session</TT>
- and
- <TT>attach</TT>
- messages authenticate a connection, established by means external to 9P,
- and validate its user.
- The result is an authenticated
- channel
- referencing the root of the
- server.
- The
- <TT>clone</TT>
- message makes a new channel identical to an existing channel, much like
- the
- <TT>dup</TT>
- system call.
- A
- channel
- may be moved to a file on the server using a
- <TT>walk</TT>
- message to descend each level in the hierarchy.
- The
- <TT>stat</TT>
- and
- <TT>wstat</TT>
- messages read and write the attributes of the file referenced by a channel.
- The
- <TT>open</TT>
- message prepares a channel for subsequent
- <TT>read</TT>
- and
- <TT>write</TT>
- messages to access the contents of the file.
- <TT>Create</TT>
- and
- <TT>remove</TT>
- perform the actions implied by their names on the file
- referenced by the channel.
- The
- <TT>clunk</TT>
- message discards a channel without affecting the file.
- </P>
- <P>
- A kernel resident file server called the
- <I>mount driver</I>
- converts the procedural version of 9P into RPCs.
- The
- <I>mount</I>
- system call provides a file descriptor, which can be
- a pipe to a user process or a network connection to a remote machine, to
- be associated with the mount point.
- After a mount, operations
- on the file tree below the mount point are sent as messages to the file server.
- The
- mount
- driver manages buffers, packs and unpacks parameters from
- messages, and demultiplexes among processes using the file server.
- </P>
- <H4>2.2 Kernel Organization
- </H4>
- <P>
- The network code in the kernel is divided into three layers: hardware interface,
- protocol processing, and program interface.
- A device driver typically uses streams to connect the two interface layers.
- Additional stream modules may be pushed on
- a device to process protocols.
- Each device driver is a kernel-resident file system.
- Simple device drivers serve a single level
- directory containing just a few files;
- for example, we represent each UART
- by a data and a control file.
- <DL><DT><DD><TT><PRE>
- cpu% cd /dev
- cpu% ls -l eia*
- --rw-rw-rw- t 0 bootes bootes 0 Jul 16 17:28 eia1
- --rw-rw-rw- t 0 bootes bootes 0 Jul 16 17:28 eia1ctl
- --rw-rw-rw- t 0 bootes bootes 0 Jul 16 17:28 eia2
- --rw-rw-rw- t 0 bootes bootes 0 Jul 16 17:28 eia2ctl
- cpu%
- </PRE></TT></DL>
- The control file is used to control the device;
- writing the string
- <TT>b1200</TT>
- to
- <TT>/dev/eia1ctl</TT>
- sets the line to 1200 baud.
- </P>
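- <P>
- Because the interface is an ordinary file, a program can perform the same configuration
- with a simple write.
- The following C sketch repeats the baud-rate example above; it is a minimal illustration
- using the file shown in the listing, with only rudimentary error handling.
- <DL><DT><DD><TT><PRE>
- #include <u.h>
- #include <libc.h>
- /* set /dev/eia1 to 1200 baud by writing its control file */
- void
- setbaud(void)
- {
- 	int cfd;
- 	cfd = open("/dev/eia1ctl", OWRITE);
- 	if(cfd < 0){
- 		fprint(2, "can't open /dev/eia1ctl: %r\n");
- 		exits("open");
- 	}
- 	if(write(cfd, "b1200", 5) != 5)
- 		fprint(2, "can't set baud rate: %r\n");
- 	close(cfd);
- }
- </PRE></TT></DL>
- </P>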
- <P>
- Multiplexed devices present
- a more complex interface structure.
- For example, the LANCE Ethernet driver
- serves a two level file tree (Figure 1)
- providing
- </P>
- <DL COMPACT>
- <DT>*<DD>
- device control and configuration
- <DT>*<DD>
- user-level protocols like ARP
- <DT>*<DD>
- diagnostic interfaces for snooping software.
- </dl>
- <br> <br>
- The top directory contains a
- <TT>clone</TT>
- file and a directory for each connection, numbered
- <TT>1</TT>
- to
- <TT>n</TT>.
- Each connection directory corresponds to an Ethernet packet type.
- Opening the
- <TT>clone</TT>
- file finds an unused connection directory
- and opens its
- <TT>ctl</TT>
- file.
- Reading the control file returns the ASCII connection number; the user
- process can use this value to construct the name of the proper
- connection directory.
- In each connection directory files named
- <TT>ctl</TT>,
- <TT>data</TT>,
- <TT>stats</TT>,
- and
- <TT>type</TT>
- provide access to the connection.
- Writing the string
- <TT>connect 2048</TT>
- to the
- <TT>ctl</TT>
- file sets the packet type to 2048
- and
- configures the connection to receive
- all IP packets sent to the machine.
- Subsequent reads of the file
- <TT>type</TT>
- yield the string
- <TT>2048</TT>.
- The
- <TT>data</TT>
- file accesses the media;
- reading it
- returns the
- next packet of the selected type.
- Writing the file
- queues a packet for transmission after
- appending a packet header containing the source address and packet type.
- The
- <TT>stats</TT>
- file returns ASCII text containing the interface address,
- packet input/output counts, error statistics, and general information
- about the state of the interface.
- <DL><DT><DD><TT><PRE>
- <br><img src="data.7580.gif"><br>
- </PRE></TT></DL>
- If several connections on an interface
- are configured for a particular packet type, each receives a
- copy of the incoming packets.
- The special packet type
- <TT>-1</TT>
- selects all packets.
- Writing the strings
- <TT>promiscuous</TT>
- and
- <TT>connect</TT>
- <TT>-1</TT>
- to the
- <TT>ctl</TT>
- file
- configures a conversation to receive all packets on the Ethernet.
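- <P>
- The sequence just described is easy to follow in code.
- The following C sketch opens the Ethernet clone file, reads back the connection number,
- selects IP packets with a <TT>connect 2048</TT> write, and reads one packet from the
- <TT>data</TT> file.
- It is a sketch only: error paths simply return, the conversation is never released, and
- the driver is assumed to be bound at <TT>/net/ether</TT> as in the listing of section 6.1.
- <DL><DT><DD><TT><PRE>
- #include <u.h>
- #include <libc.h>
- /* open an Ethernet conversation for IP packets (type 2048) and read one packet */
- int
- ethread(uchar *pkt, int max)
- {
- 	int cfd, dfd, n;
- 	char num[16], path[64];
- 	/* opening the clone file reserves a connection and opens its ctl file */
- 	cfd = open("/net/ether/clone", ORDWR);
- 	if(cfd < 0)
- 		return -1;
- 	/* reading ctl yields the ASCII connection number */
- 	n = read(cfd, num, sizeof(num)-1);
- 	if(n <= 0)
- 		return -1;
- 	num[n] = 0;
- 	/* select the packet type for this conversation */
- 	if(write(cfd, "connect 2048", 12) != 12)
- 		return -1;
- 	/* the data file carries the packets themselves */
- 	snprint(path, sizeof(path), "/net/ether/%d/data", atoi(num));
- 	dfd = open(path, ORDWR);
- 	if(dfd < 0)
- 		return -1;
- 	return read(dfd, pkt, max);
- }
- </PRE></TT></DL>
- </P>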
- <P>
- Although the driver interface may seem elaborate,
- the representation of a device as a set of files using ASCII strings for
- communication has several advantages.
- Any mechanism supporting remote access to files immediately
- allows a remote machine to use our interfaces as gateways.
- Using ASCII strings to control the interface avoids byte order problems, ensures
- a uniform representation for devices on the same machine, and allows devices
- to be accessed remotely.
- Representing dissimilar devices by the same set of files allows common tools
- to serve
- several networks or interfaces.
- Programs like
- <TT>stty</TT>
- are replaced by
- <TT>echo</TT>
- and shell redirection.
- </P>
- <H4>2.3 Protocol devices
- </H4>
- <P>
- Network connections are represented as pseudo-devices called protocol devices.
- Protocol device drivers exist for the Datakit URP protocol and for each of the
- Internet IP protocols TCP, UDP, and IL.
- IL, described below, is a new communication protocol used by Plan 9 for
- transmitting file system RPCs.
- All protocol devices look identical so user programs contain no
- network-specific code.
- </P>
- <P>
- Each protocol device driver serves a directory structure
- similar to that of the Ethernet driver.
- The top directory contains a
- <TT>clone</TT>
- file and a directory for each connection numbered
- <TT>0</TT>
- to
- <TT>n</TT>.
- Each connection directory contains files to control one
- connection and to send and receive information.
- A TCP connection directory looks like this:
- <DL><DT><DD><TT><PRE>
- cpu% cd /net/tcp/2
- cpu% ls -l
- --rw-rw---- I 0 ehg bootes 0 Jul 13 21:14 ctl
- --rw-rw---- I 0 ehg bootes 0 Jul 13 21:14 data
- --rw-rw---- I 0 ehg bootes 0 Jul 13 21:14 listen
- --r--r--r-- I 0 bootes bootes 0 Jul 13 21:14 local
- --r--r--r-- I 0 bootes bootes 0 Jul 13 21:14 remote
- --r--r--r-- I 0 bootes bootes 0 Jul 13 21:14 status
- cpu% cat local remote status
- 135.104.9.31 5012
- 135.104.53.11 564
- tcp/2 1 Established connect
- cpu%
- </PRE></TT></DL>
- The files
- <TT>local</TT>,
- <TT>remote</TT>,
- and
- <TT>status</TT>
- supply information about the state of the connection.
- The
- <TT>data</TT>
- and
- <TT>ctl</TT>
- files
- provide access to the process end of the stream implementing the protocol.
- The
- <TT>listen</TT>
- file is used to accept incoming calls from the network.
- </P>
- <P>
- The following steps establish a connection.
- </P>
- <DL COMPACT>
- <DT>1)<DD>
- The clone device of the
- appropriate protocol directory is opened to reserve an unused connection.
- <DT>2)<DD>
- The file descriptor returned by the open points to the
- <TT>ctl</TT>
- file of the new connection.
- Reading that file descriptor returns an ASCII string containing
- the connection number.
- <DT>3)<DD>
- A protocol/network specific ASCII address string is written to the
- <TT>ctl</TT>
- file.
- <DT>4)<DD>
- The path of the
- <TT>data</TT>
- file is constructed using the connection number.
- When the
- <TT>data</TT>
- file is opened the connection is established.
- </dl>
- <br> <br>
- A process can read and write this file descriptor
- to send and receive messages from the network.
- If the process opens the
- <TT>listen</TT>
- file it blocks until an incoming call is received.
- An address string written to the
- <TT>ctl</TT>
- file before the listen selects the
- ports or services the process is prepared to accept.
- When an incoming call is received, the open completes
- and returns a file descriptor
- pointing to the
- <TT>ctl</TT>
- file of the new connection.
- Reading the
- <TT>ctl</TT>
- file yields a connection number used to construct the path of the
- <TT>data</TT>
- file.
- A connection remains established while any of the files in the connection directory
- are referenced or until a close is received from the network.
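- <P>
- The four steps above translate directly into C.
- The following sketch makes a TCP connection by hand; the
- <TT>connect</TT>
- address string is only an example, and the library routines of section 5 normally
- hide these details.
- The connection lasts only as long as one of its files remains open.
- <DL><DT><DD><TT><PRE>
- #include <u.h>
- #include <libc.h>
- /* connect by hand: clone, read the connection number, write an address, open data */
- int
- tcpconnect(char *addr)	/* e.g. "connect 135.104.53.11!564" */
- {
- 	int cfd, n;
- 	char num[16], path[64];
- 	cfd = open("/net/tcp/clone", ORDWR);	/* 1) reserve an unused connection */
- 	if(cfd < 0)
- 		return -1;
- 	n = read(cfd, num, sizeof(num)-1);	/* 2) read the ASCII connection number */
- 	if(n <= 0)
- 		return -1;
- 	num[n] = 0;
- 	if(write(cfd, addr, strlen(addr)) < 0)	/* 3) write the protocol-specific address */
- 		return -1;
- 	snprint(path, sizeof(path), "/net/tcp/%d/data", atoi(num));
- 	return open(path, ORDWR);	/* 4) opening data establishes the connection */
- }
- </PRE></TT></DL>
- </P>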
- <H4>2.4 Streams
- </H4>
- <P>
- A
- <I>stream</I>
- [Rit84a][Presotto] is a bidirectional channel connecting a
- physical or pseudo-device to user processes.
- The user processes insert and remove data at one end of the stream.
- Kernel processes acting on behalf of a device insert data at
- the other end.
- Asynchronous communications channels such as pipes,
- TCP conversations, Datakit conversations, and RS232 lines are implemented using
- streams.
- </P>
- <P>
- A stream comprises a linear list of
- <I>processing modules</I>.
- Each module has both an upstream (toward the process) and
- downstream (toward the device)
- <I>put routine</I>.
- Calling the put routine of the module on either end of the stream
- inserts data into the stream.
- Each module calls the succeeding one to send data up or down the stream.
- </P>
- <P>
- An instance of a processing module is represented by a pair of
- <I>queues</I>,
- one for each direction.
- The queues point to the put procedures and can be used
- to queue information traveling along the stream.
- Some put routines queue data locally and send it along the stream at some
- later time, either due to a subsequent call or an asynchronous
- event such as a retransmission timer or a device interrupt.
- Processing modules create helper kernel processes to
- provide a context for handling asynchronous events.
- For example, a helper kernel process awakens periodically
- to perform any necessary TCP retransmissions.
- The use of kernel processes instead of serialized run-to-completion service routines
- differs from the implementation of Unix streams.
- Unix service routines cannot
- use any blocking kernel resource, and they lack local, long-lived state.
- Helper kernel processes solve these problems and simplify the stream code.
- </P>
- <P>
- There is no implicit synchronization in our streams.
- Each processing module must ensure that concurrent processes using the stream
- are synchronized.
- This maximizes concurrency but introduces the
- possibility of deadlock.
- However, deadlocks are easily avoided by careful programming; to
- date they have not caused us problems.
- </P>
- <P>
- Information is represented by linked lists of kernel structures called
- <I>blocks</I>.
- Each block contains a type, some state flags, and pointers to
- an optional buffer.
- Block buffers can hold either data or control information, i.e., directives
- to the processing modules.
- Blocks and block buffers are dynamically allocated from kernel memory.
- </P>
- <H4>2.4.1 User Interface
- </H4>
- <P>
- A stream is represented at user level as two files,
- <TT>ctl</TT>
- and
- <TT>data</TT>.
- The actual names can be changed by the device driver using the stream,
- as we saw earlier in the example of the UART driver.
- The first process to open either file creates the stream automatically.
- The last close destroys it.
- Writing to the
- <TT>data</TT>
- file copies the data into kernel blocks
- and passes them to the downstream put routine of the first processing module.
- A write of less than 32K is guaranteed to be contained by a single block.
- Concurrent writes to the same stream are not synchronized, although the
- 32K block size assures atomic writes for most protocols.
- The last block written is flagged with a delimiter
- to alert downstream modules that care about write boundaries.
- In most cases the first put routine calls the second, the second
- calls the third, and so on until the data is output.
- As a consequence, most data is output without context switching.
- </P>
- <P>
- Reading from the
- <TT>data</TT>
- file returns data queued at the top of the stream.
- The read terminates when the read count is reached
- or when the end of a delimited block is encountered.
- A per stream read lock ensures only one process
- can read from a stream at a time and guarantees
- that the bytes read were contiguous bytes from the
- stream.
- </P>
- <P>
- Like UNIX streams [Rit84a],
- Plan 9 streams can be dynamically configured.
- The stream system intercepts and interprets
- the following control blocks:
- </P>
- <DL COMPACT>
- <DT><TT>push</TT> <I>name</I><DD>
- adds an instance of the processing module
- <I>name</I>
- to the top of the stream.
- <DT><TT>pop</TT><DD>
- removes the top module of the stream.
- <DT><TT>hangup</TT><DD>
- sends a hangup message
- up the stream from the device end.
- </dl>
- <br> <br>
- Other control blocks are module-specific and are interpreted by each
- processing module
- as they pass.
- <P>
- The convoluted syntax and semantics of the UNIX
- <TT>ioctl</TT>
- system call convinced us to leave it out of Plan 9.
- Instead,
- <TT>ioctl</TT>
- is replaced by the
- <TT>ctl</TT>
- file.
- Writing to the
- <TT>ctl</TT>
- file
- is identical to writing to a
- <TT>data</TT>
- file except the blocks are of type
- <I>control</I>.
- A processing module parses each control block it sees.
- Commands in control blocks are ASCII strings, so
- byte ordering is not an issue when one system
- controls streams in a name space implemented on another processor.
- The time to parse control blocks is not important, since control
- operations are rare.
- </P>
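- <P>
- For example, pushing a module onto a stream is just another ASCII command written to the
- control file.
- The module name in the sketch below is hypothetical; which modules are available depends
- on the kernel configuration.
- <DL><DT><DD><TT><PRE>
- #include <u.h>
- #include <libc.h>
- /* push a processing module onto the stream behind a ctl file */
- int
- pushmodule(char *ctlfile, char *module)	/* module name is illustrative only */
- {
- 	int cfd, r;
- 	char cmd[64];
- 	cfd = open(ctlfile, OWRITE);
- 	if(cfd < 0)
- 		return -1;
- 	snprint(cmd, sizeof(cmd), "push %s", module);
- 	/* the write arrives as a block of type control and is parsed as text */
- 	r = write(cfd, cmd, strlen(cmd));
- 	close(cfd);
- 	return r < 0 ? -1 : 0;
- }
- </PRE></TT></DL>
- </P>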
- <H4>2.4.2 Device Interface
- </H4>
- <P>
- The module at the downstream end of the stream is part of a device interface.
- The particulars of the interface vary with the device.
- Most device interfaces consist of an interrupt routine, an output
- put routine, and a kernel process.
- The output put routine stages data for the
- device and starts the device if it is stopped.
- The interrupt routine wakes up the kernel process whenever
- the device has input to be processed or needs more output staged.
- The kernel process puts information up the stream or stages more data for output.
- The division of labor among the different pieces varies depending on
- how much must be done at interrupt level.
- However, the interrupt routine may not allocate blocks or call
- a put routine since both actions require a process context.
- </P>
- <H4>2.4.3 Multiplexing
- </H4>
- <P>
- The conversations using a protocol device must be
- multiplexed onto a single physical wire.
- We push a multiplexer processing module
- onto the physical device stream to group the conversations.
- The device end modules on the conversations add the necessary header
- onto downstream messages and then put them to the module downstream
- of the multiplexer.
- The multiplexing module looks at each message moving up its stream and
- puts it to the correct conversation stream after stripping
- the header controlling the demultiplexing.
- </P>
- <P>
- This is similar to the Unix implementation of multiplexer streams.
- The major difference is that we have no general structure that
- corresponds to a multiplexer.
- Each attempt to produce a generalized multiplexer created a more complicated
- structure and underlined the basic difficulty of generalizing this mechanism.
- We now code each multiplexer from scratch and favor simplicity over
- generality.
- </P>
- <H4>2.4.4 Reflections
- </H4>
- <P>
- Despite five years' experience and the efforts of many programmers,
- we remain dissatisfied with the stream mechanism.
- Performance is not an issue;
- the time to process protocols and drive
- device interfaces continues to dwarf the
- time spent allocating, freeing, and moving blocks
- of data.
- However the mechanism remains inordinately
- complex.
- Much of the complexity results from our efforts
- to make streams dynamically configurable, to
- reuse processing modules on different devices
- and to provide kernel synchronization
- to ensure data structures
- don't disappear under foot.
- This is particularly irritating since we seldom use these properties.
- </P>
- <P>
- Streams remain in our kernel because we are unable to
- devise a better alternative.
- Larry Peterson's X-kernel [Pet89a]
- is the closest contender but
- doesn't offer enough advantage to switch.
- If we were to rewrite the streams code, we would probably statically
- allocate resources for a large fixed number of conversations and burn
- memory in favor of less complexity.
- </P>
- <H4>3 The IL Protocol
- </H4>
- <P>
- None of the standard IP protocols is suitable for transmission of
- 9P messages over an Ethernet or the Internet.
- TCP has a high overhead and does not preserve delimiters.
- UDP, while cheap, does not provide reliable sequenced delivery.
- Early versions of the system used a custom protocol that was
- efficient but unsatisfactory for internetwork transmission.
- When we implemented IP, TCP, and UDP we looked around for a suitable
- replacement with the following properties:
- </P>
- <DL COMPACT>
- <DT>*<DD>
- Reliable datagram service with sequenced delivery
- <DT>*<DD>
- Runs over IP
- <DT>*<DD>
- Low complexity, high performance
- <DT>*<DD>
- Adaptive timeouts
- </dl>
- <br> <br>
- None met our needs so a new protocol was designed.
- IL is a lightweight protocol designed to be encapsulated by IP.
- It is a connection-based protocol
- providing reliable transmission of sequenced messages between machines.
- No provision is made for flow control since the protocol is designed to transport RPC
- messages between client and server.
- A small outstanding message window prevents too
- many incoming messages from being buffered;
- messages outside the window are discarded
- and must be retransmitted.
- Connection setup uses a two way handshake to generate
- initial sequence numbers at each end of the connection;
- subsequent data messages increment the
- sequence numbers allowing
- the receiver to resequence out of order messages.
- In contrast to other protocols, IL does not do blind retransmission.
- If a message is lost and a timeout occurs, a query message is sent.
- The query message is a small control message containing the current
- sequence numbers as seen by the sender.
- The receiver responds to a query by retransmitting missing messages.
- This allows the protocol to behave well in congested networks,
- where blind retransmission would cause further
- congestion.
- Like TCP, IL has adaptive timeouts.
- A round-trip timer is used
- to calculate acknowledge and retransmission times in terms of the network speed.
- This allows the protocol to perform well on both the Internet and on local Ethernets.
- <P>
- In keeping with the minimalist design of the rest of the kernel, IL is small.
- The entire protocol is 847 lines of code, compared to 2200 lines for TCP.
- IL is our protocol of choice.
- </P>
- <H4>4 Network Addressing
- </H4>
- <P>
- A uniform interface to protocols and devices is not sufficient to
- support the transparency we require.
- Since each network uses a different
- addressing scheme,
- the ASCII strings written to a control file have no common format.
- As a result, every tool must know the specifics of the networks it
- is capable of addressing.
- Moreover, since each machine supplies a subset
- of the available networks, each user must be aware of the networks supported
- by every terminal and server machine.
- This is obviously unacceptable.
- </P>
- <P>
- Several possible solutions were considered and rejected; one deserves
- more discussion.
- We could have used a user-level file server
- to represent the network name space as a Plan 9 file tree.
- This global naming scheme has been implemented in other distributed systems.
- The file hierarchy provides paths to
- directories representing network domains.
- Each directory contains
- files representing the names of the machines in that domain;
- an example might be the path
- <TT>/net/name/usa/edu/mit/ai</TT>.
- Each machine file contains information like the IP address of the machine.
- We rejected this representation for several reasons.
- First, it is hard to devise a hierarchy encompassing all representations
- of the various network addressing schemes in a uniform manner.
- Datakit and Ethernet address strings have nothing in common.
- Second, the address of a machine is
- often only a small part of the information required to connect to a service on
- the machine.
- For example, the IP protocols require symbolic service names to be mapped into
- numeric port numbers, some of which are privileged and hence special.
- Information of this sort is hard to represent in terms of file operations.
- Finally, the size and number of the networks being represented burdens users with
- an unacceptably large amount of information about the organization of the network
- and its connectivity.
- In this case the Plan 9 representation of a
- resource as a file is not appropriate.
- </P>
- <P>
- If tools are to be network independent, a third-party server must resolve
- network names.
- A server on each machine, with local knowledge, can select the best network
- for any particular destination machine or service.
- Since the network devices present a common interface,
- the only operation which differs between networks is name resolution.
- A symbolic name must be translated to
- the path of the clone file of a protocol
- device and an ASCII address string to write to the
- <TT>ctl</TT>
- file.
- A connection server (CS) provides this service.
- </P>
- <H4>4.1 Network Database
- </H4>
- <P>
- On most systems several
- files such as
- <TT>/etc/hosts</TT>,
- <TT>/etc/networks</TT>,
- <TT>/etc/services</TT>,
- <TT>/etc/hosts.equiv</TT>,
- <TT>/etc/bootptab</TT>,
- and
- <TT>/etc/named.d</TT>
- hold network information.
- Much time and effort is spent
- administering these files and keeping
- them mutually consistent.
- Tools attempt to
- automatically derive one or more of the files from
- information in other files but maintenance continues to be
- difficult and error prone.
- </P>
- <P>
- Since we were writing an entirely new system, we were free to
- try a simpler approach.
- One database on a shared server contains all the information
- needed for network administration.
- Two ASCII files comprise the main database:
- <TT>/lib/ndb/local</TT>
- contains locally administered information and
- <TT>/lib/ndb/global</TT>
- contains information imported from elsewhere.
- The files contain sets of attribute/value pairs of the form
- <I>attr<TT>=</TT>value</I>,
- where
- <I>attr</I>
- and
- <I>value</I>
- are alphanumeric strings.
- Systems are described by multi-line entries;
- a header line at the left margin begins each entry followed by zero or more
- indented attribute/value pairs specifying
- names, addresses, properties, etc.
- For example, the entry for our CPU server
- specifies a domain name, an IP address, an Ethernet address,
- a Datakit address, a boot file, and supported protocols.
- <DL><DT><DD><TT><PRE>
- sys = helix
- 	dom=helix.research.bell-labs.com
- 	bootf=/mips/9power
- 	ip=135.104.9.31 ether=0800690222f0
- 	dk=nj/astro/helix
- 	proto=il flavor=9cpu
- </PRE></TT></DL>
- If several systems share entries such as
- network mask and gateway, we specify that information
- with the network or subnetwork instead of the system.
- The following entries define a Class B IP network and
- a few subnets derived from it.
- The entry for the network specifies the IP mask,
- file system, and authentication server for all systems
- on the network.
- Each subnetwork specifies its default IP gateway.
- <DL><DT><DD><TT><PRE>
- ipnet=mh-astro-net ip=135.104.0.0 ipmask=255.255.255.0
- 	fs=bootes.research.bell-labs.com
- 	auth=1127auth
- ipnet=unix-room ip=135.104.117.0
- 	ipgw=135.104.117.1
- ipnet=third-floor ip=135.104.51.0
- 	ipgw=135.104.51.1
- ipnet=fourth-floor ip=135.104.52.0
- 	ipgw=135.104.52.1
- </PRE></TT></DL>
- Database entries also define the mapping of service names
- to port numbers for TCP, UDP, and IL.
- <DL><DT><DD><TT><PRE>
- tcp=echo port=7
- tcp=discard port=9
- tcp=systat port=11
- tcp=daytime port=13
- </PRE></TT></DL>
- </P>
- <P>
- All programs read the database directly so
- consistency problems are rare.
- However the database files can become large.
- Our global file, containing all information about
- both Datakit and Internet systems in AT&T, has 43,000
- lines.
- To speed searches, we build hash table files for each
- attribute we expect to search often.
- The hash file entries point to entries
- in the master files.
- Every hash file contains the modification time of its master
- file so we can avoid using an out-of-date hash table.
- Searches for attributes that aren't hashed or whose hash table
- is out-of-date still work; they just take longer.
- </P>
- <H4>4.2 Connection Server
- </H4>
- <P>
- On each system a user level connection server process, CS, translates
- symbolic names to addresses.
- CS uses information about available networks, the network database, and
- other servers (such as DNS) to translate names.
- CS is a file server serving a single file,
- <TT>/net/cs</TT>.
- A client writes a symbolic name to
- <TT>/net/cs</TT>
- then reads one line for each matching destination reachable
- from this system.
- The lines are of the form
- <I>filename message</I>,
- where
- <I>filename</I>
- is the path of the clone file to open for a new connection and
- <I>message</I>
- is the string to write to it to make the connection.
- The following example illustrates this.
- <TT>Ndb/csquery</TT>
- is a program that prompts for strings to write to
- <TT>/net/cs</TT>
- and prints the replies.
- <DL><DT><DD><TT><PRE>
- % ndb/csquery
- > net!helix!9fs
- /net/il/clone 135.104.9.31!17008
- /net/dk/clone nj/astro/helix!9fs
- </PRE></TT></DL>
- </P>
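- <P>
- A program can use CS the same way
- <TT>ndb/csquery</TT>
- does.
- The following C sketch writes a symbolic name to
- <TT>/net/cs</TT>
- and prints each translation; the rewind before reading follows the convention used by the
- library routines and is an assumption, not something required by the description above.
- <DL><DT><DD><TT><PRE>
- #include <u.h>
- #include <libc.h>
- /* ask the connection server to translate a symbolic name; print each destination */
- void
- csquery(char *name)	/* e.g. "net!helix!9fs" */
- {
- 	int fd, n;
- 	char buf[128];
- 	fd = open("/net/cs", ORDWR);
- 	if(fd < 0)
- 		return;
- 	if(write(fd, name, strlen(name)) < 0){
- 		close(fd);
- 		return;
- 	}
- 	seek(fd, 0, 0);	/* rewind; each read now returns one matching destination */
- 	while((n = read(fd, buf, sizeof(buf)-1)) > 0){
- 		buf[n] = 0;
- 		print("%s\n", buf);	/* "filename message", e.g. /net/il/clone 135.104.9.31!17008 */
- 	}
- 	close(fd);
- }
- </PRE></TT></DL>
- </P>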
- <P>
- CS provides meta-name translation to perform complicated
- searches.
- The special network name
- <TT>net</TT>
- selects any network in common between source and
- destination supporting the specified service.
- A host name of the form <TT>$</TT><I>attr</I>
- is the name of an attribute in the network database.
- The database search returns the value
- of the matching attribute/value pair
- most closely associated with the source host.
- <I>Most closely associated</I> is defined on a per-network basis.
- For example, the symbolic name
- <TT>tcp!$auth!rexauth</TT>
- causes CS to search for the
- <TT>auth</TT>
- attribute in the database entry for the source system, then its
- subnetwork (if there is one) and then its network.
- <DL><DT><DD><TT><PRE>
- % ndb/csquery
- > net!$auth!rexauth
- /net/il/clone 135.104.9.34!17021
- /net/dk/clone nj/astro/p9auth!rexauth
- /net/il/clone 135.104.9.6!17021
- /net/dk/clone nj/astro/musca!rexauth
- </PRE></TT></DL>
- </P>
- <P>
- Normally CS derives naming information from its database files.
- For domain names however, CS first consults another user level
- process, the domain name server (DNS).
- If no DNS is reachable, CS relies on its own tables.
- </P>
- <P>
- Like CS, the domain name server is a user level process providing
- one file,
- <TT>/net/dns</TT>.
- A client writes a request of the form
- <I>domain-name type</I>,
- where
- <I>type</I>
- is a domain name service resource record type.
- DNS performs a recursive query through the
- Internet domain name system producing one line
- per resource record found. The client reads
- <TT>/net/dns</TT>
- to retrieve the records.
- Like other domain name servers, DNS caches information
- learned from the network.
- DNS is implemented as a multi-process shared memory application
- with separate processes listening for network and local requests.
- </P>
- <H4>5 Library routines
- </H4>
- <P>
- The section on protocol devices described the details
- of making and receiving connections across a network.
- The dance is straightforward but tedious.
- Library routines are provided to relieve
- the programmer of the details.
- </P>
- <H4>5.1 Connecting
- </H4>
- <P>
- The
- <TT>dial</TT>
- library call establishes a connection to a remote destination.
- It
- returns an open file descriptor for the
- <TT>data</TT>
- file in the connection directory.
- <DL><DT><DD><TT><PRE>
- int dial(char *dest, char *local, char *dir, int *cfdp)
- </PRE></TT></DL>
- </P>
- <DL COMPACT>
- <DT><TT>dest</TT><DD>
- is the symbolic name/address of the destination.
- <DT><TT>local</TT><DD>
- is the local address.
- Since most networks do not support this, it is
- usually zero.
- <DT><TT>dir</TT><DD>
- is a pointer to a buffer to hold the path name of the protocol directory
- representing this connection.
- <TT>Dial</TT>
- fills this buffer if the pointer is non-zero.
- <DT><TT>cfdp</TT><DD>
- is a pointer to a file descriptor for the
- <TT>ctl</TT>
- file of the connection.
- If the pointer is non-zero,
- <TT>dial</TT>
- opens the control file and tucks the file descriptor here.
- </dl>
- <br> <br>
- Most programs call
- <TT>dial</TT>
- with a destination name and all other arguments zero.
- <TT>Dial</TT>
- uses CS to
- translate the symbolic name to all possible destination addresses
- and attempts to connect to each in turn until one works.
- Specifying the special name
- <TT>net</TT>
- in the network portion of the destination
- allows CS to pick a network/protocol in common
- with the destination for which the requested service is valid.
- For example, assume the system
- <TT>research.bell-labs.com</TT>
- has the Datakit address
- <TT>nj/astro/research</TT>
- and IP addresses
- <TT>135.104.117.5</TT>
- and
- <TT>129.11.4.1</TT>.
- The call
- <DL><DT><DD><TT><PRE>
- fd = dial("net!research.bell-labs.com!login", 0, 0, 0);
- </PRE></TT></DL>
- tries in succession to connect to
- <TT>nj/astro/research!login</TT>
- on the Datakit and both
- <TT>135.104.117.5!513</TT>
- and
- <TT>129.11.4.1!513</TT>
- across the Internet.
- <P>
- <TT>Dial</TT>
- accepts addresses instead of symbolic names.
- For example, the destinations
- <TT>tcp!135.104.117.5!513</TT>
- and
- <TT>tcp!research.bell-labs.com!login</TT>
- are equivalent
- references to the same machine.
- </P>
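- <P>
- A complete client built on
- <TT>dial</TT>
- is correspondingly short.
- The following sketch connects to an echo service, sends a short message, and prints the
- reply; the destination name is only an example.
- <DL><DT><DD><TT><PRE>
- #include <u.h>
- #include <libc.h>
- /* dial a remote echo service, write a message, and print what comes back */
- void
- echoclient(char *dest)	/* e.g. "net!research.bell-labs.com!echo" */
- {
- 	int fd, n;
- 	char buf[256];
- 	fd = dial(dest, 0, 0, 0);	/* returns an fd for the connection's data file */
- 	if(fd < 0){
- 		fprint(2, "dial %s: %r\n", dest);
- 		return;
- 	}
- 	if(write(fd, "hello", 5) != 5){
- 		fprint(2, "write: %r\n");
- 		close(fd);
- 		return;
- 	}
- 	n = read(fd, buf, sizeof(buf)-1);
- 	if(n > 0){
- 		buf[n] = 0;
- 		print("%s\n", buf);
- 	}
- 	close(fd);
- }
- </PRE></TT></DL>
- </P>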
- <H4>5.2 Listening
- </H4>
- <P>
- A program uses
- four routines to listen for incoming connections.
- It first
- <TT>announce()</TT>s
- its intention to receive connections,
- then
- <TT>listen()</TT>s
- for calls and finally
- <TT>accept()</TT>s
- or
- <TT>reject()</TT>s
- them.
- <TT>Announce</TT>
- returns an open file descriptor for the
- <TT>ctl</TT>
- file of a connection and fills
- <TT>dir</TT>
- with the
- path of the protocol directory
- for the announcement.
- <DL><DT><DD><TT><PRE>
- int announce(char *addr, char *dir)
- </PRE></TT></DL>
- <TT>Addr</TT>
- is the symbolic name/address announced;
- if it does not contain a service, the announcement is for
- all services not explicitly announced.
- Thus, one can easily write the equivalent of the
- <TT>inetd</TT>
- program without
- having to announce each separate service.
- An announcement remains in force until the control file is
- closed.
- </P>
- <br> <br>
- <TT>Listen</TT>
- returns an open file descriptor for the
- <TT>ctl</TT>
- file and fills
- <TT>ldir</TT>
- with the path
- of the protocol directory
- for the received connection.
- It is passed
- <TT>dir</TT>
- from the announcement.
- <DL><DT><DD><TT><PRE>
- int listen(char *dir, char *ldir)
- </PRE></TT></DL>
- <br> <br>
- <TT>Accept</TT>
- and
- <TT>reject</TT>
- are called with the control file descriptor and
- <TT>ldir</TT>
- returned by
- <TT>listen.</TT>
- Some networks such as Datakit accept a reason for a rejection;
- networks such as IP ignore the third argument.
- <DL><DT><DD><TT><PRE>
- int accept(int ctl, char *ldir)
- int reject(int ctl, char *ldir, char *reason)
- </PRE></TT></DL>
- <P>
- The following code implements a typical TCP listener.
- It announces itself, listens for connections, and forks a new
- process for each.
- The new process echoes data on the connection until the
- remote end closes it.
- The "*" in the symbolic name means the announcement is valid for
- any addresses bound to the machine the program is run on.
- <DL><DT><DD><TT><PRE>
- int
- echo_server(void)
- {
- 	int afd, dfd, lcfd;
- 	char adir[40], ldir[40];
- 	int n;
- 	char buf[256];
- 	afd = announce("tcp!*!echo", adir);
- 	if(afd < 0)
- 		return -1;
- 	for(;;){
- 		/* listen for a call */
- 		lcfd = listen(adir, ldir);
- 		if(lcfd < 0)
- 			return -1;
- 		/* fork a process to echo */
- 		switch(fork()){
- 		case 0:
- 			/* accept the call and open the data file */
- 			dfd = accept(lcfd, ldir);
- 			if(dfd < 0)
- 				return -1;
- 			/* echo until EOF */
- 			while((n = read(dfd, buf, sizeof(buf))) > 0)
- 				write(dfd, buf, n);
- 			exits(0);
- 		case -1:
- 			perror("forking");
- 		default:
- 			close(lcfd);
- 			break;
- 		}
- 	}
- }
- </PRE></TT></DL>
- </P>
- <H4>6 User Level
- </H4>
- <P>
- Communication between Plan 9 machines is done almost exclusively in
- terms of 9P messages. Only the two services
- <TT>cpu</TT>
- and
- <TT>exportfs</TT>
- are used.
- The
- <TT>cpu</TT>
- service is analogous to
- <TT>rlogin</TT>.
- However, rather than emulating a terminal session
- across the network,
- <TT>cpu</TT>
- creates a process on the remote machine whose name space is an analogue of the window
- in which it was invoked.
- <TT>Exportfs</TT>
- is a user level file server which allows a piece of name space to be
- exported from machine to machine across a network. It is used by the
- <TT>cpu</TT>
- command to serve the files in the terminal's name space when they are
- accessed from the
- cpu server.
- </P>
- <P>
- By convention, the protocol and device driver file systems are mounted in a
- directory called
- <TT>/net</TT>.
- Although the per-process name space allows users to configure an
- arbitrary view of the system, in practice their profiles build
- a conventional name space.
- </P>
- <H4>6.1 Exportfs
- </H4>
- <P>
- <TT>Exportfs</TT>
- is invoked by an incoming network call.
- The
- <I>listener</I>
- (the Plan 9 equivalent of
- <TT>inetd</TT>)
- runs the profile of the user
- requesting the service to construct a name space before starting
- <TT>exportfs</TT>.
- After an initial protocol
- establishes the root of the file tree being
- exported,
- the remote process mounts the connection,
- allowing
- <TT>exportfs</TT>
- to act as a relay file server. Operations in the imported file tree
- are executed on the remote server and the results returned.
- As a result
- the name space of the remote machine appears to be exported into a
- local file tree.
- </P>
- <P>
- The
- <TT>import</TT>
- command calls
- <TT>exportfs</TT>
- on a remote machine, mounts the result in the local name space,
- and
- exits.
- No local process is required to serve mounts;
- 9P messages are generated by the kernel's mount driver and sent
- directly over the network.
- </P>
- <P>
- <TT>Exportfs</TT>
- must be multithreaded since the system calls
- <TT>open,</TT>
- <TT>read</TT>
- and
- <TT>write</TT>
- may block.
- Plan 9 does not implement the
- <TT>select</TT>
- system call but does allow processes to share file descriptors,
- memory and other resources.
- <TT>Exportfs</TT>
- and the configurable name space
- provide a means of sharing resources between machines.
- It is a building block for constructing complex name spaces
- served from many machines.
- </P>
- <P>
- The simplicity of the interfaces encourages naive users to exploit the potential
- of a richly connected environment.
- Using these tools it is easy to gateway between networks.
- For example, a terminal with only a Datakit connection can import from the server
- <TT>helix</TT>:
- <DL><DT><DD><TT><PRE>
- import -a helix /net
- telnet ai.mit.edu
- </PRE></TT></DL>
- The
- <TT>import</TT>
- command makes a Datakit connection to the machine
- <TT>helix</TT>
- where
- it starts an instance of
- <TT>exportfs</TT>
- to serve
- <TT>/net</TT>.
- The
- <TT>import</TT>
- command mounts the remote
- <TT>/net</TT>
- directory after (the
- <TT>-a</TT>
- option to
- <TT>import</TT>)
- the existing contents
- of the local
- <TT>/net</TT>
- directory.
- The directory contains the union of the local and remote contents of
- <TT>/net</TT>.
- Local entries supersede remote ones of the same name so
- networks on the local machine are chosen in preference
- to those supplied remotely.
- However, unique entries in the remote directory are now visible in the local
- <TT>/net</TT>
- directory.
- All the networks connected to
- <TT>helix</TT>,
- not just Datakit,
- are now available in the terminal. The effect on the name space is shown by the following
- example:
- <DL><DT><DD><TT><PRE>
- philw-gnot% ls /net
- /net/cs
- /net/dk
- philw-gnot% import -a musca /net
- philw-gnot% ls /net
- /net/cs
- /net/cs
- /net/dk
- /net/dk
- /net/dns
- /net/ether
- /net/il
- /net/tcp
- /net/udp
- </PRE></TT></DL>
- </P>
- <H4>6.2 Ftpfs
- </H4>
- <P>
- We decided to make our interface to FTP
- a file system rather than the traditional command.
- Our command,
- <I>ftpfs,</I>
- dials the FTP port of a remote system, prompts for login and password, sets image mode,
- and mounts the remote file system onto
- <TT>/n/ftp</TT>.
- Files and directories are cached to reduce traffic.
- The cache is updated whenever a file is created.
- Ftpfs works with TOPS-20, VMS, and various Unix flavors
- as the remote system.
- </P>
- <H4>7 Cyclone Fiber Links
- </H4>
- <P>
- The file servers and CPU servers are connected by
- high-bandwidth
- point-to-point links.
- A link consists of two VME cards connected by a pair of optical
- fibers.
- The VME cards use 33MHz Intel 960 processors and AMD's TAXI
- fiber transmitter/receivers to drive the lines at 125 Mbit/sec.
- Software in the VME card reduces latency by copying messages from system memory
- to fiber without intermediate buffering.
- </P>
- <H4>8 Performance
- </H4>
- <P>
- We measured both latency and throughput
- of reading and writing bytes between two processes
- for a number of different paths.
- Measurements were made on two- and four-CPU SGI Power Series processors.
- The CPUs are 25 MHz MIPS 3000s.
- The latency is measured as the round trip time
- for a byte sent from one process to another and
- back again.
- Throughput is measured using 16k writes from
- one process to another.
- <DL><DT><DD><TT><PRE>
- <br><img src="data.7581.gif"><br>
- </PRE></TT></DL>
- </P>
- <H4>9 Conclusion
- </H4>
- <P>
- The representation of all resources as file systems
- coupled with an ASCII interface has proved more powerful
- than we had originally imagined.
- Resources can be used by any computer in our networks
- independent of byte ordering or CPU type.
- The connection server provides an elegant means
- of decoupling tools from the networks they use.
- Users successfully use Plan 9 without knowing the
- topology of the system or the networks they use.
- More information about 9P can be found in Section 5 of the Plan 9 Programmer's
- Manual, Volume I.
- </P>
- <H4>10 References
- </H4>
- <br> <br>
- [Pike90] R. Pike, D. Presotto, K. Thompson, H. Trickey,
- ``Plan 9 from Bell Labs'',
- UKUUG Proc. of the Summer 1990 Conf. ,
- London, England,
- 1990.
- <br> <br>
- [Needham] R. Needham, ``Names'', in
- Distributed systems,
- S. Mullender, ed.,
- Addison Wesley, 1989.
- <br> <br>
- [Presotto] D. Presotto, ``Multiprocessor Streams for Plan 9'',
- UKUUG Proc. of the Summer 1990 Conf. ,
- London, England, 1990.
- <br> <br>
- [Met80] R. Metcalfe, D. Boggs, C. Crane, E. Taf and J. Hupp, ``The
- Ethernet Local Network: Three reports'',
- CSL-80-2,
- XEROX Palo Alto Research Center, February 1980.
- <br> <br>
- [Fra80] A. G. Fraser, ``Datakit - A Modular Network for Synchronous
- and Asynchronous Traffic'',
- Proc. Int'l Conf. on Communication,
- Boston, June 1980.
- <br> <br>
- [Pet89a] L. Peterson, ``RPC in the X-Kernel: Evaluating new Design Techniques'',
- Proc. Twelfth Symp. on Op. Sys. Princ.,
- Litchfield Park, AZ, December 1989.
- <br> <br>
- [Rit84a] D. M. Ritchie, ``A Stream Input-Output System'',
- AT&T Bell Laboratories Technical Journal, 63(8),
- October 1984.
- <br> <br>
- </body></html>