.Vx 17 11 November 87 1 32 "ROB PIKE" "THE TEXT EDITOR SAM" .ds DY "31 May 1987 .ds DR "Revised 1 July 1987 .de CW \" puts first arg in CW font, same as UL; maintains font \%\&\\$3\f(CW\\$1\fP\&\\$2 .. .de Cs .br .fi .ft 2 .ps -2 .vs -2 .. .de Ce .br .nf .ft 1 .ps .vs .sp .. .TL The Text Editor \&\f(CWsam\fP .AU Rob Pike rob@plan9.bell-labs.com .AB .LP .CW Sam is an interactive multi-file text editor intended for bitmap displays. A textual command language supplements the mouse-driven, cut-and-paste interface to make complex or repetitive editing tasks easy to specify. The language is characterized by the composition of regular expressions to describe the structure of the text being modified. The treatment of files as a database, with changes logged as atomic transactions, guides the implementation and makes a general `undo' mechanism straightforward. .PP .CW Sam is implemented as two processes connected by a low-bandwidth stream, one process handling the display and the other the editing algorithms. Therefore it can run with the display process in a bitmap terminal and the editor on a local host, with both processes on a bitmap-equipped host, or with the display process in the terminal and the editor in a remote host. By suppressing the display process, it can even run without a bitmap terminal. .PP This paper is reprinted from Software\(emPractice and Experience, Vol 17, number 11, pp. 813-845, November 1987. The paper has not been updated for the Plan 9 manuals. Although .CW Sam has not changed much since the paper was written, the system around it certainly has. Nonetheless, the description here still stands as the best introduction to the editor. .AE .SH Introduction .LP .CW Sam is an interactive text editor that combines cut-and-paste interactive editing with an unusual command language based on the composition of regular expressions. 
It is written as two programs: one, the `host part,' runs on a UNIX system and implements the command language and provides file access; the other, the `terminal part,' runs asynchronously on a machine with a mouse and bitmap display and supports the display and interactive editing. The host part may even be run in isolation on an ordinary terminal to edit text using the command language, much like a traditional line editor, without assistance from a mouse or display. Most often, the terminal part runs on a Blit\u\s-4\&1\s+4\d terminal (actually on a Teletype DMD 5620, the production version of the Blit), whose host connection is an ordinary 9600 bps RS232 link; on the SUN computer the host and display processes run on a single machine, connected by a pipe. .PP .CW Sam edits uninterpreted ASCII text. It has no facilities for multiple fonts, graphics or tables, unlike MacWrite,\u\s-4\&2\s+4\d Bravo,\u\s-4\&3\s+4\d Tioga\u\s-4\&4\s+4\d or Lara.\u\s-4\&5\s+4\d Also unlike them, it has a rich command language. (Throughout this paper, the phrase .I command language .R refers to textual commands; commands activated from the mouse form the .I mouse .I language. ) .CW Sam developed as an editor for use by programmers, and tries to join the style of the UNIX text editor .CW ed \u\s-4\&6,7\s+4\d with that of interactive cut-and-paste editors by providing a comfortable mouse-driven interface to a program with a solid command language driven by regular expressions. The command language developed more than the mouse language, and acquired a notation for describing the structure of files more richly than as a sequence of lines, using a dataflow-like syntax for specifying changes. .PP The interactive style was influenced by .CW jim ,\u\s-4\&1\s+4\d an early cut-and-paste editor for the Blit, and by .CW mux ,\u\s-4\&8\s+4\d the Blit window system. 
.CW Mux merges the original Blit window system, .CW mpx ,\u\s-4\&1\s+4\d with cut-and-paste editing, forming something like a multiplexed version of .CW jim that edits the output of (and input to) command sessions rather than files. .PP The first part of this paper describes the command language, then the mouse language, and explains how they interact. That is followed by a description of the implementation, first of the host part, then of the terminal part. A principle that influenced the design of .CW sam is that it should have no explicit limits, such as upper limits on file size or line length. A secondary consideration is that it be efficient. To honor these two goals together requires a method for efficiently manipulating huge strings (files) without breaking them into lines, perhaps while making thousands of changes under control of the command language. .CW Sam 's method is to treat the file as a transaction database, implementing changes as atomic updates. These updates may be unwound easily to `undo' changes. Efficiency is achieved through a collection of caches that minimizes disc traffic and data motion, both within the two parts of the program and between them. .PP The terminal part of .CW sam is fairly straightforward. More interesting is how the two halves of the editor stay synchronized when either half may initiate a change. This is achieved through a data structure that organizes the communications and is maintained in parallel by both halves. .PP The last part of the paper chronicles the writing of .CW sam and discusses the lessons that were learned through its development and use. .PP The paper is long, but is composed largely of two papers of reasonable length: a description of the user interface of .CW sam and a discussion of its implementation. They are combined because the implementation is strongly influenced by the user interface, and vice versa. .SH The Interface .LP .CW Sam is a text editor for multiple files. 
File names may be provided when it is invoked: .P1 sam file1 file2 ... .P2 and there are commands to add new files and discard unneeded ones. Files are not read until necessary to complete some command. Editing operations apply to an internal copy made when the file is read; the UNIX file associated with the copy is changed only by an explicit command. To simplify the discussion, the internal copy is here called a .I file , while the disc-resident original is called a .I disc file. .R .PP .CW Sam is usually connected to a bitmap display that presents a cut-and-paste editor driven by the mouse. In this mode, the command language is still available: text typed in a special window, called the .CW sam .I window, is interpreted as commands to be executed in the current file. Cut-and-paste editing may be used in any window \(em even in the .CW sam window to construct commands. The other mode of operation, invoked by starting .CW sam with the option .CW -d (for `no download'), does not use the mouse or bitmap display, but still permits editing using the textual command language, even on an ordinary terminal, interactively or from a script. .PP The following sections describe first the command language (under .CW sam\ -d and in the .CW sam window), and then the mouse interface. These two languages are nearly independent, but connect through the .I current .I text, described below. .SH 2 The Command Language .LP A file consists of its contents, which are an array of characters (that is, a string); the .I name of the associated disc file; the .I modified bit .R that states whether the contents match those of the disc file; and a substring of the contents, called the .I current text .R or .I dot (see Figures 1 and 2). If the current text is a null string, dot falls between characters. The .I value of dot is the location of the current text; the .I contents of dot are the characters it contains. 
.CW Sam imparts to the text no two-dimensional interpretation such as columns or fields; text is always one-dimensional. Even the idea of a `line' of text as understood by most UNIX programs \(em a sequence of characters terminated by a newline character \(em is only weakly supported. .PP The .I current file .R is the file to which editing commands refer. The current text is therefore dot in the current file. If a command doesn't explicitly name a particular file or piece of text, the command is assumed to apply to the current text. For the moment, ignore the presence of multiple files and consider editing a single file. .KF L .BP fig1.ps 3.5i .Cs Figure 1. A typical .CW sam screen, with the editing menu presented. The .CW sam (command language) window is in the middle, with file windows above and below. (The user interface makes it easy to create these abutting windows.) The partially obscured window is a third file window. The uppermost window is that to which typing and mouse operations apply, as indicated by its heavy border. Each window has its current text highlighted in reverse video. The .CW sam window's current text is the null string on the last visible line, indicated by a vertical bar. See also Figure 2. .Ce .KE .PP Commands have one-letter names. Except for non-editing commands such as writing the file to disc, most commands make some change to the text in dot and leave dot set to the text resulting from the change. For example, the delete command, .CW d , deletes the text in dot, replacing it by the null string and setting dot to the result. The change command, .CW c , replaces dot by text delimited by an arbitrary punctuation character, conventionally a slash. Thus, .P1 c/Peter/ .P2 replaces the text in dot by the string .CW Peter . Similarly, .P1 a/Peter/ .P2 (append) adds the string after dot, and .P1 i/Peter/ .P2 (insert) inserts before dot. All three leave dot set to the new text, .CW Peter . 
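.PP
The effect of these four commands on the text and on dot can be sketched in a few lines of Python.
This is a minimal model written for illustration, not
.CW sam 's
implementation: the class name
.CW File
and the fields
.CW q0
and
.CW q1
(the two ends of dot, as character offsets) are assumptions made here.

```python
from dataclasses import dataclass


@dataclass
class File:
    """Hypothetical model of a file: contents, dot, name and modified bit."""
    text: str
    q0: int = 0            # start of dot (character offset)
    q1: int = 0            # end of dot; q0 == q1 means dot is a null string
    name: str = ""         # name of the associated disc file
    modified: bool = False

    def contents_of_dot(self) -> str:
        return self.text[self.q0:self.q1]

    def _replace(self, start: int, end: int, new: str) -> None:
        # Every command leaves dot set to the text resulting from the change.
        self.text = self.text[:start] + new + self.text[end:]
        self.q0, self.q1 = start, start + len(new)
        self.modified = True

    def change(self, new: str) -> None:   # c/new/
        self._replace(self.q0, self.q1, new)

    def append(self, new: str) -> None:   # a/new/
        self._replace(self.q1, self.q1, new)

    def insert(self, new: str) -> None:   # i/new/
        self._replace(self.q0, self.q0, new)

    def delete(self) -> None:             # d
        self._replace(self.q0, self.q1, "")
```

In this model, deleting dot leaves a null dot at the point of deletion, so a following append or insert adds text exactly there.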
.PP Newlines are part of the syntax of commands: the newline character lexically terminates a command. Within the inserted text, however, newlines are never implicit. But since it is often convenient to insert multiple lines of text, .CW sam has a special syntax for that case: .P1 a some lines of text to be inserted in the file, terminated by a period on a line by itself \&. .P2 In the one-line syntax, a newline character may be specified by a C-like escape, so .P1 c/\en/ .P2 replaces dot by a single newline character. .PP .CW Sam also has a substitute command, .CW s : .P1 s/\f2expression\fP/\f2replacement\fP/ .P2 substitutes the replacement text for the first match, in dot, of the regular expression. Thus, if dot is the string .CW Peter , the command .P1 s/t/st/ .P2 changes it to .CW Pester . In general, .CW s is unnecessary, but it was inherited from .CW ed and it has some convenient variations. For instance, the replacement text may include the matched text, specified by .CW & : .P1 s/Peter/Oh, &, &, &, &!/ .P2 .PP There are also three commands that apply programs to text: .P1 < \f2UNIX program\fP .P2 replaces dot by the output of the UNIX program. Similarly, the .CW > command runs the program with dot as its standard input, and .CW | does both. For example, .P1 | sort .P2 replaces dot by the result of applying the standard sorting utility to it. Again, newlines have no special significance for these .CW sam commands. The text acted upon and resulting from these commands is not necessarily bounded by newlines, although for connection with UNIX programs, newlines may be necessary to obey conventions. .PP One more command: .CW p prints the contents of dot. Table I summarizes .CW sam 's commands. .KF .TS center; c s lfCW l. Table I. 
\f(CWSam\fP commands .sp .4 .ft CW _ .ft .sp .4 \f1Text commands\fP .sp .4 _ .sp .4 a/\f2text\fP/ Append text after dot c/\f2text\fP/ Change text in dot i/\f2text\fP/ Insert text before dot d Delete text in dot s/\f2regexp\fP/\f2text\fP/ Substitute text for match of regular expression in dot m \f2address\fP Move text in dot after address t \f2address\fP Copy text in dot after address .sp .4 _ .sp .4 \f1Display commands\fP .sp .4 _ .sp .2 p Print contents of dot \&= Print value (line numbers and character numbers) of dot .sp .4 _ .sp .4 \f1File commands\fP .sp .4 _ .sp .2 b \f2file-list\fP Set current file to first file in list that \f(CWsam\fP has in menu B \f2file-list\fP Same as \f(CWb\fP, but load new files n Print menu lines of all files D \f2file-list\fP Delete named files from \f(CWsam\fP .sp .4 _ .sp .4 \f1I/O commands\fP .sp .4 _ .sp .2 e \f2filename\fP Replace file with named disc file r \f2filename\fP Replace dot by contents of named disc file w \f2filename\fP Write file to named disc file f \f2filename\fP Set file name and print new menu line < \f2UNIX-command\fP Replace dot by standard output of command > \f2UNIX-command\fP Send dot to standard input of command | \f2UNIX-command\fP Replace dot by result of command applied to dot ! 
\f2UNIX-command\fP Run the command .sp .4 _ .sp .4 \f1Loops and conditionals\fP .sp .4 _ .sp .2 x/\f2regexp\fP/ \f2command\fP For each match of regexp, set dot and run command y/\f2regexp\fP/ \f2command\fP Between adjacent matches of regexp, set dot and run command X/\f2regexp\fP/ \f2command\fP Run command in each file whose menu line matches regexp Y/\f2regexp\fP/ \f2command\fP Run command in each file whose menu line does not match g/\f2regexp\fP/ \f2command\fP If dot contains a match of regexp, run command v/\f2regexp\fP/ \f2command\fP If dot does not contain a match of regexp, run command .sp .4 _ .sp .4 \f1Miscellany\fP .sp .4 _ .sp .2 k Set address mark to value of dot q Quit u \f2n\fP Undo last \f2n\fP (default 1) changes { } Braces group commands .sp .3 .ft CW _ .ft .TE .sp .KE .PP The value of dot may be changed by specifying an .I address for the command. The simplest address is a line number: .P1 3 .P2 refers to the third line of the file, so .P1 3d .P2 deletes the third line of the file, and implicitly renumbers the lines so the old line 4 is now numbered 3. (This is one of the few places where .CW sam deals with lines directly.) Line .CW 0 is the null string at the beginning of the file. If a command consists of only an address, a .CW p command is assumed, so typing an unadorned .CW 3 prints line 3 on the terminal. There are a couple of other basic addresses: a period addresses dot itself; and a dollar sign .CW $ ) ( addresses the null string at the end of the file. .PP An address is always a single substring of the file. Thus, the address .CW 3 addresses the characters after the second newline of the file through the third newline of the file. A .I compound address .R is constructed by the comma operator .P1 \f2address1\fP,\f2address2\fP .P2 and addresses the substring of the file from the beginning of .I address1 to the end of .I address2 . 
For example, the command .CW 3,5p prints the third through fifth lines of the file and .CW .,$d deletes the text from the beginning of dot to the end of the file. .PP These addresses are all absolute positions in the file, but .CW sam also has relative addresses, indicated by .CW + or .CW - . For example, .P1 $-3 .P2 is the third line before the end of the file and .P1 \&.+1 .P2 is the line after dot. If no address appears to the left of the .CW + or .CW - , dot is assumed; if nothing appears to the right, .CW 1 is assumed. Therefore, .CW .+1 may be abbreviated to just a plus sign. .PP The .CW + operator acts relative to the end of its first argument, while the .CW - operator acts relative to the beginning. Thus .CW .+1 addresses the first line after dot, .CW .- addresses the first line before dot, and .CW +- refers to the line containing the end of dot. (Dot may span multiple lines, and .CW + selects the line after the end of dot, then .CW - backs up one line.) .PP The final type of address is a regular expression, which addresses the text matched by the expression. The expression is enclosed in slashes, as in .P1 /\f2expression\fP/ .P2 The expressions are the same as those in the UNIX program .CW egrep ,\u\s-4\&6,7\s+4\d and include closures, alternations, and so on. They find the .I leftmost longest .R string that matches the expression, that is, the first match after the point where the search is started, and if more than one match begins at the same spot, the longest such match. (I assume familiarity with the syntax for regular expressions in UNIX programs.\u\s-4\&9\s+4\d) For example, .P1 /x/ .P2 matches the next .CW x character in the file, .P1 /xx*/ .P2 matches the next run of one or more .CW x 's, and .P1 /x|Peter/ .P2 matches the next .CW x or .CW Peter . 
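.PP
Since every address is a single substring of the file, an address can be modeled as a pair of character offsets.
The Python sketch below covers line numbers, the dollar sign, the comma operator and a forward regular expression search.
The function names are hypothetical, relative line addresses are left out, and Python's
.CW re
module is only an approximation: for alternations it takes the first alternative that matches rather than the longest.

```python
import re


def line(text, n):
    """Range of line n (1-based); line 0 is the null string at the start."""
    if n == 0:
        return (0, 0)
    start = 0
    for _ in range(n - 1):
        # str.index raises ValueError if the file has too few lines.
        start = text.index("\n", start) + 1
    nl = text.find("\n", start)
    end = len(text) if nl < 0 else nl + 1   # the newline belongs to the line
    return (start, end)


def end_of_file(text):
    """The address $: the null string at the end of the file."""
    return (len(text), len(text))


def comma(a1, a2):
    """a1,a2: from the beginning of a1 to the end of a2."""
    return (a1[0], a2[1])


def forward(text, dot, pattern):
    """/pattern/: first match after the end of dot, or None (no wraparound)."""
    m = re.compile(pattern).search(text, dot[1])
    return m.span() if m else None
```

With these definitions,
.CW comma(line(text,\ 3),\ line(text,\ 5))
corresponds to the address
.CW 3,5 .
.PP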
For compatibility with other UNIX programs, the `any character' operator, a period, does not match a newline, so .P1 /.*/ .P2 matches the text from dot to the end of the line, but excludes the newline and so will not match across the line boundary. .PP Regular expressions are always relative addresses. The direction is forwards by default, so .CW /Peter/ is really an abbreviation for .CW +/Peter/ . The search can be reversed with a minus sign, so .P1 .CW -/Peter/ .P2 finds the first .CW Peter before dot. Regular expressions may be used with other address forms, so .CW 0+/Peter/ finds the first .CW Peter in the file and .CW $-/Peter/ finds the last. Table II summarizes .CW sam 's addresses. .KF .TS center; c s lfCW l. Table II. \f(CWSam\fP addresses .sp .4 .ft CW _ .ft .sp .4 \f1Simple addresses\fP .sp .4 _ .sp .2 #\f2n\fP The empty string after character \f2n\fP \f2n\fP Line \f2n\fP. /\f2regexp\fP/ The first following match of the regular expression -/\f2regexp\fP/ The first previous match of the regular expression $ The null string at the end of the file \&. Dot \&' The address mark, set by \f(CWk\fP command "\f2regexp\fP" Dot in the file whose menu line matches regexp .sp .4 _ .sp .4 \f1Compound addresses\fP .sp .4 _ .sp .2 \f2a1\fP+\f2a2\fP The address \f2a2\fP evaluated starting at right of \f2a1\fP \f2a1\fP-\f2a2\fP \f2a2\fP evaluated in the reverse direction starting at left of \f2a1\fP \f2a1\fP,\f2a2\fP From the left of \f2a1\fP to the right of \f2a2\fP (default \f(CW0,$\fP) \f2a1\fP;\f2a2\fP Like \f(CW,\fP but sets dot after evaluating \f2a1\fP .sp .4 _ .sp .4 .T& c s. T{ The operators .CW + and .CW - are high precedence, while .CW , and .CW ; are low precedence. In both .CW + and .CW - forms, .I a2 defaults to 1 and .I a1 defaults to dot. If both .I a1 and .I a2 are present, .CW + may be elided. 
T} .sp .5 .ft CW _ .ft .TE .sp .KE .PP The language discussed so far will not seem novel to people who use UNIX text editors such as .CW ed or .CW vi .\u\s-4\&9\s+4\d Moreover, the kinds of editing operations these commands allow, with the exception of regular expressions and line numbers, are clearly more conveniently handled by a mouse-based interface. Indeed, .CW sam 's mouse language (discussed at length below) is the means by which simple changes are usually made. For large or repetitive changes, however, a textual language outperforms a manual interface. .PP Imagine that, instead of deleting just one occurrence of the string .CW Peter , we wanted to eliminate every .CW Peter . What's needed is an iterator that runs a command for each occurrence of some text. .CW Sam 's iterator is called .CW x , for extract: .P1 x/\f2expression\fP/ \f2command\fP .P2 finds all matches in dot of the specified expression, and for each such match, sets dot to the text matched and runs the command. So to delete all the .CW Peters: .P1 0,$ x/Peter/ d .P2 (Blanks in these examples are to improve readability; .CW sam neither requires nor interprets them.) This searches the entire file .CW 0,$ ) ( for occurrences of the string .CW Peter , and runs the .CW d command with dot set to each such occurrence. (By contrast, the comparable .CW ed command would delete all .I lines containing .CW Peter ; .CW sam deletes only the .CW Peters .) The address .CW 0,$ is commonly used, and may be abbreviated to just a comma. As another example, .P1 , x/Peter/ p .P2 prints a list of .CW Peters, one for each appearance in the file, with no intervening text (not even newlines to separate the instances). .PP Of course, the text extracted by .CW x may be selected by a regular expression, which complicates deciding what set of matches is chosen \(em matches may overlap. 
This is resolved by generating the matches starting from the beginning of dot using the leftmost-longest rule, and searching for each match starting from the end of the previous one. Regular expressions may also match null strings, but a null match adjacent to a non-null match is never selected; at least one character must intervene. For example, .P1 , c/AAA/ x/B*/ c/-/ , p .P2 produces as output .P1 -A-A-A- .P2 because the pattern .CW B* matches the null strings separating the .CW A 's. .PP The .CW x command has a complement, .CW y , with similar syntax, that executes the command with dot set to the text .I between the matches of the expression. For example, .P1 , c/AAA/ y/A/ c/-/ , p .P2 produces the same result as the example above. .PP The .CW x and .CW y commands are looping constructs, and .CW sam has a pair of conditional commands to go with them. They have similar syntax: .P1 g/\f2expression\fP/ \f2command\fP .P2 (guard) runs the command exactly once if dot contains a match of the expression. This is different from .CW x , which runs the command for .I each match: .CW x loops; .CW g merely tests, without changing the value of dot. Thus, .P1 , x/Peter/ d .P2 deletes all occurrences of .CW Peter , but .P1 , g/Peter/ d .P2 deletes the whole file (reduces it to a null string) if .CW Peter occurs anywhere in the text. The complementary conditional is .CW v , which runs the command if there is .I no match of the expression. .PP These control-structure-like commands may be composed to construct more involved operations. For example, to print those lines of text that contain the string .CW Peter : .P1 , x/.*\en/ g/Peter/ p .P2 The .CW x breaks the file into lines, the .CW g selects those lines containing .CW Peter , and the .CW p prints them. This command gives an address for the .CW x command (the whole file), but because .CW g does not have an explicit address, it applies to the value of dot produced by the .CW x command, that is, to each line. 
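.PP
The loops and conditionals described above can be sketched as Python functions over character ranges.
This is an illustrative model with hypothetical names
.CW extract , (
.CW between ,
.CW guard
and
.CW unless
stand for
.CW x ,
.CW y ,
.CW g
and
.CW v ),
built on Python's
.CW re
module, which only approximates the leftmost-longest rule.

```python
import re


def extract(text, dot, pattern):
    """x: yield the range of each match of pattern inside dot."""
    prev_end = None
    for m in re.compile(pattern).finditer(text, dot[0], dot[1]):
        # Skip a null match that directly follows the previous match.
        if m.start() == m.end() == prev_end:
            continue
        prev_end = m.end()
        yield m.span()


def between(text, dot, pattern):
    """y: yield the ranges between adjacent matches of pattern inside dot."""
    start = dot[0]
    for m0, m1 in extract(text, dot, pattern):
        yield (start, m0)
        start = m1
    yield (start, dot[1])


def guard(text, dot, pattern):
    """g: true if dot contains a match; dot itself is left unchanged."""
    return re.compile(pattern).search(text, dot[0], dot[1]) is not None


def unless(text, dot, pattern):
    """v: true if dot contains no match."""
    return not guard(text, dot, pattern)


def lines_containing(text, pattern):
    """Model of  , x/.*\\n/ g/pattern/ p  returning the printed pieces."""
    whole = (0, len(text))
    return [text[a:b] for a, b in extract(text, whole, r".*\n")
            if guard(text, (a, b), pattern)]
```

The functions only read the text; a command that changes the text while iterating would also have to keep the later ranges valid.
.PP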
All commands in .CW sam except for the command to write a file to disc use dot for the default address. .PP Composition may be continued indefinitely. .P1 , x/.*\en/ g/Peter/ v/SaltPeter/ p .P2 prints those lines containing .CW Peter but .I not those containing .CW SaltPeter . .SH 2 Structural Regular Expressions .LP Unlike other UNIX text editors, including the non-interactive ones such as .CW sed and .CW awk ,\u\s-4\&7\s+4\d .CW sam is good for manipulating files with multi-line `records.' An example is an on-line phone book composed of records, separated by blank lines, of the form .P1 Herbert Tic 44 Turnip Ave., Endive, NJ 201-5555642 Norbert Twinge 16 Potato St., Cabbagetown, NJ 201-5553145 \&... .P2 The format may be encoded as a regular expression: .P1 (.+\en)+ .P2 that is, a sequence of one or more non-blank lines. The command to print Mr. Tic's entire record is then .P1 , x/(.+\en)+/ g/^Herbert Tic$/ p .P2 and that to extract just the phone number is .P1 , x/(.+\en)+/ g/^Herbert Tic$/ x/^[0-9]*-[0-9]*\en/ p .P2 The latter command breaks the file into records, chooses Mr. Tic's record, extracts the phone number from the record, and finally prints the number. .PP A more involved problem is that of renaming a particular variable, say .CW n , to .CW num in a C program. The obvious first attempt, .P1 , x/n/ c/num/ .P2 is badly flawed: it changes not only the variable .CW n but any letter .CW n that appears. We need to extract all the variables, and select those that match .CW n and only .CW n : .P1 , x/[A-Za-z_][A-Za-z_0-9]*/ g/n/ v/../ c/num/ .P2 The pattern .CW [A-Za-z_][A-Za-z_0-9]* matches C identifiers. Next .CW g/n/ selects those containing an .CW n . Then .CW v/../ rejects those containing two (or more) characters, and finally .CW c/num/ changes the remainder (identifiers .CW n ) to .CW num . This version clearly works much better, but there may still be problems. 
For example, in C character and string constants, the sequence .CW \en is interpreted as a newline character, and we don't want to change it to .CW \enum. This problem can be forestalled with a .CW y command: .P1 , y/\e\en/ x/[A-Za-z_][A-Za-z_0-9]*/ g/n/ v/../ c/num/ .P2 (the second .CW \e is necessary because of lexical conventions in regular expressions), or we could even reject character constants and strings outright: .P1 0 ,y/'[^']*'/ y/"[^"]*"/ x/[A-Za-z_][A-Za-z_0-9]*/ g/n/ v/../ c/num/ .P2 The .CW y commands in this version exclude from consideration all character constants and strings. The only remaining problem is to deal with the possible occurrence of .CW \e' or .CW \e" within these sequences, but it's easy to see how to resolve this difficulty. .PP The point of these composed commands is successive refinement. A simple version of the command is tried, and if it's not good enough, it can be honed by adding a clause or two. (Mistakes can be undone; see below. Also, the mouse language makes it unnecessary to retype the command each time.) The resulting chains of commands are somewhat reminiscent of shell pipelines.\u\s-4\&7\s+4\d Unlike pipelines, though, which pass along modified .I data , .CW sam commands pass a .I view of the data. The text at each step of the command is the same, but which pieces are selected is refined step by step until the correct piece is available to the final step of the command line, which ultimately makes the change. .PP In other UNIX programs, regular expressions are used only for selection, as in the .CW sam .CW g command, never for extraction as in the .CW x or .CW y command. For example, patterns in .CW awk \u\s-4\&7\s+4\d are used to select lines to be operated on, but cannot be used to describe the format of the input text, or to handle newline-free text. 
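.PP
The idea that each clause passes along a view of the data, with only the final step making a change, can be sketched in Python for the renaming example.
The helper names are hypothetical, and applying the changes from right to left is a convenience of this sketch, not a description of how
.CW sam
applies them.

```python
import re

IDENT = r"[A-Za-z_][A-Za-z_0-9]*"


def extract(text, dot, pattern):
    """x: ranges of the matches of pattern inside dot."""
    prev_end = None
    for m in re.compile(pattern).finditer(text, dot[0], dot[1]):
        if m.start() == m.end() == prev_end:
            continue          # null match adjacent to the previous match
        prev_end = m.end()
        yield m.span()


def between(text, dot, pattern):
    """y: ranges between adjacent matches of pattern inside dot."""
    start = dot[0]
    for m0, m1 in extract(text, dot, pattern):
        yield (start, m0)
        start = m1
    yield (start, dot[1])


def rename_n(source):
    """Model of  , y/\\\\n/ x/identifier/ g/n/ v/../ c/num/"""
    targets = []
    whole = (0, len(source))
    for piece in between(source, whole, r"\\n"):       # y: skip backslash-n
        for a, b in extract(source, piece, IDENT):     # x: each identifier
            word = source[a:b]
            if re.search("n", word) and not re.search("..", word):  # g, v
                targets.append((a, b))
    # The selection above never modified the text; only this step does.
    # Working from the right keeps the earlier offsets valid.
    for a, b in reversed(targets):                     # c/num/
        source = source[:a] + "num" + source[b:]
    return source
```

Every stage before the last loop produces ranges into the same unmodified string, which is the sense in which the clauses pass a view rather than data.
.PP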
The use of regular expressions to describe the structure of a piece of text rather than its contents, as in the .CW x command, has been given a name: .I structural regular expressions. .R When they are composed, as in the above example, they are pleasantly expressive. Their use is discussed at greater length elsewhere.\u\s-4\&10\s+4\d .PP .SH 2 Multiple files .LP .CW Sam has a few other commands, mostly relating to input and output. .P1 e discfilename .P2 replaces the contents and name of the current file with those of the named disc file; .P1 w discfilename .P2 writes the contents to the named disc file; and .P1 r discfilename .P2 replaces dot with the contents of the named disc file. All these commands use the current file's name if none is specified. Finally, .P1 f discfilename .P2 changes the name associated with the file and displays the result: .P1 \&'-. discfilename .P2 This output is called the file's .I menu line, .R because it is the contents of the file's line in the button 3 menu (described in the next section). The first three characters are a concise notation for the state of the file. The apostrophe signifies that the file is modified. The minus sign indicates the number of windows open on the file (see the next section): .CW - means none, .CW + means one, and .CW * means more than one. Finally, the period indicates that this is the current file. These characters are useful for controlling the .CW X command, described shortly. .PP .CW Sam may be started with a set of disc files (such as all the source for a program) by invoking it with a list of file names as arguments, and more may be added or deleted on demand. .P1 B discfile1 discfile2 ... .P2 adds the named files to .CW sam 's list, and .P1 D discfile1 discfile2 ... .P2 removes them from .CW sam 's memory (without effect on associated disc files). Both these commands have a syntax for using the shell\u\s-4\&7\s+4\d (the UNIX command interpreter) to generate the lists: .P1 B