sexp 21 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539540541542543544545546547548549550551552553554555556557558559560561562563564565566567568569570571572573574575576577578579580581582583584585586587588589590591592593594595596597598599600601602603604605606607608609610611612613614615616617618619620621622623624625626627628629630631632633634635636637638639640641642643644645646647648649650651652653654655656657658659660661662663664665666667668669670671672673674675676677678679680681682683684685686687688689690691692693694695696697698699
  1. Network Working Group R. Rivest
  2. Internet Draft May 4, 1997
  3. Expires November 4, 1997
  4. S-Expressions
  5. draft-rivest-sexp-00.txt
  6. Status of this Memo
  7. Distribution of this memo is unlimited.
  8. This document is an Internet-Draft. Internet Drafts are working
  9. documents of the Internet Engineering Task Force (IETF), its Areas,
  10. and its Working Groups. Note that other groups may also distribute
  11. working documents as Internet Drafts.
  12. Internet Drafts are draft documents valid for a maximum of six
  13. months, and may be updated, replaced, or obsoleted by other documents
  14. at any time. It is not appropriate to use Internet Drafts as
  15. reference material, or to cite them other than as a ``working draft''
  16. or ``work in progress.''
  17. To learn the current status of any Internet-Draft, please check the
  18. ``1id-abstracts.txt'' listing contained in the internet-drafts Shadow
  19. Directories on: ftp.is.co.za (Africa), nic.nordu.net (Europe),
  20. ds.internic.net (US East Coast), ftp.isi.edu (US West Coast),
  21. or munnari.oz.au (Pacific Rim)
  22. Abstract
  23. This memo describes a data structure called "S-expressions" that are
  24. suitable for representing arbitrary complex data structures. We make
  25. precise the encodings of S-expressions: we give a "canonical form" for
  26. S-expressions, described two "transport" representations, and also
  27. describe an "advanced" format for display to people.
  28. 1. Introduction
  29. S-expressions are data structures for representing complex data. They
  30. are either byte-strings ("octet-strings") or lists of simpler
  31. S-expressions. Here is a sample S-expression:
  32. (snicker "abc" (#03# |YWJj|))
  33. It is a list of length three:
  34. -- the octet-string "snicker"
  35. -- the octet-string "abc"
  36. -- a sub-list containing two elements:
  37. - the hexadecimal constant #03#
  38. - the base-64 constant |YWJj| (which is the same as "abc")
  39. This note gives a specific proposal for constructing and utilizing
  40. S-expressions. The proposal is independent of any particular application.
  41. Here are the design goals for S-expressions:
  42. -- generality: S-expressions should be good at representing arbitrary
  43. data.
  44. -- readability: it should be easy for someone to examine and
  45. understand the structure of an S-expression.
  46. -- economy: S-expressions should represent data compactly.
  47. -- tranportability: S-expressions should be easy to transport
  48. over communication media (such as email) that are known to be
  49. less than perfect.
  50. -- flexibility: S-expressions should make it relatively simple to
  51. modify and extend data structures.
  52. -- canonicalization: it should be easy to produce a unique
  53. "canonical" form of an S-expression, for digital signature purposes.
  54. -- efficiency: S-expressions should admit in-memory representations
  55. that allow efficient processing.
  56. Section 2 gives an introduction to S-expressions.
  57. Section 3 discusses the character sets used.
  58. Section 4 presents the various representations of octet-strings.
  59. Section 5 describes how to represent lists.
  60. Section 6 discusses how S-expressions are represented for various uses.
  61. Section 7 gives a BNF syntax for S-expressions.
  62. Section 8 talks about how S-expressions might be represented in memory.
  63. Section 9 briefly describes implementations for handling S-expressions.
  64. Section 10 discusses how applications might utilize S-expressions.
  65. Section 11 gives historical notes on S-expressions.
  66. Section 12 gives references.
  67. 2. S-expressions -- informal introduction
  68. Informally, an S-expression is either:
  69. -- an octet-string, or
  70. -- a finite list of simpler S-expressions.
  71. An octet-string is a finite sequence of eight-bit octets. There may be
  72. many different but equivalent ways of representing an octet-string
  73. abc -- as a token
  74. "abc" -- as a quoted string
  75. #616263# -- as a hexadecimal string
  76. 3:abc -- as a length-prefixed "verbatim" encoding
  77. {MzphYmM=} -- as a base-64 encoding of the verbatim encoding
  78. (that is, an encoding of "3:abc")
  79. |YWJj| -- as a base-64 encoding of the octet-string "abc"
  80. These encodings are all equivalent; they all denote the same octet string.
  81. We will give details of these encodings later on, and also describe how to
  82. give a "display type" to a byte string.
  83. A list is a finite sequence of zero or more simpler S-expressions. A list
  84. may be represented by using parentheses to surround the sequence of encodings
  85. of its elements, as in:
  86. (abc (de #6667#) "ghi jkl")
  87. As we see, there is variability possible in the encoding of an
  88. S-expression. In some cases, it is desirable to standardize or
  89. restrict the encodings; in other cases it is desirable to have no
  90. restrictions. The following are the target cases we aim to handle:
  91. -- a "transport" encoding for transporting the S-expression between
  92. computers.
  93. -- a "canonical" encoding, used when signing the S-expression.
  94. -- an "advanced" encoding used for input/output to people.
  95. -- an "in-memory" encoding used for processing the S-expression in
  96. the computer.
  97. These need not be different; in this proposal the canonical encoding
  98. is the same as the transport encoding, for example. In this note we
  99. propose (related) encoding techniques for each of these uses.
  100. 3. Character set
  101. We will be describing encodings of S-expressions. Except when giving
  102. "verbatim" encodings, the character set used is limited to the following
  103. characters in US-ASCII:
  104. Alphabetic: A B ... Z a b ... z
  105. numeric: 0 1 ... 9
  106. whitespace: space, horizontal tab, vertical tab, form-feed
  107. carriage-return, line-feed
  108. The following graphics characters, which we call "pseudo-alphabetic":
  109. - hyphen or minus
  110. . period
  111. / slash
  112. _ underscore
  113. : colon
  114. * asterisk
  115. + plus
  116. = equal
  117. The following graphics characters, which are "reserved punctuation":
  118. ( left parenthesis
  119. ) right parenthesis
  120. [ left bracket
  121. ] right bracket
  122. { left brace
  123. } right brace
  124. | vertical bar
  125. # number sign
  126. " double quote
  127. & ampersand
  128. \ backslash
  129. The following characters are unused and unavailable, except in
  130. "verbatim" encodings:
  131. ! exclamation point
  132. % percent
  133. ^ circumflex
  134. ~ tilde
  135. ; semicolon
  136. ' apostrophe
  137. , comma
  138. < less than
  139. > greater than
  140. ? question mark
  141. 4. Octet string representations
  142. This section describes in detail the ways in which an octet-string may
  143. be represented.
  144. We recall that an octet-string is any finite sequence of octets, and
  145. that the octet-string may have length zero.
  146. 4.1 Verbatim representation
  147. A verbatim encoding of an octet string consists of four parts:
  148. -- the length (number of octets) of the octet-string,
  149. given in decimal most significant digit first, with
  150. no leading zeros.
  151. -- a colon ":"
  152. -- the octet string itself, verbatim.
  153. There are no blanks or whitespace separating the parts. No "escape
  154. sequences" are interpreted in the octet string. This encoding is also
  155. called a "binary" or "raw" encoding.
  156. Here are some sample verbatim encodings:
  157. 3:abc
  158. 7:subject
  159. 4:::::
  160. 12:hello world!
  161. 10:abcdefghij
  162. 0:
  163. 4.2 Quoted-string representation
  164. The quoted-string representation of an octet-string consists of:
  165. -- an optional decimal length field
  166. -- an initial double-quote (")
  167. -- the octet string with "C" escape conventions (\n,etc)
  168. -- a final double-quote (")
  169. The specified length is the length of the resulting string after any
  170. escape sequences have been handled. The string does not have any
  171. "terminating NULL" that C includes, and the length does not count such
  172. a character.
  173. The length is optional.
  174. The escape conventions within the quoted string are as follows (these follow
  175. the "C" programming language conventions, with an extension for
  176. ignoring line terminators of just LF or CRLF):
  177. \b -- backspace
  178. \t -- horizontal tab
  179. \v -- vertical tab
  180. \n -- new-line
  181. \f -- form-feed
  182. \r -- carriage-return
  183. \" -- double-quote
  184. \' -- single-quote
  185. \\ -- back-slash
  186. \ooo -- character with octal value ooo (all three digits
  187. must be present)
  188. \xhh -- character with hexadecimal value hh (both digits
  189. must be present)
  190. \<carriage-return> -- causes carriage-return to be ignored.
  191. \<line-feed> -- causes linefeed to be ignored
  192. \<carriage-return><line-feed> -- causes CRLF to be ignored.
  193. \<line-feed><carriage-return> -- causes LFCR to be ignored.
  194. Here are some examples of quoted-string encodings:
  195. "subject"
  196. "hi there"
  197. 7"subject"
  198. 3"\n\n\n"
  199. "This has\n two lines."
  200. "This has\
  201. one."
  202. ""
  203. 4.3 Token representation
  204. An octet string that meets the following conditions may be given
  205. directly as a "token".
  206. -- it does not begin with a digit
  207. -- it contains only characters that are
  208. -- alphabetic (upper or lower case),
  209. -- numeric, or
  210. -- one of the eight "pseudo-alphabetic" punctuation marks:
  211. - . / _ : * + =
  212. (Note: upper and lower case are not equivalent.)
  213. (Note: A token may begin with punctuation, including ":").
  214. Here are some examples of token representations:
  215. subject
  216. not-before
  217. class-of-1997
  218. //microsoft.com/names/smith
  219. *
  220. 4.4 Hexadecimal representation
  221. An octet-string may be represented with a hexadecimal encoding consisting of:
  222. -- an (optional) decimal length of the octet string
  223. -- a sharp-sign "#"
  224. -- a hexadecimal encoding of the octet string, with each octet
  225. represented with two hexadecimal digits, most significant
  226. digit first.
  227. -- a sharp-sign "#"
  228. There may be whitespace inserted in the midst of the hexadecimal
  229. encoding arbitrarily; it is ignored. It is an error to have
  230. characters other than whitespace and hexadecimal digits.
  231. Here are some examples of hexadecimal encodings:
  232. #616263# -- represents "abc"
  233. 3#616263# -- also represents "abc"
  234. # 616
  235. 263 # -- also represents "abc"
  236. 4.5 Base-64 representation
  237. An octet-string may be represented in a base-64 coding consisting of:
  238. -- an (optional) decimal length of the octet string
  239. -- a vertical bar "|"
  240. -- the rfc 1521 base-64 encoding of the octet string.
  241. -- a final vertical bar "|"
  242. The base-64 encoding uses only the characters
  243. A-Z a-z 0-9 + / =
  244. It produces four characters of output for each three octets of input.
  245. If the input has one or two left-over octets of input, it produces an
  246. output block of length four ending in two or one equals signs, respectively.
  247. Output routines compliant with this standard MUST output the equals signs
  248. as specified. Input routines MAY accept inputs where the equals signs are
  249. dropped.
  250. There may be whitespace inserted in the midst of the base-64 encoding
  251. arbitrarily; it is ignored. It is an error to have characters other
  252. than whitespace and base-64 characters.
  253. Here are some examples of base-64 encodings:
  254. |YWJj| -- represents "abc"
  255. | Y W
  256. J j | -- also represents "abc"
  257. 3|YWJj| -- also represents "abc"
  258. |YWJjZA==| -- represents "abcd"
  259. |YWJjZA| -- also represents "abcd"
  260. 4.6 Display hint
  261. Any octet string may be preceded by a single "display hint".
  262. The purposes of the display hint is to provide information on how
  263. to display the octet string to a user. It has no other function.
  264. Many of the MIME types work here.
  265. A display-hint is an octet string surrounded by square brackets.
  266. There may be whitespace separating the octet string from the
  267. surrounding brackets. Any of the legal formats may be used for the
  268. octet string.
  269. Here are some examples of display-hints:
  270. [image/gif]
  271. [URI]
  272. [charset=unicode-1-1]
  273. [text/richtext]
  274. [application/postscript]
  275. [audio/basic]
  276. ["http://abc.com/display-types/funky.html"]
  277. In applications an octet-string that is untyped may be considered to have
  278. a pre-specified "default" mime type. The mime type
  279. "text/plain; charset=iso-8859-1"
  280. is the standard default.
  281. 4.7 Equality of octet-strings
  282. Two octet strings are considered to be "equal" if and only if they
  283. have the same display hint and the same data octet strings.
  284. Note that octet-strings are "case-sensitive"; the octet-string "abc"
  285. is not equal to the octet-string "ABC".
  286. An untyped octet-string can be compared to another octet-string (typed
  287. or not) by considering it as a typed octet-string with the default
  288. mime-type.
  289. 5. Lists
  290. Just as with octet-strings, there are several ways to represent an
  291. S-expression. Whitespace may be used to separate list elements, but
  292. they are only required to separate two octet strings when otherwise
  293. the two octet strings might be interpreted as one, as when one token
  294. follows another. Also, whitespace may follow the initial left
  295. parenthesis, or precede the final right parenthesis.
  296. Here are some examples of encodings of lists:
  297. (a b c)
  298. ( a ( b c ) ( ( d e ) ( e f ) ) )
  299. (11:certificate(6:issuer3:bob)(7:subject5:alice))
  300. ({3Rt=} "1997" murphy 3:{XC++})
  301. 6. Representation types
  302. There are three "types" of representations:
  303. -- canonical
  304. -- basic transport
  305. -- advanced transport
  306. The first two MUST be supported by any implementation; the last is
  307. optional.
  308. 6.1 Canonical representation
  309. This canonical representation is used for digital signature purposes,
  310. transmission, etc. It is uniquely defined for each S-expression. It
  311. is not particularly readable, but that is not the point. It is
  312. intended to be very easy to parse, to be reasonably economical, and to
  313. be unique for any S-expression.
  314. The "canonical" form of an S-expression represents each octet-string
  315. in verbatim mode, and represents each list with no blanks separating
  316. elements from each other or from the surrounding parentheses.
  317. Here are some examples of canonical representations of S-expressions:
  318. (6:issuer3:bob)
  319. (4:icon[12:image/bitmap]9:xxxxxxxxx)
  320. (7:subject(3:ref5:alice6:mother))
  321. 6.2 Basic transport representation
  322. There are two forms of the "basic transport" representation:
  323. -- the canonical representation
  324. -- an rfc-2045 base-64 representation of the canonical representation,
  325. surrounded by braces.
  326. The transport mechanism is intended to provide a universal means of
  327. representing S-expressions for transport from one machine to another.
  328. Here are some examples of an S-expression represented in basic
  329. transport mode:
  330. (1:a1:b1:c)
  331. {KDE6YTE6YjE6YykA}
  332. (this is the same S-expression encoded in base-64)
  333. There is a difference between the brace notation for base-64 used here
  334. and the || notation for base-64'd octet-strings described above. Here
  335. the base-64 contents are converted to octets, and then re-scanned as
  336. if they were given originally as octets. With the || notation, the
  337. contents are just turned into an octet-string.
  338. 6.3 Advanced transport representation
  339. The "advanced transport" representation is intended to provide more
  340. flexible and readable notations for documentation, design, debugging,
  341. and (in some cases) user interface.
  342. The advanced transport representation allows all of the representation
  343. forms described above, include quoted strings, base-64 and hexadecimal
  344. representation of strings, tokens, representations of strings with
  345. omitted lengths, and so on.
  346. 7. BNF for syntax
  347. We give separate BNF's for canonical and advanced forms of S-expressions.
  348. We use the following notation:
  349. <x>* means 0 or more occurrences of <x>
  350. <x>+ means 1 or more occurrences of <x>
  351. <x>? means 0 or 1 occurrences of <x>
  352. parentheses are used for grouping, as in (<x> | <y>)*
  353. For canonical and basic transport:
  354. <sexpr> :: <string> | <list>
  355. <string> :: <display>? <simple-string> ;
  356. <simple-string> :: <raw> ;
  357. <display> :: "[" <simple-string> "]" ;
  358. <raw> :: <decimal> ":" <bytes> ;
  359. <decimal> :: <decimal-digit>+ ;
  360. -- decimal numbers should have no unnecessary leading zeros
  361. <bytes> -- any string of bytes, of the indicated length
  362. <list> :: "(" <sexp>* ")" ;
  363. <decimal-digit> :: "0" | ... | "9" ;
  364. For advanced transport:
  365. <sexpr> :: <string> | <list>
  366. <string> :: <display>? <simple-string> ;
  367. <simple-string> :: <raw> | <token> | <base-64> | <hexadecimal> |
  368. <quoted-string> ;
  369. <display> :: "[" <simple-string> "]" ;
  370. <raw> :: <decimal> ":" <bytes> ;
  371. <decimal> :: <decimal-digit>+ ;
  372. -- decimal numbers should have no unnecessary leading zeros
  373. <bytes> -- any string of bytes, of the indicated length
  374. <token> :: <tokenchar>+ ;
  375. <base-64> :: <decimal>? "|" ( <base-64-char> | <whitespace> )* "|" ;
  376. <hexadecimal> :: "#" ( <hex-digit> | <white-space> )* "#" ;
  377. <quoted-string> :: <decimal>? <quoted-string-body>
  378. <quoted-string-body> :: "\"" <bytes> "\""
  379. <list> :: "(" ( <sexp> | <whitespace> )* ")" ;
  380. <whitespace> :: <whitespace-char>* ;
  381. <token-char> :: <alpha> | <decimal-digit> | <simple-punc> ;
  382. <alpha> :: <upper-case> | <lower-case> | <digit> ;
  383. <lower-case> :: "a" | ... | "z" ;
  384. <upper-case> :: "A" | ... | "Z" ;
  385. <decimal-digit> :: "0" | ... | "9" ;
  386. <hex-digit> :: <decimal-digit> | "A" | ... | "F" | "a" | ... | "f" ;
  387. <simple-punc> :: "-" | "." | "/" | "_" | ":" | "*" | "+" | "=" ;
  388. <whitespace-char> :: " " | "\t" | "\r" | "\n" ;
  389. <base-64-char> :: <alpha> | <decimal-digit> | "+" | "/" | "=" ;
  390. <null> :: "" ;
  391. 8. In-memory representations
  392. For processing, the S-expression would typically be parsed and represented
  393. in memory in a more more amenable to efficient processing. We suggest
  394. two alternatives:
  395. -- "list-structure"
  396. -- "array-layout"
  397. We only sketch these here, as they are only suggestive. The code referenced
  398. below illustrates these styles in more detail.
  399. 8.1. List-structure memory representation
  400. Here there are separate records for simple-strings, strings, and
  401. lists. An S-expression of the form ("abc" "de") would require two
  402. records for the simple strings, two for the strings, and two for the
  403. list elements. This is a fairly conventional representation, and
  404. details are omitted here.
  405. 8.2 Array-layout memory representation
  406. Here each S-expression is represented as a contiguous array of bytes.
  407. The first byte codes the "type" of the S-expression:
  408. 01 octet-string
  409. 02 octet-string with display-hint
  410. 03 beginning of list (and 00 is used for "end of list")
  411. Each of the three types is immediately followed by a k-byte integer
  412. indicating the size (in bytes) of the following representation. Here
  413. k is an integer that depends on the implementation, it might be
  414. anywhere from 2 to 8, but would be fixed for a given implementation;
  415. it determines the size of the objects that can be handled. The transport
  416. and canonical representations are independent of the choice of k made by
  417. the implementation.
  418. Although the length of lists are not given in the usual S-expression
  419. notations, it is easy to fill them in when parsing; when you reach a
  420. right-parenthesis you know how long the list representation was, and
  421. where to go back to fill in the missing length.
  422. 8.2.1 Octet string
  423. This is represented as follows:
  424. 01 <length> <octet-string>
  425. For example (here k = 2)
  426. 01 0003 a b c
  427. 8.2.2 Octet-string with display-hint
  428. This is represented as follows:
  429. 02 <length>
  430. 01 <length> <octet-string> /* for display-type */
  431. 01 <length> <octet-string> /* for octet-string */
  432. For example, the S-expression
  433. [gif] #61626364#
  434. would be represented as (with k = 2)
  435. 02 000d
  436. 01 0003 g i f
  437. 01 0004 61 62 63 64
  438. 8.2.3 List
  439. This is represented as
  440. 03 <length> <item1> <item2> <item3> ... <itemn> 00
  441. For example, the list (abc [d]ef (g)) is represented in memory as (with k=2)
  442. 03 001b
  443. 01 0003 a b c
  444. 02 0009
  445. 01 0001 d
  446. 01 0002 e f
  447. 03 0005
  448. 01 0001 g
  449. 00
  450. 00
  451. 9. Code
  452. There is code available for reading and parsing the various
  453. S-expression formats proposed here.
  454. See http://theory.lcs.mit.edu/~rivest/sexp.html
  455. 10. Utilization of S-expressions
  456. This note has described S-expressions in general form. Application writers
  457. may wish to restrict their use of S-expressions in various ways. Here are
  458. some possible restrictions that might be considered:
  459. -- no display-hints
  460. -- no lengths on hexadecimal, quoted-strings, or base-64 encodings
  461. -- no empty lists
  462. -- no empty octet-strings
  463. -- no lists having another list as its first element
  464. -- no base-64 or hexadecimal encodings
  465. -- fixed limits on the size of octet-strings
  466. 11. Historical note
  467. The S-expression technology described here was originally developed
  468. for ``SDSI'' (the Simple Distributed Security Infrastructure by
  469. Lampson and Rivest [SDSI]) in 1996, although the origins clearly date
  470. back to McCarthy's LISP programming language. It was further refined
  471. and improved during the merger of SDSI and SPKI [SPKI] during the
  472. first half of 1997. S-expressions are similar to, but more readable
  473. and flexible than, Bernstein's "net-strings" [BERN].
  474. 12. References
  475. [SDSI] "A Simple Distributed Security Architecture", by
  476. Butler Lampson, and Ronald L. Rivest
  477. http://theory.lcs.mit.edu/~cis/sdsi.html
  478. [SPKI] <a href="http://www.clark.net/pub/cme/html/spki.html">SPKI--A
  479. Simple Public Key Infrastructure</a>
  480. [BERN] Dan Bernstein's "net-strings"; Internet Draft
  481. draft-bernstein-netstrings-02.txt
  482. Author's Address
  483. Ronald L. Rivest
  484. Room 324, 545 Technology Square
  485. MIT Laboratory for Computer Science
  486. Cambridge, MA 02139
  487. rivest@theory.lcs.mit.edu