  1. <html>
  2. <title>
  3. data
  4. </title>
  5. <body BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#330088" ALINK="#FF0044">
  6. <H1>Hello World
  7. <br>
  8. or
  9. <br>
  10. &#922;&#945;&#955;&#951;&#956;&#941;&#961;&#945; &#954;&#972;&#963;&#956;&#949;
  11. <br>
  12. or
  13. <br>
  14. &#12371;&#12435;&#12395;&#12385;&#12399; &#19990;&#30028;
  15. </H1>
  16. <DL><DD><I>Rob Pike<br>
  17. Ken Thompson<br>
  18. <br>&#32;<br>
  19. rob,ken@plan9.bell-labs.com<br>
  20. </I></DL>
  21. <DL><DD><H4>ABSTRACT</H4>
  22. <DL>
  23. <DT><DT>&#32;<DD>
  24. NOTE:<I> Originally appeared, in a slightly different form, in
  25. Proc. of the Winter 1993 USENIX Conf.,
  26. pp. 43-50,
  27. San Diego
  28. </I><DT>&#32;<DD></dl>
  29. <br>
  30. Plan 9 from Bell Labs has recently been converted from ASCII
  31. to an ASCII-compatible variant of the Unicode Standard, a 16-bit character set.
  32. In this paper we explain the reasons for the change,
  33. describe the character set and representation we chose,
  34. and present the programming models and software changes
  35. that support the new text format.
  36. Although we stopped short of full internationalization (for
  37. example, system error messages are in Unixese, not Japanese), we
  38. believe Plan 9 is the first system to treat the representation
  39. of all major languages on a uniform, equal footing throughout all its
  40. software.
  41. </DL>
  42. <H4>Introduction
  43. </H4>
  44. <P>
  45. The world is multilingual but most computer systems
  46. are based on English and ASCII.
  47. The first release of Plan 9 [Pike90], a new distributed operating
  48. system from Bell Laboratories, seemed a good occasion
  49. to correct this chauvinism.
  50. It is easier to make such deep changes when building new systems than
  51. by refitting old ones.
  52. </P>
  53. <P>
  54. The ANSI C standard [ANSIC] contains some guidance on the matter of
  55. `wide' and `multi-byte' characters but falls far short of
  56. solving the myriad associated problems.
  57. We could find no literature on how to convert a
  58. <I>system</I>
  59. to larger character sets, although some individual
  60. <I>programs</I>
  61. had been converted.
  62. This paper reports what we discovered as we
  63. explored the problem of representing multilingual
  64. text at all levels of an operating system,
  65. from the file system and kernel through
  66. the applications and up to the window system
  67. and display.
  68. </P>
  69. <P>
  70. Plan 9 has not been `internationalized':
  71. its manuals are in English,
  72. its error messages are in English,
  73. and it can display text that goes from left to right only.
  74. But before we can address these other problems,
  75. we need to handle, uniformly and comfortably,
  76. the textual representation of all the major written languages.
  77. That subproblem is richer than we had anticipated.
  78. </P>
  79. <H4>Standards
  80. </H4>
  81. <P>
  82. Our first step was to select a standard.
  83. At the time (January 1992),
  84. there were only two viable options:
  85. ISO 10646 [ISO10646] and Unicode [Unicode].
  86. The documents describing both proposals were still in the draft stage.
  87. </P>
  88. <P>
  89. The draft of ISO 10646 was not
  90. very attractive to us.
  91. It defined a sparse set of 32-bit characters,
  92. which would be
  93. hard to implement
  94. and have punitive storage requirements.
  95. Also, the draft attempted to
  96. mollify national interests by allocating
  97. 16-bit subspaces to national committees
  98. to partition individually.
  99. The suggested mode of use was to
  100. ``flip'' between separate national
  101. standards to implement the international standard.
  102. This did not strike us as a sound basis for a character set.
  103. As well, transmitting 32-bit values in a byte stream,
  104. such as in pipes, would be expensive and hard to implement.
  105. Since the standard does not define a byte order for such
  106. transmission, the byte stream would also have to carry
  107. state to enable the values to be recovered.
  108. </P>
  109. <P>
  110. The Unicode Standard is a proposal by a consortium of mostly American
  111. computer companies formed
  112. to protest the technical
  113. failings of ISO 10646.
  114. It defines a uniform 16-bit code based on the
  115. principle of unification:
  116. two characters are the same if they look the
  117. same even though they are from different
  118. languages.
  119. This principle, called Han unification,
  120. allows the large Japanese, Chinese, and Korean
  121. character sets to be packed comfortably into a 16-bit representation.
  122. </P>
  123. <P>
  124. We chose the Unicode Standard for its technical merits and because its
  125. code space was better defined.
  126. Moreover,
  127. the Unicode Consortium was derailing the
  128. ISO 10646 standard.
  129. (Now, in 1995,
  130. ISO 10646 is a standard
  131. with one 16-bit group defined,
  132. which is almost exactly the Unicode Standard.
  133. As most people expected, the two standards bodies
  134. reached a d&eacute;tente and
  135. ISO 10646 and Unicode represent the same character set.)
  136. </P>
  137. <P>
  138. The Unicode Standard defines an adequate character set
  139. but an unreasonable representation.
  140. It states that all characters
  141. are 16 bits wide and are communicated and stored in
  142. 16-bit units.
  143. It also reserves a pair of characters
  144. (hexadecimal FFFE and FEFF) to detect byte order
  145. in transmitted text, requiring state in the byte stream.
  146. (The Unicode Consortium was thinking of files, not pipes.)
  147. To adopt this encoding,
  148. we would have had to convert all text going
  149. into and out of Plan 9 between ASCII and Unicode, which cannot be done.
  150. Within a single program, in command of all its input and output,
  151. it is possible to define characters as 16-bit quantities;
  152. in the context of a networked system with
  153. hundreds of applications on diverse machines
  154. by different manufacturers,
  155. it is impossible.
  156. </P>
  157. <P>
  158. We needed a way to adapt the Unicode Standard to the tools-and-pipes
  159. model of text processing embodied by the Unix system.
  160. To do that, we
  161. needed an ASCII-compatible textual
  162. representation of Unicode characters for transmission
  163. and storage.
  164. In the draft ISO standard there was an informative
  165. (non-required)
  166. Annex
  167. called UTF
  168. that provided a byte stream encoding
  169. of the 32-bit ISO code.
  170. The encoding uses multibyte sequences composed
  171. from the 190 printable characters of Latin-1
  172. to represent character values larger
  173. than 159.
  174. </P>
  175. <P>
  176. The UTF encoding has several good properties.
  177. By far the most important is that
  178. a byte in the ASCII range 0-127 represents
  179. itself in UTF.
  180. Thus UTF is backward compatible with ASCII.
  181. </P>
  182. <P>
  183. UTF has other advantages.
  184. It is a byte encoding and is
  185. therefore byte-order independent.
  186. ASCII control characters appear in the byte stream
  187. only as themselves, never as an element of a sequence
  188. encoding another character,
  189. so newline bytes separate lines of UTF text.
  190. Finally, ANSI C's
  191. <TT>strcmp</TT>
  192. function applied to UTF strings preserves the ordering of Unicode characters.
  193. </P>
  194. <P>
  195. To encode and decode UTF is expensive (involving multiplication,
  196. division, and modulo operations) but workable.
  197. UTF's major disadvantage is that the encoding
  198. is not self-synchronizing.
  199. It is in general impossible to find the character
  200. boundaries in a UTF string without reading from
  201. the beginning of the string, although in practice
  202. control characters such as newlines,
  203. tabs, and blanks provide synchronization points.
  204. </P>
  205. <P>
  206. In August 1992,
  207. X-Open circulated a proposal for another UTF-like
  208. byte encoding of Unicode characters.
  209. Their major concern was that an embedded character
  210. in a file name
  211. (in particular a slash)
  212. could be part of an escape sequence in UTF and
  213. therefore confuse a traditional file system.
  214. Their proposal would allow all 7-bit ASCII characters
  215. to represent themselves
  216. <I>and only themselves</I>
  217. in text.
  218. Multibyte sequences would contain only characters
  219. with the high bit set.
  220. We proposed a modification to the new UTF that
  221. would address our synchronization problem.
  222. Our proposal, which was originally known informally as UTF-2 and FSS-UTF,
  223. is now referred to as UTF-8 and has been approved by ISO to become
  224. Annex P to ISO 10646.
  225. </P>
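<P>
The synchronization property can be illustrated with a short portable C sketch.
It assumes the UTF-8 bit layout as it was eventually standardized, in which
every continuation byte of a multibyte sequence has the form 10xxxxxx.
The name <TT>utf_sync</TT> is hypothetical and is not a Plan 9 library routine.
</P>

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of the Plan 9 library.
 * Assumes continuation bytes have the bit pattern 10xxxxxx.
 * Given a buffer and an arbitrary byte offset into it, back up to the
 * first byte of the character that contains that offset.
 */
size_t
utf_sync(const char *buf, size_t off)
{
	while(off > 0 && ((unsigned char)buf[off] & 0xC0) == 0x80)
		off--;
	return off;
}
```

<P>
Because a character occupies at most three bytes in this encoding, the loop
backs up at most two positions on valid input; no scan from the start of
the string is needed.
</P>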
  226. <P>
  227. The model for text in Plan 9 is chosen from these
  228. three standards*:
  229. </P>
  230. <DL>
  231. <DT><DT>&#32;<DD>
  232. NOTE:<I> * ``That's the nice thing about standards: there's so many to choose from.'' - Andy Tannenbaum (no, the other one)
  233. </I><DT>&#32;<DD></dl>
  234. <br>
  235. the Unicode character set encoded as a byte stream by
  236. UTF-8, from
  237. (soon to be) Annex P of ISO 10646.
  238. Although this mixture may seem like a precarious position for us to adopt,
  239. it is not as bad as it sounds.
  240. ISO 10646 and the Unicode Standard have converged,
  241. other systems such as Linux have adopted the same character set and encoding,
  242. and the general feeling seems to be that Unicode and UTF-8 will be accepted as the way
  243. to exchange text between systems.
  244. The prognosis for wide acceptance is good.
  245. <P>
  246. There are a couple of aspects of the Unicode Standard we have not faced.
  247. One is the issue of right-to-left text such as Hebrew or Arabic.
  248. Since that is an issue of display, not representation, we believe
  249. we can defer that problem for the moment without affecting our
  250. ability to solve it later.
  251. Another issue is diacriticals and `combining characters',
  252. which cause overstriking of multiple Unicode characters.
  253. Although necessary for some scripts, such as Thai, Arabic, and Hebrew,
  254. such characters confuse the issues for Latin languages because they
  255. generate multiple representations for accented characters.
  256. ISO 10646 describes three levels of implementation;
  257. in Plan 9 we decided not to address the issue.
  258. Again, this can be labeled as a display issue and its finer points are still being debated,
  259. so we felt comfortable deferring. Ma&ntilde;ana.
  260. </P>
  261. <P>
  262. Although we converted Plan 9 in the altruistic interests of
  263. serving foreign languages, we have found the large character
  264. set attractive for other reasons. The Unicode Standard includes many
  265. characters (mathematical symbols, scientific notation,
  266. more general punctuation, and more) that we now use
  267. daily in our work. We no longer test our imaginations
  268. to find ways to include non-ASCII symbols in our text;
  269. why type
  270. <TT>:-)</TT>
  271. when you can use the character &#9786;?
  272. Most compelling is the ability to absorb documents
  273. and data that contain non-ASCII characters; our browser for the
  274. Oxford English Dictionary
  275. lets us see the dictionary as it really is, with pronunciation
  276. in the IPA font, foreign phrases properly rendered, and so on,
  277. <I>in plain text.</I>
  278. </P>
  279. <P>
  280. In the rest of this paper, except when
  281. stated otherwise, the term `UTF' refers to the UTF-8 encoding
  282. of Unicode characters as adopted by Plan 9.
  283. </P>
  284. <H4>C Compiler
  285. </H4>
  286. <P>
  287. The first program to be converted to UTF
  288. was the C Compiler.
  289. There are two levels of conversion.
  290. On the syntactic level,
  291. input to the C compiler
  292. is UTF; on the semantic level,
  293. the C language needs to define
  294. how compiled programs manipulate
  295. the UTF set.
  296. </P>
  297. <P>
  298. The syntactic part is simple.
  299. The ANSI C language standard defines the
  300. source character set to be ASCII.
  301. Since UTF is backward compatible with ASCII,
  302. the compiler needs little change.
  303. The only places where a larger character set
  304. is allowed are in character constants, strings, and comments.
  305. Since 7-bit ASCII characters can represent only
  306. themselves in UTF,
  307. the compiler does not have to be careful while looking
  308. for the termination of a string or comment.
  309. </P>
  310. <P>
  311. The Plan 9 compiler extends ANSI C to treat any Unicode
  312. character with a value outside of the ASCII range as
  313. an alphabetic.
  314. To a Greek programmer or an English mathematician,
  315. &#945; is a sensible and now valid variable name.
  316. </P>
  317. <P>
  318. On the semantic level, ANSI C allows,
  319. but does not tie down,
  320. the notion of a
  321. <I>wide character</I>
  322. and admits string and character constants
  323. of this type.
  324. We chose the wide character type to be
  325. <TT>unsigned</TT>
  326. <TT>short</TT>.
  327. In the libraries, the word
  328. <TT>Rune</TT>
  329. is defined by a
  330. <TT>typedef</TT>
  331. to be equivalent to
  332. <TT>unsigned</TT>
  333. <TT>short</TT>
  334. and is
  335. used to signify a Unicode character.
  336. </P>
  337. <P>
  338. There are surprises; for example:
  339. <DL><DT><DD><TT><PRE>
  340. L'x' is 120
  341. 'x' is 120
  342. L'&yuml;' is 255
  343. '&yuml;' is -1, stdio EOF (if char is signed)
  344. L'&#945;' is 945
  345. '&#945;' is illegal
  346. </PRE></TT></DL>
  347. In the string constants,
  348. <DL><DT><DD><TT><PRE>
  349. "&#12371;&#12435;&#12395;&#12385;&#12399; &#19990;&#30028;"
  350. L"&#12371;&#12435;&#12395;&#12385;&#12399; &#19990;&#30028;",
  351. </PRE></TT></DL>
  352. the former is an array of
  353. <TT>chars</TT>
  354. with 22 elements
  355. and a null byte,
  356. while the latter is an array of
  357. <TT>unsigned</TT>
  358. <TT>shorts</TT>
  359. (<TT>Runes</TT>)
  360. with 8 elements and a null
  361. <TT>Rune</TT>.
  362. </P>
  363. <P>
  364. The Plan 9 library provides an output conversion function,
  365. <TT>print</TT>
  366. (analogous to
  367. <TT>printf</TT>),
  368. with formats
  369. <TT>%c</TT>,
  370. <TT>%C</TT>,
  371. <TT>%s</TT>,
  372. and
  373. <TT>%S</TT>.
  374. Since
  375. <TT>print</TT>
  376. produces text, its output is always UTF.
  377. The character conversion
  378. <TT>%c</TT>
  379. (lower case) masks its argument
  380. to 8 bits before converting to UTF.
  381. Thus
  382. <TT>L'&yuml;'</TT>
  383. and
  384. <TT>'&yuml;'</TT>
  385. printed under
  386. <TT>%c</TT>
  387. will be identical,
  388. but
  389. <TT>L'&#945;'</TT>
  390. will print as the Unicode
  391. character with decimal value 177.
  392. The character conversion
  393. <TT>%C</TT>
  394. (upper case) masks its argument
  395. to 16 bits before converting to UTF.
  396. Thus
  397. <TT>L'&yuml;'</TT>
  398. and
  399. <TT>L'&#945;'</TT>
  400. will print correctly under
  401. <TT>%C</TT>,
  402. but
  403. <TT>'&yuml;'</TT>
  404. will not.
  405. The conversion
  406. <TT>%s</TT>
  407. (lower case)
  408. expects a pointer to
  409. <TT>char</TT>
  410. and copies UTF sequences up to a null byte.
  411. The conversion
  412. <TT>%S</TT>
  413. (upper case) expects a pointer to
  414. <TT>Rune</TT>
  415. and
  416. performs sequential
  417. <TT>%C</TT>
  418. conversions until a null
  419. <TT>Rune</TT>
  420. is encountered.
  421. </P>
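<P>
The masking arithmetic behind these cases can be checked with a small
portable C sketch. The functions <TT>mask_c</TT> and <TT>mask_C</TT> are
hypothetical stand-ins for the masking step only; they are not how
<TT>print</TT> is implemented.
</P>

```c
/*
 * Hypothetical stand-ins for the masking step described for %c and %C.
 * They model only the arithmetic, not the conversion to UTF.
 */
unsigned int
mask_c(unsigned int arg)	/* %c: keep the low 8 bits */
{
	return arg & 0xFF;
}

unsigned int
mask_C(unsigned int arg)	/* %C: keep the low 16 bits */
{
	return arg & 0xFFFF;
}
```

<P>
With these definitions, 945 masked to 8 bits gives 177, and a signed
<TT>char</TT> value of -1 masked to 16 bits gives 65535 rather than 255,
which presumably is why that case does not print correctly under <TT>%C</TT>.
</P>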
  422. <P>
  423. Another problem in format conversion
  424. is the definition of
  425. <TT>%10s</TT>:
  426. does the number refer to bytes or characters?
  427. We decided that such formats were most
  428. often used to align output columns and
  429. so made the number count characters.
  430. Some programs, however, use the count
  431. to place blank-padded strings
  432. in fixed-sized arrays.
  433. These programs must be found and corrected.
  434. </P>
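<P>
The difference between counting bytes and counting characters can be
sketched in portable C. This is an illustration only: the names
<TT>count_chars</TT> and <TT>field_padding</TT> are hypothetical, and the
code assumes valid UTF-8 input in which continuation bytes have the form
10xxxxxx.
</P>

```c
#include <stddef.h>

/*
 * Hypothetical helper: count characters in a UTF-8 string by
 * skipping continuation bytes (10xxxxxx). Assumes valid input.
 */
size_t
count_chars(const char *s)
{
	size_t n = 0;

	for(; *s != '\0'; s++)
		if(((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}

/*
 * Hypothetical helper: number of blanks needed to pad s to a field
 * of `width` characters, counting characters rather than bytes.
 */
size_t
field_padding(const char *s, size_t width)
{
	size_t n = count_chars(s);

	return n < width ? width - n : 0;
}
```

<P>
A two-character Greek string occupies four bytes, so padding it to ten
columns takes eight blanks; a byte count would yield six and misalign the column.
</P>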
  435. <P>
  436. Here is a complete example:
  437. <DL><DT><DD><TT><PRE>
  438. #include &#60;u.h&#62;
  439. char c[] = "&#12371;&#12435;&#12395;&#12385;&#12399; &#19990;&#30028;";
  440. Rune s[] = L"&#12371;&#12435;&#12395;&#12385;&#12399; &#19990;&#30028;";
  441. main(void)
  442. {
  443. print("%d, %d\n", sizeof(c), sizeof(s));
  444. print("%s\n", c);
  445. print("%S\n", s);
  446. }
  447. </PRE></TT></DL>
  448. </P>
  449. <P>
  450. This program prints
  451. <TT>23,</TT>
  452. <TT>18</TT>
  453. and then two identical lines of
  454. UTF text.
  455. In practice,
  456. <TT>%S</TT>
  457. and
  458. <TT>L"..."</TT>
  459. are rare in programs; one reason is
  460. that most formatted I/O is done in unconverted UTF.
  461. </P>
  462. <H4>Ramifications
  463. </H4>
  464. <P>
  465. All programs in Plan 9 now read and write text as UTF, not ASCII.
  466. This change breaks two deep-rooted symmetries implicit in most C programs:
  467. </P>
  468. <DL COMPACT>
  469. <DT>1.<DD>
  470. A character is no longer a
  471. <TT>char</TT>.
  472. <DT>2.<DD>
  473. The internal representation (Rune) of a character now differs from its
  474. external representation (UTF).
  475. </dl>
  476. <P>
  477. In the sections that follow,
  478. we show how these issues were faced in the layers of
  479. system software from the operating system up to the applications.
  480. The effects are wide-reaching and often surprising.
  481. </P>
  482. <H4>Operating system
  483. </H4>
  484. <P>
  485. Since UTF is the only format for text in Plan 9,
  486. the interface to the operating system had to be converted to UTF.
  487. Text strings cross the interface in several places:
  488. command arguments,
  489. file names,
  490. user names (people can log in using their native name),
  491. error messages,
  492. and miscellaneous minor places such as commands to the I/O system.
  493. Little change was required: null-terminated UTF strings
  494. are equivalent to null-terminated ASCII strings for most purposes
  495. of the operating system.
  496. The library routines described in the next section made that
  497. change straightforward.
  498. </P>
  499. <P>
  500. The window system, once called
  501. <TT>8.5</TT>,
  502. is now rightfully called
  503. <TT>8&#189;</TT>.
  504. </P>
  505. <H4>Libraries
  506. </H4>
  507. <P>
  508. A header file included by all programs (see [Pike92]) declares
  509. the
  510. <TT>Rune</TT>
  511. type to hold 16-bit character values:
  512. <DL><DT><DD><TT><PRE>
  513. typedef unsigned short Rune;
  514. </PRE></TT></DL>
  515. Also defined are several constants relevant to UTF:
  516. <DL><DT><DD><TT><PRE>
  517. enum
  518. {
  519. UTFmax = 3, /* maximum bytes per rune */
  520. Runesync = 0x80, /* can't appear in UTF sequence (&#60;) */
  521. Runeself = 0x80, /* rune==UTF sequence (&#60;) */
  522. Runeerror = 0x80, /* decoding error in UTF */
  523. };
  524. </PRE></TT></DL>
  525. (With the original UTF,
  526. <TT>Runesync</TT>
  527. was hexadecimal 21 and
  528. <TT>Runeself</TT>
  529. was A0.)
  530. <TT>UTFmax</TT>
  531. bytes are sufficient
  532. to hold the UTF encoding of any Unicode character.
  533. Characters of value less than
  534. <TT>Runesync</TT>
  535. only appear in a UTF string as
  536. themselves, never as part of a sequence encoding another character.
  537. Characters of value less than
  538. <TT>Runeself</TT>
  539. encode into single bytes
  540. of the same value.
  541. Finally, when the library detects errors in UTF input (byte sequences
  542. that are not valid UTF sequences), it converts the first byte of the
  543. error sequence to the character
  544. <TT>Runeerror</TT>.
  545. There is little a rune-oriented program can do when given bad data
  546. except exit, which is unreasonable, or carry on.
  547. Originally the conversion routines, described below,
  548. returned errors when given invalid UTF,
  549. but we found ourselves repeatedly checking for errors and ignoring them.
  550. We therefore decided to convert a bad sequence to a valid rune
  551. and continue processing.
  552. (The ANSI C routines, on the other hand, return errors.)
  553. </P>
  554. <P>
  555. This technique does have the unfortunate property that converting
  556. invalid UTF byte strings in and out of runes does not preserve the input,
  557. but this circumstance only occurs when non-textual input is
  558. given to a textual program.
  559. The Unicode Standard defines an error character, value FFFD, to stand for
  560. characters from other sets that it does not represent.
  561. The
  562. <TT>Runeerror</TT>
  563. character is a different concept, related to the encoding rather than the character set, so we
  564. chose a different character for it.
  565. </P>
  566. <P>
  567. The Plan 9 C library contains a number of routines for
  568. manipulating runes.
  569. The first set converts between runes and UTF strings:
  570. <DL><DT><DD><TT><PRE>
  571. extern int runetochar(char*, Rune*);
  572. extern int chartorune(Rune*, char*);
  573. extern int runelen(long);
  574. extern int fullrune(char*, int);
  575. </PRE></TT></DL>
  576. <TT>Runetochar</TT>
  577. translates a single
  578. <TT>Rune</TT>
  579. to a UTF sequence and returns the number of bytes produced.
  580. <TT>Chartorune</TT>
  581. goes the other way, reporting how many bytes were consumed.
  582. <TT>Runelen</TT>
  583. returns the number of bytes in the UTF encoding of a rune.
  584. <TT>Fullrune</TT>
  585. examines a UTF string up to a specified number of bytes
  586. and reports whether the string begins with a complete UTF encoding.
  587. All these routines use the
  588. <TT>Runeerror</TT>
  589. character to work around encoding problems.
  590. </P>
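<P>
A minimal portable sketch of the two conversion directions follows.
It is not the Plan 9 source: the names <TT>xrunetochar</TT>,
<TT>xchartorune</TT>, and <TT>Xerror</TT> are hypothetical, the bit layout
is assumed to be that of standard UTF-8 for 16-bit values, and the input to
the decoder is assumed to be null-terminated.
</P>

```c
typedef unsigned short Rune;	/* 16-bit character value, as in the paper */

enum
{
	Xerror = 0x80	/* same value the paper gives for Runeerror */
};

/* Encode *rune into str (room for 3 bytes); return bytes produced. */
int
xrunetochar(char *str, const Rune *rune)
{
	unsigned long c = *rune;

	if(c < 0x80){			/* 0xxxxxxx */
		str[0] = (char)c;
		return 1;
	}
	if(c < 0x800){			/* 110xxxxx 10xxxxxx */
		str[0] = (char)(0xC0 | (c >> 6));
		str[1] = (char)(0x80 | (c & 0x3F));
		return 2;
	}
	/* 1110xxxx 10xxxxxx 10xxxxxx */
	str[0] = (char)(0xE0 | (c >> 12));
	str[1] = (char)(0x80 | ((c >> 6) & 0x3F));
	str[2] = (char)(0x80 | (c & 0x3F));
	return 3;
}

/* Decode one character from null-terminated str; return bytes consumed. */
int
xchartorune(Rune *rune, const char *str)
{
	const unsigned char *s = (const unsigned char*)str;
	unsigned long c;

	if(s[0] < 0x80){
		*rune = s[0];
		return 1;
	}
	if((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80){
		c = ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
		if(c >= 0x80){		/* reject overlong two-byte forms */
			*rune = (Rune)c;
			return 2;
		}
	}else if((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80
	      && (s[2] & 0xC0) == 0x80){
		c = ((unsigned long)(s[0] & 0x0F) << 12)
		  | ((unsigned long)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
		if(c >= 0x800){		/* reject overlong three-byte forms */
			*rune = (Rune)c;
			return 3;
		}
	}
	*rune = Xerror;		/* bad sequence: consume one byte and carry on */
	return 1;
}
```

<P>
The decoder never reports failure: an invalid sequence yields the error
value and consumes a single byte, so a caller can simply keep going.
</P>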
  591. <P>
  592. There is also a set of routines for examining null-terminated UTF strings,
  593. based on the model of the ANSI standard
  594. <TT>str</TT>
  595. routines, but with
  596. <TT>utf</TT>
  597. substituted for
  598. <TT>str</TT>
  599. and
  600. <TT>rune</TT>
  601. for
  602. <TT>chr</TT>:
  603. <DL><DT><DD><TT><PRE>
  604. extern int utflen(char*);
  605. extern char* utfrune(char*, long);
  606. extern char* utfrrune(char*, long);
  607. extern char* utfutf(char*, char*);
  608. </PRE></TT></DL>
  609. <TT>Utflen</TT>
  610. returns the number of runes in a UTF string;
  611. <TT>utfrune</TT>
  612. returns a pointer to the first occurrence of a rune in a UTF string;
  613. and
  614. <TT>utfrrune</TT>
  615. a pointer to the last.
  616. <TT>Utfutf</TT>
  617. searches for the first occurrence of a UTF string in another UTF string.
  618. Given the synchronizing property of UTF-8,
  619. <TT>utfutf</TT>
  620. is the same as
  621. <TT>strstr</TT>
  622. if the arguments point to valid UTF strings.
  623. </P>
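<P>
The claim that a byte-wise substring search is safe on valid UTF-8 can be
illustrated by a sketch of a rune search built on <TT>strstr</TT>.
The names <TT>encode</TT> and <TT>find_rune</TT> are hypothetical, and this
is not necessarily how <TT>utfrune</TT> is implemented; it only shows that
the approach works under the standard UTF-8 bit layout.
</P>

```c
#include <string.h>

typedef unsigned short Rune;	/* 16-bit character value, as in the paper */

/* Hypothetical: write the UTF-8 form of r into buf (4 bytes), null-terminated. */
static int
encode(char *buf, Rune r)
{
	int n = 0;

	if(r < 0x80)
		buf[n++] = (char)r;
	else if(r < 0x800){
		buf[n++] = (char)(0xC0 | (r >> 6));
		buf[n++] = (char)(0x80 | (r & 0x3F));
	}else{
		buf[n++] = (char)(0xE0 | (r >> 12));
		buf[n++] = (char)(0x80 | ((r >> 6) & 0x3F));
		buf[n++] = (char)(0x80 | (r & 0x3F));
	}
	buf[n] = '\0';
	return n;
}

/*
 * Hypothetical: pointer to the first occurrence of rune r in the
 * valid UTF-8 string s, or NULL if it does not occur.
 */
const char*
find_rune(const char *s, Rune r)
{
	char buf[4];

	if(r < 0x80)		/* ASCII bytes stand only for themselves */
		return strchr(s, r);
	encode(buf, r);
	return strstr(s, buf);	/* a match cannot start mid-character */
}
```

<P>
By contrast, calling <TT>strchr</TT> with a byte value of 0x80 or above can
return a pointer into the middle of a multibyte sequence.
</P>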
  624. <P>
  625. It is a mistake to use
  626. <TT>strchr</TT>
  627. or
  628. <TT>strrchr</TT>
  629. unless searching for a 7-bit ASCII character, that is, a character
  630. less than
  631. <TT>Runeself</TT>.
  632. </P>
  633. <P>
  634. We have no routines for manipulating null-terminated arrays of
  635. <TT>Runes</TT>.
  636. Although they should probably exist for completeness, we have
  637. found no need for them, for the same reason that
  638. <TT>%S</TT>
  639. and
  640. <TT>L"..."</TT>
  641. are rarely used.
  642. </P>
  643. <P>
  644. Most Plan 9 programs use a new buffered I/O library, BIO, in place of
  645. Standard I/O.
  646. BIO contains routines to read and write UTF streams, converting to and from
  647. runes.
  648. <TT>Bgetrune</TT>
  649. returns, as a
  650. <TT>Rune</TT>
  651. within a
  652. <TT>long</TT>,
  653. the next character in the UTF input stream;
  654. <TT>Bputrune</TT>
  655. takes a rune and writes its UTF representation.
  656. <TT>Bungetrune</TT>
  657. puts a rune back into the input stream for rereading.
  658. </P>
  659. <P>
  660. Plan 9 programs use a simple set of macros to process command line arguments.
  661. Converting these macros to UTF automatically updated the
  662. argument processing of most programs.
  663. In general,
  664. argument flag names can no longer be held in bytes and
  665. arrays of 256 bytes cannot be used to hold a set of flags.
  666. </P>
  667. <P>
  668. We have done nothing analogous to ANSI C's locales, partly because
  669. we do not feel qualified to define locales and partly because we remain
  670. unconvinced of that model for dealing with the problems.
  671. That is really more an issue of internationalization than conversion
  672. to a larger character set; on the other hand,
  673. because we have chosen a single character set that encompasses
  674. most languages, some of the need for
  675. locales is eliminated.
  676. (We have a utility,
  677. <TT>tcs</TT>,
  678. that translates between UTF and other character sets.)
  679. </P>
  680. <P>
  681. There are several reasons why our library does not follow the ANSI design
  682. for wide and multi-byte characters.
  683. The ANSI model was designed by a committee, untried, almost
  684. as an afterthought, whereas
  685. we wanted to design as we built.
  686. (We made several major changes to the interface
  687. as we became familiar with the problems involved.)
  688. We disagree with ANSI C's handling of invalid multi-byte sequences.
  689. Also, the ANSI C library is incomplete:
  690. although it contains some crucial routines for handling
  691. wide and multi-byte characters, there are some serious omissions.
  692. For example, our software can exploit
  693. the fact that UTF preserves ASCII characters in the byte stream.
  694. We could remove that assumption by replacing all
  695. calls to
  696. <TT>strchr</TT>
  697. with
  698. <TT>utfrune</TT>
  699. and so on.
  700. (Because of the weaker properties of the original UTF,
  701. we have actually done so.)
  702. ANSI C cannot:
  703. the standard says nothing about the representation, so portable code should
  704. <I>never</I>
  705. call
  706. <TT>strchr</TT>,
  707. yet there is no ANSI equivalent to
  708. <TT>utfrune</TT>.
  709. ANSI C simultaneously invalidates
  710. <TT>strchr</TT>
  711. and offers no replacement.
  712. </P>
  713. <P>
  714. Finally, ANSI did nothing to integrate wide characters
  715. into the I/O system: it gives no method for printing
  716. wide characters.
  717. We therefore needed to invent some things and decided to invent
  718. everything.
  719. In the end, some of our entry points do correspond closely to
  720. ANSI routines (for example
  721. <TT>chartorune</TT>
  722. and
  723. <TT>runetochar</TT>
  724. are similar to
  725. <TT>mbtowc</TT>
  726. and
  727. <TT>wctomb</TT>), but
  728. Plan 9's library defines more functionality, enough
  729. to write real applications comfortably.
  730. </P>
  731. <H4>Converting the tools
  732. </H4>
  733. <P>
  734. The source for our tools and applications had already been converted to
  735. work with Latin-1, so it was `8-bit safe', but the conversion to the Unicode
  736. Standard and UTF is more involved.
  737. Some programs needed no change at all:
  738. <TT>cat</TT>,
  739. for instance,
  740. interprets its argument strings, delivered in UTF,
  741. as file names that it passes uninterpreted to the
  742. <TT>open</TT>
  743. system call,
  744. and then just copies bytes from its input to its output;
  745. it never makes decisions based on the values of the bytes.
  746. (Plan 9
  747. <TT>cat</TT>
  748. has no options such as
  749. <TT>-v</TT>
  750. to complicate matters.)
  751. Most programs, however, needed modest change.
  752. </P>
  753. <P>
  754. It is difficult to
  755. find automatically the places that need attention,
  756. but
  757. <TT>grep</TT>
  758. helps.
  759. Software that uses the libraries conscientiously can be searched
  760. for calls to library routines that examine bytes as characters:
  761. <TT>strchr</TT>,
  762. <TT>strrchr</TT>,
  763. <TT>strstr</TT>,
  764. etc.
  765. Replacing these by calls to
  766. <TT>utfrune</TT>,
  767. <TT>utfrrune</TT>,
  768. and
  769. <TT>utfutf</TT>
  770. is enough to fix many programs.
  771. Few tools actually need to operate on runes internally;
  772. more typically they need only to look for the final slash in a file
  773. name and similar trivial tasks.
  774. Of the 170 C source programs in the top levels of
  775. <TT>/sys/src/cmd</TT>,
  776. only 23 now contain the word
  777. <TT>Rune</TT>.
  778. </P>
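<P>
The following is a minimal sketch, in portable C rather than Plan 9 C, of what a
<TT>utfrune</TT>-style
search involves. It assumes the new UTF with 16-bit runes (at most three bytes per character);
<TT>decode_rune</TT>,
<TT>find_rune</TT>,
and
<TT>BAD_RUNE</TT>
are hypothetical names chosen for illustration, not the library's source.
</P>

```c
#include <stddef.h>
#include <string.h>

typedef unsigned short Rune;      /* 16-bit character, as in the paper */

enum { BAD_RUNE = 0xFFFD };       /* assumption: value reported for malformed input */

/* Decode one rune starting at s; return the number of bytes consumed.
 * Malformed or overlong input yields BAD_RUNE and consumes one byte. */
static int decode_rune(Rune *r, const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    unsigned c = p[0], v;

    if (c < 0x80) {                                   /* 0xxxxxxx: ASCII */
        *r = (Rune)c;
        return 1;
    }
    if ((c & 0xE0) == 0xC0 && (p[1] & 0xC0) == 0x80) {
        v = ((c & 0x1F) << 6) | (p[1] & 0x3Fu);
        if (v >= 0x80) { *r = (Rune)v; return 2; }
    } else if ((c & 0xF0) == 0xE0 && (p[1] & 0xC0) == 0x80
               && (p[2] & 0xC0) == 0x80) {
        v = ((c & 0x0F) << 12) | ((p[1] & 0x3Fu) << 6) | (p[2] & 0x3Fu);
        if (v >= 0x800) { *r = (Rune)v; return 3; }
    }
    *r = BAD_RUNE;
    return 1;
}

/* Return a pointer to the first occurrence of rune want in the
 * NUL-terminated UTF string s, or NULL if it does not occur. */
const char *find_rune(const char *s, Rune want)
{
    Rune r;

    if (want < 0x80)      /* ASCII bytes never occur inside a longer sequence */
        return strchr(s, want);
    while (*s != '\0') {
        int n = decode_rune(&r, s);
        if (r == want)
            return s;
        s += n;
    }
    return NULL;
}
```

<P>
Under these assumptions, a search for an ASCII character such as the slash in a file name
reduces to a byte search; only non-ASCII targets require decoding the string.
</P>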
  779. <P>
  780. The programs that
  781. <I>do</I>
  782. store runes internally
  783. are mostly those whose
  784. <I>raison</I>
  785. <I>d'&ecirc;tre</I>
  786. is character manipulation:
  787. <TT>sam</TT>
  788. (the text editor),
  789. <TT>sed</TT>,
  790. <TT>sort</TT>,
  791. <TT>tr</TT>,
  792. <TT>troff</TT>,
  793. <TT>8&#189;</TT>
  794. (the window system and terminal emulator),
  795. and so on.
  796. To decide whether to compute using runes
  797. or UTF-encoded byte strings requires balancing the cost of converting
  798. the data when read and written
  799. against the cost of converting relevant text on demand.
  800. For programs such as editors that run a long time with a relatively
  801. constant dataset, runes are the better choice.
  802. There are space considerations too, but they are more complicated:
  803. plain ASCII text grows when converted to runes; UTF-encoded Japanese
  804. shrinks.
  805. </P>
  806. <P>
  807. Again, it is hard to automate the conversion of a program from
  808. <TT>chars</TT>
  809. to
  810. <TT>Runes</TT>.
  811. It is not enough just to change the type of variables; the assumption
  812. that bytes and characters are equivalent can be insidious.
For instance, clearing a character array with
  814. <DL><DT><DD><TT><PRE>
  815. memset(buf, 0, BUFSIZE)
  816. </PRE></TT></DL>
  817. becomes wrong if
  818. <TT>buf</TT>
  819. is changed from an array of
  820. <TT>chars</TT>
  821. to an array of
  822. <TT>Runes</TT>.
  823. Any program that indexes tables based on character values needs
  824. rethinking.
  825. Consider
  826. <TT>tr</TT>,
  827. which originally used multiple 256-byte arrays for the mapping.
  828. The na&iuml;ve conversion would yield multiple 65536-rune arrays.
  829. Instead Plan 9
  830. <TT>tr</TT>
  831. saves space by building in effect
  832. a run-encoded version of the map.
  833. </P>
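<P>
One way to picture a run-encoded map is sketched below. The layout is an assumption made for illustration
(the text does not give the representation Plan 9
<TT>tr</TT>
actually uses), and the names
<TT>Run</TT>,
<TT>map_rune</TT>,
and
<TT>upcase</TT>
are hypothetical.
</P>

```c
#include <stddef.h>

typedef unsigned short Rune;

/* One run: the runes lo..hi (inclusive) map, in order, to to..to+(hi-lo). */
typedef struct {
    Rune lo, hi;
    Rune to;
} Run;

/* Translate r through the run list; runes outside every run are unchanged. */
Rune map_rune(const Run *runs, size_t nruns, Rune r)
{
    size_t i;

    for (i = 0; i < nruns; i++)
        if (r >= runs[i].lo && r <= runs[i].hi)
            return (Rune)(runs[i].to + (r - runs[i].lo));
    return r;
}

/* Example map: lower case to upper case for ASCII and basic Greek. */
const Run upcase[] = {
    { 'a',    'z',    'A'    },
    { 0x03B1, 0x03C1, 0x0391 },   /* alpha..rho   -> ALPHA..RHO   */
    { 0x03C3, 0x03C9, 0x03A3 },   /* sigma..omega -> SIGMA..OMEGA */
};
```

<P>
Three runs describe this mapping in a handful of bytes, where a flat table indexed by character
value would need 65536 entries.
</P>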
  834. <P>
  835. <TT>Sort</TT>
  836. has related problems.
  837. The cooperation of UTF and
  838. <TT>strcmp</TT>
means that a simple sort (one with no options) can be done
  840. on the original UTF strings using
  841. <TT>strcmp</TT>.
  842. With sorting options enabled, however,
  843. <TT>sort</TT>
may need to convert its input to runes: for example,
option
<TT>-t&alpha;</TT>
requires searching for &alpha;s in the input text to
crack the input into fields.
  849. The field specifier
  850. <TT>+3.2</TT>
  851. refers to 2 runes beyond the third field.
  852. Some of the other options are hopelessly provincial:
  853. consider the case-folding and dictionary order options
  854. (Japanese doesn't even have an official dictionary order) or
  855. <TT>-M</TT>
  856. which compares by case-insensitive English month name.
  857. Handling these options involves the
  858. larger issues of internationalization and is beyond the scope
  859. of this paper and our expertise.
  860. Plan 9
  861. <TT>sort</TT>
  862. works sensibly with options that make sense relative to the input.
  863. The simple and most important options are, however, usually meaningful.
  864. In particular,
  865. <TT>sort</TT>
  866. sorts UTF into the same order that
  867. <TT>look</TT>
  868. expects.
  869. </P>
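<P>
The cooperation between UTF and
<TT>strcmp</TT>
can be checked with a small sketch: encode two runes and compare the resulting byte strings.
The encoder below assumes the new UTF with 16-bit runes;
<TT>encode_rune</TT>
and
<TT>utf_order</TT>
are hypothetical names, written in the spirit of
<TT>runetochar</TT>
rather than copied from it.
</P>

```c
#include <string.h>

typedef unsigned short Rune;

/* Write the UTF encoding of r (1 to 3 bytes) followed by a NUL into s,
 * which must have room for 4 bytes. Return the number of encoding bytes. */
int encode_rune(char *s, Rune r)
{
    unsigned char *p = (unsigned char *)s;
    int n;

    if (r < 0x80) {
        p[0] = (unsigned char)r;
        n = 1;
    } else if (r < 0x800) {
        p[0] = (unsigned char)(0xC0 | (r >> 6));
        p[1] = (unsigned char)(0x80 | (r & 0x3F));
        n = 2;
    } else {
        p[0] = (unsigned char)(0xE0 | (r >> 12));
        p[1] = (unsigned char)(0x80 | ((r >> 6) & 0x3F));
        p[2] = (unsigned char)(0x80 | (r & 0x3F));
        n = 3;
    }
    p[n] = '\0';
    return n;
}

/* Compare two runes by comparing their UTF encodings byte by byte.
 * The sign of the result matches the sign of (a - b). */
int utf_order(Rune a, Rune b)
{
    char sa[4], sb[4];

    encode_rune(sa, a);
    encode_rune(sb, b);
    return strcmp(sa, sb);
}
```

<P>
Because longer encodings begin with larger lead bytes, and the remaining bits are stored most
significant first, byte order and rune order agree; this is why an optionless sort need not
decode its input.
</P>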
  870. <P>
  871. Regular expression-matching algorithms need rethinking to
  872. be applied to UTF text.
  873. Deterministic automata are usually applied to bytes;
  874. converting them to operate on variable-sized byte sequences is awkward.
  875. On the other hand, converting the input stream to runes adds measurable
  876. expense
  877. and the state tables expand
  878. from size 256 to 65536; it can be expensive just to generate them.
  879. For simple string searching,
  880. the Boyer-Moore algorithm works with UTF provided the input is
  881. guaranteed to be only valid UTF strings; however, it does not work
  882. with the old UTF encoding.
  883. At a more mundane level, even character classes are harder:
  884. the usual bit-vector representation within a non-deterministic automaton
  885. is unwieldy with 65536 characters in the alphabet.
  886. </P>
  887. <P>
  888. We compromised.
  889. An existing library for compiling and executing regular expressions
  890. was adapted to work on runes, with two entry points for searching
  891. in arrays of runes and arrays of chars (the pattern is always UTF text).
  892. Character classes are represented internally as runs of runes;
  893. the reserved value
  894. <TT>FFFF</TT>
  895. marks the end of the class.
  896. Then
  897. <I>all</I>
utilities that use regular expressions (editors,
<TT>grep</TT>,
<TT>awk</TT>,
etc.), except the shell, whose notation
was grandfathered, were converted to use the library.
  903. For some programs, there was a concomitant loss of performance,
  904. but there was also a strong advantage.
  905. To our knowledge, Plan 9 is the only Unix-like system
  906. that has a single definition and implementation of
  907. regular expressions; patterns are written and interpreted
  908. identically by all the programs in the system.
  909. </P>
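<P>
A character class stored as runs of runes might look like the sketch below. The pairwise
(low, high) layout is an assumption for illustration; only the use of
<TT>FFFF</TT>
as the terminator comes from the description above, and
<TT>in_class</TT>,
<TT>CLASS_END</TT>,
and
<TT>word_class</TT>
are hypothetical names.
</P>

```c
typedef unsigned short Rune;

enum { CLASS_END = 0xFFFF };   /* reserved value marking the end of a class */

/* A class is a list of inclusive (lo, hi) pairs ended by CLASS_END.
 * Return 1 if r falls inside any pair, 0 otherwise. */
int in_class(const Rune *cls, Rune r)
{
    for (; cls[0] != CLASS_END; cls += 2)
        if (r >= cls[0] && r <= cls[1])
            return 1;
    return 0;
}

/* The class [a-z0-9_] extended with the Greek letters alpha..omega. */
const Rune word_class[] = {
    'a', 'z',
    '0', '9',
    '_', '_',
    0x03B1, 0x03C9,
    CLASS_END
};
```

<P>
The cost of a test grows with the number of runs in the class, not with the size of the
alphabet, which is the property a 65536-bit vector lacks.
</P>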
  910. <P>
  911. A handful of programs have the notion of character built into them
  912. so strongly as to confuse the issue of what they should do with UTF input.
  913. Such programs were treated as individual special cases.
  914. For example,
  915. <TT>wc</TT>
  916. is, by default, unchanged in behavior and output; a new option,
  917. <TT>-r</TT>,
counts the number of correctly encoded runes (valid UTF sequences) in
  919. its input;
  920. <TT>-b</TT>
  921. the number of invalid sequences.
  922. </P>
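<P>
A rough sketch of the counting behind
<TT>-r</TT>
and
<TT>-b</TT>
follows. It rests on two assumptions not stated above: overlong encodings are treated as
invalid, and each byte that fails to begin a well-formed sequence counts as one invalid
sequence. The names
<TT>seq_len</TT>
and
<TT>count_utf</TT>
are hypothetical.
</P>

```c
#include <stddef.h>

/* Length of the well-formed sequence at p (n >= 1 bytes available),
 * or 0 if the bytes there are malformed, overlong, or truncated. */
static size_t seq_len(const unsigned char *p, size_t n)
{
    if (p[0] < 0x80)
        return 1;
    if ((p[0] & 0xE0) == 0xC0)
        return (n >= 2 && p[0] >= 0xC2 && (p[1] & 0xC0) == 0x80) ? 2 : 0;
    if ((p[0] & 0xF0) == 0xE0)
        return (n >= 3 && (p[1] & 0xC0) == 0x80 && (p[2] & 0xC0) == 0x80
                && (p[0] > 0xE0 || p[1] >= 0xA0)) ? 3 : 0;
    return 0;   /* stray continuation byte or unsupported lead byte */
}

/* Count valid runes and invalid sequences in the n bytes at p. */
void count_utf(const unsigned char *p, size_t n, long *good, long *bad)
{
    size_t i = 0;

    *good = *bad = 0;
    while (i < n) {
        size_t len = seq_len(p + i, n - i);
        if (len == 0) {
            ++*bad;
            i++;            /* resynchronize at the next byte */
        } else {
            ++*good;
            i += len;
        }
    }
}
```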
  923. <P>
  924. It took us several months to convert all the software in the system
  925. to the Unicode Standard and the old UTF.
  926. When we decided to convert from that to the new UTF,
  927. only three things needed to be done.
  928. First, we rewrote the library routines to encode and decode the
  929. new UTF. This took an evening.
  930. Next, we converted all the files containing UTF
  931. to the new encoding.
  932. We wrote a trivial program to look for non-ASCII bytes in
  933. text files and used a Plan 9 program called
  934. <TT>tcs</TT>
  935. (translate character set) to change encodings.
  936. Finally, we recompiled all the system software;
  937. the library interface was unchanged, so recompilation was sufficient
  938. to effect the transformation.
The last two steps were done concurrently and took an afternoon.
  940. We concluded that the actual encoding is relatively unimportant to the
  941. software; the adoption of large characters and a byte-stream encoding
  942. <I>per</I>
  943. <I>se</I>
  944. are much deeper issues.
  945. </P>
  946. <H4>Graphics and fonts
  947. </H4>
  948. <P>
  949. Plan 9 provides only minimal support for plain text terminals.
  950. It is instead designed to be used with all character input and
  951. output mediated by a window system such as
  952. <TT>8&#189;</TT>.
  953. The window system and related software are responsible for the
  954. display of UTF text as Unicode character images.
  955. For plain text, the window system must provide a user-settable
  956. <I>font</I>
  957. that provides a (possibly empty) picture for each Unicode character.
  958. Fancier applications that use bold and Italic characters
  959. need multiple fonts storing multiple pictures for each
  960. Unicode value.
  961. All the issues are apparent, though,
  962. in just the problem of
  963. displaying a single image for each character, that is, the
  964. Unicode equivalent of a plain text terminal.
  965. With 128 or even 256 characters, a font can be just
  966. an array of bitmaps. With 65536 characters,
  967. a more sophisticated design is necessary. To store the ideographs
for just Japanese as 16&#215;16&#215;1 bit images,
  969. the smallest they can reasonably be, takes over a quarter of a
  970. megabyte. Make the images a little larger, store more bits per
  971. pixel, and hold a copy in every running application, and the
  972. memory cost becomes unreasonable.
  973. </P>
  974. <P>
  975. The structure of the bitmap graphics services is described at length elsewhere
  976. [Pike91].
  977. In summary, the memory holding the bitmaps is stored in the same machine that has
  978. the display, mouse, and keyboard: the terminal in Plan 9 terminology,
  979. the workstation in others'.
  980. Access to that memory and associated services is provided
  981. by device files served by system
  982. software on the terminal. One of those files,
  983. <TT>/dev/bitblt</TT>,
  984. interprets messages written upon it as requests for actions
  985. corresponding to entry points in the graphics library:
  986. allocate a bitmap, execute a raster operation, draw a text string, etc.
  987. The window system
  988. acts as a multiplexer that mediates access to the services
  989. and resources of the terminal by simulating in each client window
  990. a set of files mirroring those provided by the system.
  991. That is, each window has a distinct
  992. <TT>/dev/mouse</TT>,
  993. <TT>/dev/bitblt</TT>,
  994. and so on through which applications drive graphical
  995. input and output.
  996. </P>
  997. <P>
  998. One of the resources managed by
  999. <TT>8&#189;</TT>
  1000. and the terminal is the set of active
  1001. <I>subfonts.</I>
  1002. Each subfont holds the
  1003. bitmaps and associated data structures for a sequential set of Unicode
  1004. characters.
  1005. Subfonts are stored in files and loaded into the terminal by
  1006. <TT>8&#189;</TT>
  1007. or an application.
  1008. For example, one subfont
  1009. might hold the images of the first 256 characters of the Unicode space,
  1010. corresponding to the Latin-1 character set;
  1011. another might hold the standard phonetic character set, Unicode characters
  1012. with value 0250 to 02E9.
  1013. These files are collected in directories corresponding to typefaces:
  1014. <TT>/lib/font/bit/pelm</TT>
  1015. contains the Pellucida Monospace character set, with subfonts holding
  1016. the Latin-1, Greek, Cyrillic and other components of the typeface.
  1017. A suffix on subfont files encodes (in a subfont-specific
  1018. way) the size of the images:
  1019. <TT>/lib/font/bit/pelm/latin1.9</TT>
  1020. contains the Latin-1 Pellucida Monospace characters with lower
  1021. case letters 9 pixels high;
  1022. <TT>/lib/font/bit/jis/jis5400.16</TT>
  1023. contains 16-pixel high
  1024. ideographs starting at Unicode value 5400.
  1025. </P>
  1026. <P>
  1027. The subfonts do not identify which portion of the Unicode space
  1028. they cover. Instead, a
  1029. font file, in plain text,
  1030. describes how to assemble subfonts into a complete
  1031. character set.
  1032. The font file is presented as an argument to the window system
  1033. to determine how plain text is displayed in text windows and
  1034. applications.
  1035. Here is the beginning of the font file
  1036. <TT>/lib/font/bit/pelm/jis.9.font</TT>,
  1037. which describes the layout of a font covering that portion of
  1038. the Unicode Standard for which we have characters of typical
  1039. display size, using Japanese characters
  1040. to cover the Han space:
  1041. <DL><DT><DD><TT><PRE>
  1042. 18 14
  1043. 0x0000 0x00FF latin1.9
  1044. 0x0100 0x017E latineur.9
  1045. 0x0250 0x02E9 ipa.9
  1046. 0x0386 0x03F5 greek.9
  1047. 0x0400 0x0475 cyrillic.9
  1048. 0x2000 0x2044 ../misc/genpunc.9
  1049. 0x2070 0x208E supsub.9
  1050. 0x20A0 0x20AA currency.9
  1051. 0x2100 0x2138 ../misc/letterlike.9
  1052. 0x2190 0x21EA ../misc/arrows
  1053. 0x2200 0x227F ../misc/math1
  1054. 0x2280 0x22F1 ../misc/math2
  1055. 0x2300 0x232C ../misc/tech
  1056. 0x2500 0x257F ../misc/chart
  1057. 0x2600 0x266F ../misc/ding
  1058. </PRE></TT></DL>
  1059. <DL><DT><DD><TT><PRE>
  1060. 0x3000 0x303f ../jis/jis3000.16
  1061. 0x30a1 0x30fe ../jis/katakana.16
  1062. 0x3041 0x309e ../jis/hiragana.16
  1063. 0x4e00 0x4fff ../jis/jis4e00.16
  1064. 0x5000 0x51ff ../jis/jis5000.16
  1065. ...
  1066. </PRE></TT></DL>
  1067. The first two numbers set the interline spacing of the font (18
  1068. pixels) and the distance from the baseline to the top of the
  1069. line (14 pixels).
  1070. When characters are displayed, they are placed so as best
  1071. to fit within those constraints; characters
  1072. too large to fit will be truncated.
  1073. The rest of the file associates subfont files
  1074. with portions of Unicode space.
  1075. The first four such files are in the Pellucida Monospace typeface
  1076. and directory; others reside in other directories. The file names
  1077. are relative to the font file's own location.
  1078. </P>
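<P>
Reading such a font file takes little code. The sketch below parses font-file text already
held in memory and finds the subfont covering a given character. It is an illustration under
assumptions, not the graphics library's code:
<TT>Font</TT>,
<TT>Range</TT>,
<TT>parse_font</TT>,
<TT>subfont_for</TT>,
and the fixed limits are all hypothetical.
</P>

```c
#include <stdio.h>

enum { MAXSUB = 64, NAMELEN = 128 };   /* arbitrary limits for the sketch */

typedef struct {
    int lo, hi;               /* inclusive range of Unicode values */
    char name[NAMELEN];       /* subfont file, relative to the font file */
} Range;

typedef struct {
    int height, ascent;       /* interline spacing; baseline to top of line */
    int nrange;
    Range range[MAXSUB];
} Font;

/* Parse font-file text. Return 0 on success, -1 on a malformed file. */
int parse_font(const char *text, Font *f)
{
    int used;

    f->nrange = 0;
    if (sscanf(text, "%d %d%n", &f->height, &f->ascent, &used) != 2)
        return -1;
    text += used;
    while (f->nrange < MAXSUB) {
        Range *r = &f->range[f->nrange];
        /* %i accepts the 0x prefix used for the range bounds */
        if (sscanf(text, "%i %i %127s%n", &r->lo, &r->hi, r->name, &used) != 3)
            break;
        if (r->lo > r->hi)
            return -1;
        text += used;
        f->nrange++;
    }
    return 0;
}

/* Return the subfont file name covering rune, or NULL if none does. */
const char *subfont_for(const Font *f, int rune)
{
    int i;

    for (i = 0; i < f->nrange; i++)
        if (rune >= f->range[i].lo && rune <= f->range[i].hi)
            return f->range[i].name;
    return NULL;
}
```

<P>
A character with no covering range simply has no picture, which corresponds to the
(possibly empty) image mentioned earlier.
</P>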
  1079. <P>
  1080. There are several advantages to this two-level structure.
  1081. First, it simultaneously breaks the huge Unicode space into manageable
  1082. components and provides a unifying architecture for
  1083. assembling fonts from disjoint pieces.
  1084. Second, the structure promotes sharing.
  1085. For example, we have only one set of Japanese
  1086. characters but dozens of typefaces for the Latin-1 characters,
  1087. and this structure permits us to store only one copy of the
  1088. Japanese set but use it with any Roman typeface.
  1089. Also, customization is easy.
  1090. English-speaking users who don't need Japanese characters
  1091. but may want to read an on-line Oxford English Dictionary can
  1092. assemble a custom font with the
  1093. Latin-1 (or even just ASCII) characters and the International
  1094. Phonetic Alphabet (IPA).
  1095. Moreover, to do so requires just editing a plain text file,
  1096. not using a special font editing tool.
  1097. Finally, the structure guides the design of
  1098. caching protocols to improve performance and memory usage.
  1099. </P>
  1100. <P>
  1101. To load a complete Unicode character set into each application
  1102. would consume too
  1103. much memory and, particularly on slow terminal lines, would take
  1104. unreasonably long.
  1105. Instead, Plan 9 assembles a multi-level cache structure for
  1106. each font.
  1107. An application opens a font file, reads and parses it,
  1108. and allocates a data structure.
  1109. A message written to
  1110. <TT>/dev/bitblt</TT>
  1111. allocates an associated structure held in the terminal, in particular,
  1112. a bitmap to act as a cache
  1113. for recently used character images.
  1114. Other messages copy these images to bitmaps such as the screen
  1115. by loading characters from subfonts into the cache on demand and
  1116. from there to the destination bitmap.
  1117. The protocol to draw characters is in terms of cache indices,
  1118. not Unicode character number or UTF sequences.
  1119. These details are hidden from the application, which instead
  1120. sees only a subroutine to draw a string in a bitmap from a
  1121. given font, functions to discover character size information,
  1122. and routines to allocate and to free fonts.
  1123. </P>
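<P>
The idea of drawing by cache index can be sketched as follows. The replacement policy for
the character cache is not spelled out here, so least-recently-used is assumed; the tiny
cache size and the names
<TT>Cache</TT>,
<TT>cache_init</TT>,
and
<TT>cache_lookup</TT>
are hypothetical.
</P>

```c
#include <string.h>

typedef unsigned short Rune;

enum { NSLOT = 4 };   /* tiny on purpose, to make eviction easy to see */

typedef struct {
    Rune rune[NSLOT];          /* which character each slot holds */
    unsigned long age[NSLOT];  /* time of last use; 0 means empty */
    unsigned long clock;
    long misses;               /* number of loads from a subfont */
} Cache;

void cache_init(Cache *c)
{
    memset(c, 0, sizeof *c);
}

/* Return the cache index holding the image for r, loading it on a miss
 * into an empty slot or, failing that, the least recently used slot. */
int cache_lookup(Cache *c, Rune r)
{
    int i, victim = 0;

    c->clock++;
    for (i = 0; i < NSLOT; i++)
        if (c->age[i] != 0 && c->rune[i] == r) {
            c->age[i] = c->clock;      /* hit: refresh the slot */
            return i;
        }
    for (i = 1; i < NSLOT; i++)
        if (c->age[i] < c->age[victim])
            victim = i;
    /* A real implementation would copy the image from its subfont here. */
    c->rune[victim] = r;
    c->age[victim] = c->clock;
    c->misses++;
    return victim;
}
```

<P>
A string-drawing routine built on this would translate each rune to an index with
<TT>cache_lookup</TT>
and send only the indices to the terminal.
</P>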
  1124. <P>
  1125. As needed, whole
  1126. subfonts are opened by the graphics library, read, and then downloaded
  1127. to the terminal.
  1128. They are held open by the library in an LRU-replacement list.
  1129. Even when the program closes a subfont, it is retained
  1130. in the terminal for later use.
  1131. When the application opens the subfont, it asks the terminal
  1132. if it already has a copy to avoid reading it from the file
  1133. server if possible.
  1134. This level of cache has the property that the bitmaps for, say,
  1135. all the Japanese characters are stored only once, in the terminal;
  1136. the applications read only size and width information from the terminal
  1137. and share the images.
  1138. </P>
  1139. <P>
  1140. The sizes of the character and subfont caches held by the
  1141. application are adaptive.
  1142. A simple algorithm monitors the cache miss rate to enlarge and
  1143. shrink the caches as required.
The size of the character cache is limited to a maximum of 2048 images,
  1145. which in practice seems enough even for Japanese text.
  1146. For plain ASCII-like text it naturally stays around 128 images.
  1147. </P>
  1148. <P>
  1149. This mechanism sounds complicated but is implemented by only about
  1150. 500 lines in the library and considerably less in each of the
  1151. terminal's graphics driver and
  1152. <TT>8&#189;</TT>.
  1153. It has the advantage that only characters that are
  1154. being used are loaded into memory.
  1155. It is also efficient: if the characters being drawn
  1156. are in the cache the extra overhead is negligible.
  1157. It works particularly well for alphabetic character sets,
  1158. but also adapts on demand for ideographic sets.
  1159. When a user first looks at Japanese text, it takes a few
  1160. seconds to read all the font data, but thereafter the
text is drawn almost as fast as regular text (the images
are larger, so they draw a little more slowly).
  1163. Also, because the bitmaps are remembered by the terminal,
  1164. if a second application then looks at Japanese text
  1165. it starts faster than the first.
  1166. </P>
  1167. <P>
  1168. We considered
  1169. building a `font server'
  1170. to cache character images and associated data
  1171. for the applications, the window system, and the terminal.
We rejected this design because, although it isolates
many of the problems of font management in a separate program,
it didn't simplify the applications.
  1175. Moreover, in a distributed system such as Plan 9 it is easy
  1176. to have too many special purpose servers.
  1177. Making the management of the fonts the concern of only
  1178. the essential components simplifies the system and makes
  1179. bootstrapping less intricate.
  1180. </P>
  1181. <H4>Input
  1182. </H4>
  1183. <P>
  1184. A completely different problem is how to type Unicode characters
  1185. as input to the system.
  1186. We selected an unused key on our ASCII keyboards
  1187. to serve as a prefix for multi-keystroke
  1188. sequences that generate Unicode characters.
  1189. For example, the character
  1190. <TT>&uuml;</TT>
  1191. is generated by the prefix key
  1192. (typically
  1193. <TT>ALT</TT>
  1194. or
  1195. <TT>Compose</TT>)
  1196. followed by a double quote and a lower-case
  1197. <TT>u</TT>.
  1198. When that character is read by the application, from the file
  1199. <TT>/dev/cons</TT>,
  1200. it is of course presented as its UTF encoding.
  1201. Such sequences generate characters from an arbitrary set that
  1202. includes all of Latin-1 plus a selection of mathematical
  1203. and technical characters.
  1204. An arbitrary Unicode character may be generated by typing the prefix,
  1205. an upper case X, and four hexadecimal digits that identify
  1206. the Unicode value.
  1207. </P>
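<P>
The translation from keystrokes to runes can be sketched in a few lines. The two table
entries are the sequences mentioned in this section; the table layout and the names
<TT>compose</TT>,
<TT>hexval</TT>,
and
<TT>compose_rune</TT>
are assumptions for illustration, not the keyboard driver's code.
</P>

```c
#include <stddef.h>
#include <string.h>

typedef unsigned short Rune;

/* Keystrokes typed after the prefix key, and the rune they generate. */
static const struct {
    const char *keys;
    Rune r;
} compose[] = {
    { "\"u", 0x00FC },   /* double quote, u: u with diaeresis */
    { "12",  0x00BD },   /* 1, 2: one half */
};

/* Value of a hexadecimal digit, or -1 if c is not one. */
static int hexval(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* keys holds the keystrokes that followed the prefix key.
 * Return 1 and set *r if they form a complete sequence, else 0. */
int compose_rune(const char *keys, Rune *r)
{
    size_t i;

    if (keys[0] == 'X' && strlen(keys) == 5) {   /* X plus four hex digits */
        unsigned v = 0;
        for (i = 1; i <= 4; i++) {
            int d = hexval((unsigned char)keys[i]);
            if (d < 0)
                return 0;
            v = v * 16 + (unsigned)d;
        }
        *r = (Rune)v;
        return 1;
    }
    for (i = 0; i < sizeof compose / sizeof compose[0]; i++)
        if (strcmp(keys, compose[i].keys) == 0) {
            *r = compose[i].r;
            return 1;
        }
    return 0;
}
```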
  1208. <P>
  1209. These simple mechanisms are adequate for most of our day-to-day needs:
  1210. it's easy to remember to type `ALT 1 2' for &#189; or `ALT accent letter'
  1211. for accented Latin letters.
  1212. For the occasional unusual character, the cut and paste features of
  1213. <TT>8&#189;</TT>
  1214. serve well. A program called (perhaps misleadingly)
  1215. <TT>unicode</TT>
takes a hexadecimal value as its argument and prints the UTF representation of that character,
  1217. which may then be picked up with the mouse and used as input.
  1218. </P>
  1219. <P>
  1220. These methods
  1221. are clearly unsatisfactory when working in a non-English language.
  1222. In the native country of such a language
  1223. the appropriate keyboard is likely to be at hand.
But it's also reasonable, especially now that the system handles Unicode characters, to
  1225. work in a language foreign to the keyboard.
  1226. </P>
  1227. <P>
  1228. For alphabetic languages such as Greek or Russian, it is
  1229. straightforward to construct a program that does phonetic substitution,
so that, for example, typing a Latin `a' yields the Greek `&alpha;'.
  1231. Within Plan 9, such a program can be inserted transparently
  1232. between the real keyboard and a program such as the window system,
  1233. providing a manageable input device for such languages.
  1234. </P>
  1235. <P>
  1236. For ideographic languages such as Chinese or Japanese the problem is harder.
  1237. Native users of such languages have adopted methods for dealing with
  1238. Latin keyboards that involve a hybrid technique based on phonetics
  1239. to generate a list of possible symbols followed by menu selection to
  1240. choose the desired one.
  1241. Such methods can be
  1242. effective, but their design must be rooted in information about
  1243. the language unknown to non-native speakers.
  1244. (<TT>Cxterm</TT>,
  1245. a Chinese terminal emulator built by and for
  1246. Chinese programmers,
  1247. employs such a technique
  1248. [Pong and Zhang].)
  1249. Although the technical problem of implementing such a device
is easy in Plan 9 (it is just an elaboration of the technique for
alphabetic languages), our lack of familiarity with such languages
  1252. has restrained our enthusiasm for building one.
  1253. </P>
  1254. <P>
  1255. The input problem is technically the least interesting but perhaps
  1256. emotionally the most important of the problems of converting a system
  1257. to an international character set.
  1258. Beyond that remain the deeper problems of internationalization
  1259. such as multi-lingual error messages and command names,
  1260. problems we are not qualified to solve.
  1261. With the ability to treat text of most languages on an equal
  1262. footing, though, we can begin down that path.
  1263. Perhaps people in non-English speaking countries will
consider adopting Plan 9, solving the input problem locally (perhaps
just by plugging in their local terminals), and begin to use
  1266. a system with at least the capacity to be international.
  1267. </P>
  1268. <H4>Acknowledgements
  1269. </H4>
  1270. <P>
  1271. Dennis Ritchie provided consultation and encouragement.
  1272. Bob Flandrena converted most of the standard tools to UTF.
  1273. Brian Kernighan suffered cheerfully with several
  1274. inadequate implementations and converted
  1275. <TT>troff</TT>
  1276. to UTF.
  1277. Rich Drechsler converted his Postscript driver to UTF.
  1278. John Hobby built the Postscript &#191;.
  1279. We thank them all.
  1280. </P>
  1281. <H4>References
  1282. </H4>
  1283. <br>&#32;<br>
  1284. [ANSIC] <I>American National Standard for Information Systems -
  1285. Programming Language C</I>, American National Standards Institute, Inc.,
  1286. New York, 1990.
  1287. <br>&#32;<br>
  1288. [ISO10646]
  1289. ISO/IEC DIS 10646-1:1993
  1290. <I>Information technology -
Universal Multiple-Octet Coded Character Set (UCS) -
  1292. Part 1: Architecture and Basic Multilingual Plane</I>.
  1293. <br>&#32;<br>
  1294. [Pike90] R. Pike, D. Presotto, K. Thompson, H. Trickey,
  1295. ``Plan 9 from Bell Labs'',
  1296. UKUUG Proc. of the Summer 1990 Conf.,
  1297. London, England,
  1298. 1990.
  1299. <br>&#32;<br>
  1300. [Pike91] R. Pike, ``8&#189;, The Plan 9 Window System'', USENIX Summer
  1301. Conf. Proc., Nashville, 1991, reprinted in this volume.
  1302. <br>&#32;<br>
  1303. [Pike92] R. Pike, ``How to Use the Plan 9 C Compiler'', this volume.
  1304. <br>&#32;<br>
  1305. [Pong and Zhang] Man-Chi Pong and Yongguang Zhang, ``cxterm:
  1306. A Chinese Terminal Emulator for the X Window System'',
Software: Practice and Experience,
  1308. Vol 22(1), 809-926, October 1992.
  1309. <br>&#32;<br>
  1310. [Unicode]
  1311. <I>The Unicode Standard,
  1312. Worldwide Character Encoding,
  1313. Version 1.0, Volume 1</I>,
  1314. The Unicode Consortium,
  1315. Addison Wesley,
  1316. New York,
  1317. 1991.
  1318. <br>&#32;<br>
  1319. <A href=http://www.lucent.com/copyright.html>
  1320. Copyright</A> &#169; 2000 Lucent Technologies Inc. All rights reserved.
  1321. </body></html>