1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677787980818283848586878889909192939495969798 |
- .TH UTF 6
- .SH NAME
- UTF, Unicode, ASCII, rune \- character set and format
- .SH DESCRIPTION
- The Plan 9 character set and representation are
- based on the Unicode Standard and on the ISO multibyte
- .SM UTF-8
- encoding (Universal Character
- Set Transformation Format, 8 bits wide).
- The Unicode Standard represents its characters in 16
- bits;
- .SM UTF-8
- represents such
- values in an 8-bit byte stream.
- Throughout this manual,
- .SM UTF-8
- is shortened to
- .SM UTF.
- .PP
- In Plan 9, a
- .I rune
- is a 16-bit quantity representing a Unicode character.
- Internally, programs may store characters as runes.
- However, any external manifestation of textual information,
- in files or at the interface between programs, uses a
- machine-independent, byte-stream encoding called
- .SM UTF.
- .PP
- .SM UTF
- is designed so the 7-bit
- .SM ASCII
- set (values hexadecimal 00 to 7F),
- appear only as themselves
- in the encoding.
- Runes with values above 7F appear as sequences of two or more
- bytes with values only from 80 to FF.
- .PP
- The
- .SM UTF
- encoding of the Unicode Standard is backward compatible with
- .SM ASCII\c
- :
- programs presented only with
- .SM ASCII
- work on Plan 9
- even if not written to deal with
- .SM UTF,
- as do
- programs that deal with uninterpreted byte streams.
- However, programs that perform semantic processing on
- .SM ASCII
- graphic
- characters must convert from
- .SM UTF
- to runes
- in order to work properly with non-\c
- .SM ASCII
- input.
- See
- .IR rune (2).
- .PP
- Letting numbers be binary,
- a rune x is converted to a multibyte
- .SM UTF
- sequence
- as follows:
- .PP
- 01. x in [00000000.0bbbbbbb] → 0bbbbbbb
- .br
- 10. x in [00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
- .br
- 11. x in [bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb
- .br
- .PP
- Conversion 01 provides a one-byte sequence that spans the
- .SM ASCII
- character set in a compatible way.
- Conversions 10 and 11 represent higher-valued characters
- as sequences of two or three bytes with the high bit set.
- Plan 9 does not support the 4, 5, and 6 byte sequences proposed by X-Open.
- When there are multiple ways to encode a value, for example rune 0,
- the shortest encoding is used.
- .PP
- In the inverse mapping,
- any sequence except those described above
- is incorrect and is converted to rune hexadecimal FFFD.
- .SH FILES
- .TF "/lib/unicode "
- .TP
- .B /lib/unicode
- table of characters and descriptions, suitable for
- .IR look (1).
- .SH "SEE ALSO"
- .IR ascii (1),
- .IR tcs (1),
- .IR rune (2),
- .IR keyboard (6),
- .IR "The Unicode Standard" .
|