- Unicode support in busybox
- There are several scenarios where we need to handle unicode
- correctly.
- Shell input
- We want to correctly handle input of unicode characters.
- There are several problems with it. Just handling input
- as a sequence of bytes would break any editing. This was fixed,
- and lineedit now operates on an array of wchar_t's.
- But we also need to handle the following problem areas:
- * It is unreasonable to expect that the output device supports
- _any_ unicode chars. Perhaps we need to avoid printing
- those chars which are not supported by the output device.
- Examples: chars which are not present in the font,
- chars which are not assigned in unicode,
- combining chars (especially trying to combine bad pairs:
- a_chinese_symbol + "combining grave accent" = ??!)
- * We need to account for the fact that unicode chars have
- different widths: 0 for combining chars, 1 for usual,
- 2 for ideograms (are there 3+ wide chars?).
- * Bidirectional handling. If the user wants to echo a phrase
- in Hebrew, he types: echo "srettel werbeH"
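- The width and "unassigned char" points above can be sketched as
- follows. This is only an illustration in Python, not busybox code:
- the helper names (char_width, display_width, filter_unassigned) are
- hypothetical, and the 0/1/2 rule is an approximation based on the
- general category and East Asian Width properties. It cannot tell
- whether a glyph is present in the font.

```python
import unicodedata

def char_width(ch):
    """Approximate number of terminal columns for one code point."""
    # Non-spacing and enclosing marks (combining chars) take no column.
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0
    # East Asian Wide / Fullwidth chars (ideograms etc.) take two columns.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1

def display_width(s):
    """Sum of the approximate widths of all code points in s."""
    return sum(char_width(ch) for ch in s)

def filter_unassigned(s, repl="?"):
    """Replace code points that are unassigned (category Cn) with repl."""
    return "".join(repl if unicodedata.category(ch) == "Cn" else ch for ch in s)
```

- Which code points count as unassigned depends on the Unicode
- version of the tables in use, so results may differ between systems.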
- Editors (vi, ed)
- This case is a bit similar to "shell input", but unlike shell,
- editors may encounter many more unexpected unicode sequences
- (try to load a random binary file...), and they need to preserve
- them, unlike shell which can afford to drop bogus input.
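- One way to preserve bogus sequences while still editing in terms of
- characters is to map each undecodable byte to a placeholder that can
- be turned back into the same byte on save. The sketch below uses
- Python's "surrogateescape" error handler for this. The function names
- are hypothetical and this is not how busybox vi or ed work.

```python
def load_text(raw):
    """Decode bytes as UTF-8; invalid bytes become lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")

def save_text(text):
    """Encode back to bytes; escaped surrogates become the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")

if __name__ == "__main__":
    # Valid UTF-8, then stray bytes and a truncated 3-byte sequence.
    raw = b"ok \xe4\xb8\xad bad \xff\xfe\xe4\xb8 end"
    assert save_text(load_text(raw)) == raw
```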
- more, less
- Need to correctly display any input file. Ideally, with
- ASCII/unicode/filtered_unicode option or keyboard switch.
- Note: need to handle tabs and backspaces specially
- (bksp is for manpage compat).
- cut, fold, watch
- May need the ability to cut a unicode string to a specified number
- of wchars and/or to a specified screen width. Need to handle tabs
- specially.
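- Cutting to a number of wchars is plain slicing; cutting to a screen
- width needs per-char widths and tab expansion. A minimal sketch in
- Python, assuming a hypothetical cut_to_width helper, 8-column tab
- stops, and the same approximate 0/1/2 width rule as above:

```python
import unicodedata

def char_width(ch):
    """Approximate columns: 0 for combining marks, 2 for wide, else 1."""
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1

def cut_to_width(s, max_cols, tabstop=8):
    """Return the longest prefix of s that fits in max_cols columns."""
    out = []
    col = 0
    for ch in s:
        if ch == "\t":
            # A tab advances to the next multiple of tabstop.
            w = tabstop - col % tabstop
        else:
            w = char_width(ch)
        if col + w > max_cols:
            break
        out.append(ch)
        col += w
    return "".join(out)
```

- Note that a wide char which would straddle the limit is dropped
- entirely, and a zero-width combining char stays attached to the
- char before it.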
- sed, awk, grep
- Handle unicode-aware regexp match
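- To illustrate what "unicode-aware" means here: a byte-wise matcher
- treats "." and "\w" as single bytes, so multi-byte UTF-8 chars break
- patterns that count or classify characters. The Python sketch below
- (hypothetical helper names) contrasts the two modes using the
- standard re module.

```python
import re

def matches_chars(pattern, text):
    """Match against decoded characters (unicode-aware)."""
    return re.fullmatch(pattern, text) is not None

def matches_bytes(pattern, text):
    """Match against the raw UTF-8 bytes (byte-wise, ASCII classes)."""
    return re.fullmatch(pattern.encode("ascii"), text.encode("utf-8")) is not None

if __name__ == "__main__":
    word = "na\u00efve"  # 5 characters, 6 bytes in UTF-8
    print(matches_chars(r".{5}", word), matches_bytes(r".{5}", word))
    print(matches_chars(r"\w+", word), matches_bytes(r"\w+", word))
```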
- ls (multi-column display)
- ls will fail to line up columnar output if it does not account
- for character widths (and maybe filter out some of them, see
- above). OTOH, non-columnar views (ls -1, ls -l, ls | cat)
- should NOT filter out bad unicode (but need to filter out
- control chars; coreutils does that). Note that unlike more/less,
- tabs and backspaces need no special handling.
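- For the non-columnar case, the filtering can work on raw bytes:
- replace ASCII control chars and leave everything else, including
- invalid UTF-8, untouched. A minimal Python sketch with a hypothetical
- helper name, using "?" as the replacement in the spirit of
- coreutils' ls -q; C1 control chars (U+0080..U+009F) are not handled:

```python
def mask_control_bytes(name):
    """Replace ASCII control bytes (0x00-0x1F, 0x7F) with '?'."""
    return bytes(0x3F if (b < 0x20 or b == 0x7F) else b for b in name)

if __name__ == "__main__":
    # Newline and ESC are masked; the stray 0xFF byte is passed through.
    print(mask_control_bytes(b"a\nb\x1b[31m\xff"))
```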
- top, ps
- Need to perform filtering similar to ls.
- Filename display (in error messages and elsewhere)
- Need to perform filtering similar to ls.
- TODO: write an email to Asmus Freytag (asmus@unicode.org),
- author of http://unicode.org/reports/tr11/