.TH DOC2TXT 1
.SH NAME
doc2txt, xls2txt, olefs, mswordstrings, msexceltables \- extract printable strings from Microsoft Office documents
.SH SYNOPSIS
.B doc2txt
[
.I file.doc
]
.br
.B xls2txt
[
.I file.xls
]
.br
.B aux/olefs
[
.B -m
.I mtpt
]
.I file.doc
.br
.B aux/mswordstrings
.I /mnt/doc/WordDocument
.br
.B aux/msexceltables
[
.B -aDnt
] [
.B -d
.I delim
]
.I /mnt/doc/Workbook
.SH DESCRIPTION
.I Doc2txt
is a shell script that uses
.I olefs
and
.I mswordstrings
to extract the printable text from the body of a Microsoft Word document.
.I Xls2txt
performs a similar function for Microsoft Excel documents.
.PP
Microsoft Office documents are stored in OLE (Object Linking and Embedding)
format, which is a scaled-down version of Microsoft's FAT file system.
.I Olefs
presents the contents of an Office document as a file system
on
.IR mtpt ,
which defaults to
.BR /mnt/doc .
.I Mswordstrings
or
.I msexceltables
may then be used to parse the files inside, extracting
a text stream.
.I Msexceltables
may be given options to control the formatting of its output.
.TP
.B -n
Disable padding of fields to the column width.
.TP
.B -t
Truncate fields to the column width.
.TP
.B -a
Attempt conversion of non-tabular sheets in the workbook (charts).
.TP
.BI -d " delim"
Set the interfield delimiter to the string
.IR delim ,
by default a single space.
.TP
.B -D
Enable debugging output.
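.PP
The interaction of
.BR -n ,
.BR -t ,
and
.B -d
can be pictured with the short Python sketch below.
It is only an illustration of the behaviour described above, not code taken from
.IR msexceltables :
the function name format_row and its parameters are hypothetical,
and left-justified padding with widths counted in characters is an assumption.

```python
def format_row(fields, widths, delim=" ", pad=True, truncate=False):
    """Join one row of cell strings as the options above describe.

    pad=False models -n, truncate=True models -t, delim models -d.
    Left-justified padding is an assumption, not taken from the source.
    """
    out = []
    for text, width in zip(fields, widths):
        if truncate:
            # -t: cut the field down to the column width
            text = text[:width]
        if pad:
            # default: pad the field out to the column width
            text = text.ljust(width)
        out.append(text)
    return delim.join(out)


if __name__ == "__main__":
    row = ["apple", "3.5", "a long comment"]
    widths = [8, 4, 6]
    print(repr(format_row(row, widths)))                 # default: padded
    print(repr(format_row(row, widths, pad=False)))      # like -n
    print(repr(format_row(row, widths, truncate=True)))  # like -t
    print(repr(format_row(row, widths, delim="|")))      # like -d '|'
```

In this sketch a field longer than its column is left intact unless truncation is requested, so padding alone never loses data.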
.SH SOURCE
.B /sys/src/cmd/aux/mswordstrings.c
.br
.B /sys/src/cmd/aux/msexceltables.c
.br
.B /sys/src/cmd/aux/olefs.c
.br
.B /rc/bin/xls2txt
.br
.B /rc/bin/doc2txt
.SH BUGS
.I Msexceltables
cannot parse files containing rich text field descriptions or Asian phonetic
pronunciation hints, due to a lack of documentation on these formats.
It has only been tested on BIFF8 files generated by MS Office 97; caveat emptor.
.SH SEE ALSO
.IR strings (1)
.br
``Microsoft Word 97 Binary File Format'',
available online at Microsoft's developer home page.
.br
``LAOLA Binary Structures'',
.I http://snake.cs.tu-berlin.de:8081/~schwartz/pmh
.br
``OpenOffice.Org's Excel Documentation'',
.I http://sc.openoffice.org/excelfileformat.pdf