doc2txt 2.5 KB

  1. .TH DOC2TXT 1
  2. .SH NAME
  3. doc2txt, wdoc2txt, xls2txt, olefs, mswordstrings, msexceltables \- extract printable strings from Microsoft Office documents
  4. .SH SYNOPSIS
  5. .B doc2txt
  6. [
  7. .I file.doc
  8. ]
  9. .br
  10. .B wdoc2txt
  11. [
  12. .I file.doc
  13. ]
  14. .br
  15. .B xls2txt
  16. [
  17. .I file.xls
  18. ]
  19. .br
  20. .B aux/olefs
  21. [
  22. .B -m
  23. .I mtpt
  24. ]
  25. .I file.doc
  26. .br
  27. .B aux/mswordstrings
  28. .I /mnt/doc/WordDocument
  29. .br
  30. .B aux/msexceltables
  31. [
  32. .B -aDnt
  33. ] [
  34. .B -d
  35. .I delim
  36. ] [
  37. .B -w
  38. .I worksheet-range
  39. ]
  40. .I /mnt/doc/Workbook
  41. .SH DESCRIPTION
  42. .I Doc2txt
  43. is an
  44. .IR rc (1)
  45. script that uses
  46. .I olefs
  47. and
  48. .I mswordstrings
  49. to extract the printable text from the body of a Microsoft Word document and write it on the standard output.
  50. .I Wdoc2txt
  51. is similar, but uses
  52. .IR plumb (1)
  53. to send the output to a new
  54. .IR acme (1)
  55. window instead.
  56. .I Xls2txt
  57. performs a similar function for Microsoft Excel documents.
  58. .PP
  59. Microsoft Office documents are stored in OLE (Object Linking and Embedding)
  60. format, which is a scaled-down version of Microsoft's FAT file system.
  61. .I Olefs
  62. presents the contents of an Office document as a file system
  63. on
  64. .IR mtpt ,
  65. which defaults to
  66. .BR /mnt/doc .
  67. .I Mswordstrings
  68. or
  69. .I msexceltables
  70. may then be used to parse the files inside, extracting
  71. a text stream.
  72. .I Msexceltables
  73. may be given options to control the formatting of its output.
  74. .TP
  75. .B -n
  76. Disables field padding to column width.
  77. .TP
  78. .B -t
  79. Truncates fields to the column width.
  80. .TP
  81. .B -a
  82. Attempts conversion of non-tabular sheets (charts) in the workbook.
  83. .TP
  84. .BI -d " delim
  85. Sets the interfield delimiter to the string
  86. .IR delim ,
  87. by default a single space.
  88. .TP
  89. .B -D
  90. Enables debugging output.
  91. .TP
  92. .BI -w " worksheet-spec
  93. Specifies which worksheets to process; by default all tabular sheets are
  94. output. Suppressed chart pages are always included in the sheet count.
  95. Arbitrary lists of pages or page ranges may be given: individual pages
  96. are separated by commas, and the two ends of a sheet range by a minus sign.
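The effect of the formatting options can be pictured with a minimal Python sketch. It is not taken from msexceltables.c: the function name `format_row`, its parameters, and the choice of left-justified padding are all assumptions made for illustration, since the text above only says that fields are padded or truncated to the column width and joined with a delimiter.

```python
def format_row(fields, widths, delim=" ", pad=True, truncate=False):
    """Join one row of cell strings (hypothetical helper, not the tool's code).

    pad=False   models -n (no padding to the column width)
    truncate    models -t (cut fields to the column width)
    delim       models -d (default is a single space)
    """
    out = []
    for text, width in zip(fields, widths):
        if truncate:
            text = text[:width]        # drop anything past the column width
        if pad:
            # Assumption: padding is added on the right (left-justified).
            text = text.ljust(width)
        out.append(text)
    return delim.join(out)


if __name__ == "__main__":
    row, widths = ["ab", "cdef"], [3, 2]
    print(repr(format_row(row, widths)))
    print(repr(format_row(row, widths, truncate=True)))
    print(repr(format_row(row, widths, delim="@", pad=False)))
```

In this sketch an overlong field is left intact unless truncation is requested, so it may push later columns out of alignment; whether the real program behaves the same way is not stated above.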
  97. .SH EXAMPLE
  98. .EX
  99. aux/olefs report.xls
  100. aux/msexceltables -w 1,7,9-14,3-4 -n -d '@' /mnt/doc/Workbook
  101. unmount /mnt/doc
  102. .EE
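To make the worksheet specification used in the example concrete, the following Python sketch parses a string such as 1,7,9-14,3-4 into inclusive ranges. It illustrates the syntax described under the -w option only; the names `parse_sheet_spec` and `sheet_selected` are hypothetical, and the rejection of reversed ranges is an assumption rather than documented behaviour of the program.

```python
def parse_sheet_spec(spec):
    """Turn '1,7,9-14,3-4' into a list of inclusive (first, last) pairs."""
    ranges = []
    for item in spec.split(","):          # pages are separated by commas
        first, sep, last = item.partition("-")   # ranges use a minus sign
        lo = int(first)
        hi = int(last) if sep else lo     # a single page is a one-sheet range
        if lo > hi:
            # Assumption: a reversed range is treated as an error here.
            raise ValueError("reversed range: " + item)
        ranges.append((lo, hi))
    return ranges


def sheet_selected(number, ranges):
    """Report whether sheet `number` falls inside any requested range."""
    return any(lo <= number <= hi for lo, hi in ranges)


if __name__ == "__main__":
    wanted = parse_sheet_spec("1,7,9-14,3-4")
    print(wanted)
    print([n for n in range(1, 16) if sheet_selected(n, wanted)])
```

Because suppressed chart pages are included in the sheet count, the numbers given refer to positions among all sheets in the workbook, not only the tabular ones that are output.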
  103. .SH SOURCE
  104. .B /rc/bin/doc2txt
  105. .br
  106. .B /rc/bin/wdoc2txt
  107. .br
  108. .B /rc/bin/xls2txt
  109. .br
  110. .B /sys/src/cmd/aux/msexceltables.c
  111. .br
  112. .B /sys/src/cmd/aux/mswordstrings.c
  113. .br
  114. .B /sys/src/cmd/aux/olefs.c
  115. .SH SEE ALSO
  116. .IR strings (1)
  117. .br
  118. ``Microsoft Word 97 Binary File Format'',
  119. available online at Microsoft's developer home page.
  120. .br
  121. ``LAOLA Binary Structures'',
  122. .I http://snake.cs.tu-berlin.de:8081/~schwartz/pmh
  123. .br
  124. ``OpenOffice.Org's Excel Documentation'',
  125. .I http://sc.openoffice.org/excelfileformat.pdf