.TH DOC2TXT 1
.SH NAME
doc2txt, doc2ps, wdoc2txt, xls2txt, olefs, mswordstrings, msexceltables
\- extract printable text from Microsoft documents
.SH SYNOPSIS
.B doc2txt
[
.I file.doc
]
.br
.B doc2ps
[
.I file.doc
]
.br
.B wdoc2txt
[
.I file.doc
]
.br
.B xls2txt
[
.I file.xls
]
.br
.B aux/olefs
[
.B -m
.I mtpt
]
.I file.doc
.br
.B aux/mswordstrings
.IB mtpt /WordDocument
.br
.B aux/msexceltables
[
.B -qaDnt
] [
.B -d
.I delim
] [
.B -c
.I column-range
] [
.B -w
.I worksheet-range
]
.IB mtpt /Workbook
.SH DESCRIPTION
.I Doc2txt
is an
.IR rc (1)
script that uses
.I olefs
and
.I mswordstrings
to extract the printable text from the body of a Microsoft Word document
and write it on the standard output.
.I Doc2ps
is similar, but emits PostScript corresponding to the document.
.I Wdoc2txt
is similar to
.IR doc2txt ,
but uses
.IR plumb (1)
to send the output to a new
.IR acme (1)
window instead.
.I Xls2txt
performs a similar function for Microsoft Excel documents.
.PP
Microsoft Office documents are stored in OLE (Object Linking and Embedding)
format, which is a scaled-down version of Microsoft's FAT file system.
.I Olefs
presents the contents of an MS Office document as a file system
on
.IR mtpt ,
which defaults to
.BR /mnt/doc .
.I Mswordstrings
or
.I msexceltables
may then be used to parse the files inside, extracting
a text stream.
.I Msexceltables
may be given options to control the formatting of its output.
.TF "\fL-d \fIdelim"
.TP
.B -a
Attempt conversion of non-tabular sheets in the workbook (charts).
.TP
.BI -d " delim
Set the inter-field delimiter to the string
.IR delim ;
the default is a single space.
.TP
.B -D
Enable debugging output.
.TP
.BI -c " range
.I Range
is a comma-separated list of column numbers and ranges;
the two ends of a range are separated by a dash.
Processing is limited to the columns named;
by default all columns are output.
.TP
.B -n
Disable field padding to column width.
.TP
.B -q
Disable quoting of textual fields (see
.IR quote (2)).
.TP
.B -t
Truncate fields to the column width.
.TP
.BI -w " range
.I Range
is a comma-separated list of worksheet numbers and ranges,
in the same syntax as for the
.B -c
option above;
only the named sheets are output.
Suppressed chart pages are always included in the sheet count.
.SH EXAMPLE
Extract pieces of an MS Excel spreadsheet.
.PD 0
.IP
.EX
.SM
aux/olefs report.xls
aux/msexceltables -q -w 1,7,9-14 -c 3-5 -n -d '@' /mnt/doc/Workbook > rpt.txt
unmount /mnt/doc
.EE
.PD
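.PP
A Word document can be handled the same way by running the pieces by hand,
following the forms given in the synopsis.
This is only a sketch:
.B memo.doc
and
.B /n/memo
are hypothetical names, and the mount point is assumed to exist already.
.PD 0
.IP
.EX
.SM
# memo.doc and /n/memo are hypothetical names
aux/olefs -m /n/memo memo.doc
aux/mswordstrings /n/memo/WordDocument > memo.txt
unmount /n/memo
.EE
.PD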
.SH SOURCE
.TF "\fL/sys/src/cmd/aux "
.TP
.B /rc/bin
.BR doc2txt ,
.BR doc2ps ,
.BR wdoc2txt ,
and
.BR xls2txt
.TP
.B /sys/src/cmd/aux
the others
.fi
.PD
.SH SEE ALSO
.IR strings (1)
.br
``Microsoft Word 97 Binary File Format'',
at Microsoft's developer (MSDN) home page.
.br
``LAOLA Binary Structures'',
.B http://user.cs.tu-berlin.de/~schwartz/pmh
.br
``OpenOffice.Org's Excel Documentation'',
.br
.B http://sc.openoffice.org/excelfileformat.pdf