.TH DOC2TXT 1
.SH NAME
doc2txt, doc2ps, wdoc2txt, xls2txt, olefs, mswordstrings, msexceltables
\- extract printable text from Microsoft documents
.SH SYNOPSIS
.B doc2txt
[
.I file.doc
]
.br
.B doc2ps
[
.I file.doc
]
.br
.B wdoc2txt
[
.I file.doc
]
.br
.B xls2txt
[
.I file.xls
]
.br
.B aux/olefs
[
.B -m
.I mtpt
]
.I file.doc
.br
.B aux/mswordstrings
.IB mtpt /WordDocument
.br
.B aux/msexceltables
[
.B -aDnt
] [
.B -d
.I delim
] [
.B -w
.I worksheet-range
]
.IB mtpt /Workbook
.SH DESCRIPTION
.I Doc2txt
is an
.IR rc (1)
script that uses
.I olefs
and
.I mswordstrings
to extract the printable text from the body of a Microsoft Word document
and write it on the standard output.
.I Doc2ps
is similar, but emits PostScript corresponding to the document.
.I Wdoc2txt
is similar to
.IR doc2txt ,
but uses
.IR plumb (1)
to send the output to a new
.IR acme (1)
window instead.
.I Xls2txt
performs a similar function for Microsoft Excel documents.
.PP
Microsoft Office documents are stored in OLE (Object Linking and Embedding)
format, which is a scaled-down version of Microsoft's FAT file system.
.I Olefs
presents the contents of an MS Office document as a file system
on
.IR mtpt ,
which defaults to
.BR /mnt/doc .
.I Mswordstrings
or
.I msexceltables
may then be used to parse the files inside, extracting
a text stream.
.I Msexceltables
may be given options to control the formatting of its output.
.TF "\fL-d \fIdelim"
.TP
.B -a
Attempt conversion of non-tabular sheets in the workbook (charts).
.TP
.BI -d " delim
Set the inter-field delimiter to the string
.IR delim ,
by default a single space.
.TP
.B -D
Enable debugging output.
.TP
.B -n
Disable padding of fields to the column width.
.TP
.B -t
Truncate fields to the column width.
.TP
.BI -w " worksheet-range
Limit processing to the worksheets named in
.IR worksheet-range ,
a comma-separated list of sheet numbers and ranges;
the two ends of a range are separated by a dash.
By default all tabular sheets are output.
Chart sheets are always included in the sheet count,
even when their output is suppressed.
.SH EXAMPLE
Extract pieces of an MS Excel spreadsheet.
.PD 0
.IP
.EX
aux/olefs report.xls
aux/msexceltables -w 1,7,9-14,3-4 -n -d '@' /mnt/doc/Workbook
unmount /mnt/doc
.EE
.PD
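.PP
A comparable sequence for a Word document, sketched by hand
on the assumption of a hypothetical file
.B memo.doc
and the default mount point.
It follows the steps that
.I doc2txt
is described as performing and is not a copy of that script.
.PD 0
.IP
.EX
# memo.doc is a hypothetical file name
aux/olefs memo.doc
aux/mswordstrings /mnt/doc/WordDocument
unmount /mnt/doc
.EE
.PD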
.SH SOURCE
.TF "\fL/sys/src/cmd/aux "
.TP
.B /rc/bin
.BR doc2txt ,
.BR doc2ps ,
.BR wdoc2txt ,
and
.B xls2txt
.TP
.B /sys/src/cmd/aux
the others
.fi
.PD
.SH SEE ALSO
.IR strings (1)
.br
``Microsoft Word 97 Binary File Format'',
at Microsoft's developer (MSDN) home page.
.br
``LAOLA Binary Structures'',
.B http://user.cs.tu-berlin.de/~schwartz/pmh
.br
``OpenOffice.Org's Excel Documentation'',
.br
.B http://sc.openoffice.org/excelfileformat.pdf