.TH DOC2TXT 1
.SH NAME
doc2txt, xls2txt, olefs, mswordstrings, msexceltables \- extract printable strings from Microsoft Office documents
.SH SYNOPSIS
.B doc2txt
[
.I file.doc
]
.br
.B xls2txt
[
.I file.xls
]
.br
.B aux/olefs
[
.B -m
.I mtpt
]
.I file.doc
.br
.B aux/mswordstrings
.I /mnt/doc/WordDocument
.br
.B aux/msexceltables
[
.B -aDnt
] [
.B -d
.I delim
] [
.B -w
.I worksheet-range
]
.I /mnt/doc/Workbook
.SH DESCRIPTION
.I Doc2txt
is a shell script that uses
.I olefs
and
.I mswordstrings
to extract the printable text from the body of a Microsoft Word document.
.I Xls2txt
performs a similar function for Microsoft Excel documents.
.PP
Microsoft Office documents are stored in OLE (Object Linking and Embedding)
format, which is a scaled-down version of Microsoft's FAT file system.
.I Olefs
presents the contents of an Office document as a file system
on
.IR mtpt ,
which defaults to
.BR /mnt/doc .
.I Mswordstrings
or
.I msexceltables
may then be used to parse the files inside, extracting
a text stream.
.I Msexceltables
may be given options to control the formatting of its output.
.TP
.B -n
Disables field padding to column width.
.TP
.B -t
Truncates fields to the column width.
.TP
.B -a
Attempts conversion of non-tabular sheets (charts) in the workbook.
.TP
.BI -d " delim
Sets the interfield delimiter to the string
.IR delim ,
by default a single space.
.TP
.B -D
Enables debugging output.
.TP
.BI -w " worksheet-spec
Specifies which worksheets to process; by default all tabular sheets are
output. Suppressed chart pages are always included in the sheet count.
Arbitrary lists of pages or page ranges may be given: individual pages
are separated by commas, and the two ends of a range are separated by a minus sign.
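.PP
The worksheet specification syntax can be illustrated with a short Python sketch.
It is not taken from the
.I msexceltables
source; the function names
.I parse_sheet_spec
and
.I wanted
are hypothetical, and the sketch assumes that ranges are inclusive and that
sheets are numbered from 1, neither of which is stated above.
.EX
```python
def parse_sheet_spec(spec):
    """Parse a spec such as '1,7,9-14,3-4' into inclusive (first, last) pairs."""
    ranges = []
    for item in spec.split(","):
        # A plain number selects one sheet; 'a-b' selects sheets a through b.
        first, sep, last = item.partition("-")
        lo = int(first)
        hi = int(last) if sep else lo
        if hi < lo:
            raise ValueError("descending range: " + item)
        ranges.append((lo, hi))
    return ranges


def wanted(sheet, ranges):
    """Report whether a sheet number falls inside any parsed range."""
    return any(lo <= sheet <= hi for lo, hi in ranges)


ranges = parse_sheet_spec("1,7,9-14,3-4")
print(ranges)
# prints [(1, 1), (7, 7), (9, 14), (3, 4)]

# Assumption: sheets are counted from 1.
print([n for n in range(1, 16) if wanted(n, ranges)])
# prints [1, 3, 4, 7, 9, 10, 11, 12, 13, 14]
```
.EE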
.SH EXAMPLE
.EX
aux/olefs report.xls
aux/msexceltables -w 1,7,9-14,3-4 -n -d '@' /mnt/doc/Workbook
unmount /mnt/doc
.EE
.SH SOURCE
.B /sys/src/cmd/aux/mswordstrings.c
.br
.B /sys/src/cmd/aux/msexceltables.c
.br
.B /sys/src/cmd/aux/olefs.c
.br
.B /rc/bin/xls2txt
.br
.B /rc/bin/doc2txt
.SH SEE ALSO
.IR strings (1)
.br
``Microsoft Word 97 Binary File Format'',
available online at Microsoft's developer home page.
.br
``LAOLA Binary Structures'',
.I http://snake.cs.tu-berlin.de:8081/~schwartz/pmh
.br
``OpenOffice.Org's Excel Documentation'',
.I http://sc.openoffice.org/excelfileformat.pdf