.TH DOC2TXT 1
.SH NAME
doc2txt, xls2txt, olefs, mswordstrings, msexceltables \- extract printable strings from Microsoft Office documents
.SH SYNOPSIS
.B doc2txt
[
.I file.doc
]
.br
.B xls2txt
[
.I file.xls
]
.br
.B aux/olefs
[
.B -m
.I mtpt
]
.I file.doc
.br
.B aux/mswordstrings
.I /mnt/doc/WordDocument
.br
.B aux/msexceltables
[
.B -n
] [
.B -t
] [
.B -a
] [
.BI -d delim
]
.I /mnt/doc/Workbook
.SH DESCRIPTION
.I Doc2txt
is a shell script that uses
.I olefs
and
.I mswordstrings
to extract the printable text from the body of a Microsoft Word document.
.I Xls2txt
performs a similar function for Microsoft Excel documents.
.PP
Microsoft Office documents are stored in OLE (Object Linking and Embedding)
format, which is a scaled-down version of Microsoft's FAT file system.
.I Olefs
presents the contents of an Office document as a file system
on
.IR mtpt ,
which defaults to
.BR /mnt/doc .
.I Mswordstrings
or
.I msexceltables
may then be used to parse the files inside, extracting
a text stream.
.I Msexceltables
may be given options to control the formatting of its output.
.TP
-n
Disable field padding to column width.
.TP
-t
Truncate fields to the column width.
.TP
-a
Attempt conversion of non-tabular sheets in the workbook (such as charts).
.TP
-d \fIdelim\fR
Set the interfield delimiter to the string \fIdelim\fR; the default is a single space.
.SH SOURCE
.B /sys/src/cmd/aux/mswordstrings.c
.br
.B /sys/src/cmd/aux/msexceltables.c
.br
.B /sys/src/cmd/aux/olefs.c
.br
.B /rc/bin/xls2txt
.br
.B /rc/bin/doc2txt
.SH BUGS
.I Msexceltables
cannot parse files containing rich text field descriptions or Asian phonetic
pronunciation hints, due to a lack of documentation on these formats.
It has only been tested on BIFF8 files generated by MS Office 97; caveat emptor.
.SH SEE ALSO
.IR strings (1)
.br
``Microsoft Word 97 Binary File Format'',
available online at Microsoft's developer home page.
.br
``LAOLA Binary Structures'',
.I http://snake.cs.tu-berlin.de:8081/~schwartz/pmh
.br
``OpenOffice.Org's Excel Documentation'',
.I http://sc.openoffice.org/excelfileformat.pdf