- .TH DOC2TXT 1
- .SH NAME
- doc2txt, doc2ps, wdoc2txt, xls2txt, olefs, mswordstrings, msexceltables
- \- extract printable text from Microsoft documents
- .SH SYNOPSIS
- .B doc2txt
- [
- .I file.doc
- ]
- .br
- .B doc2ps
- [
- .I file.doc
- ]
- .br
- .B wdoc2txt
- [
- .I file.doc
- ]
- .br
- .B xls2txt
- [
- .I file.xls
- ]
- .br
- .B aux/olefs
- [
- .B -m
- .I mtpt
- ]
- .I file.doc
- .br
- .B aux/mswordstrings
- .IB mtpt /WordDocument
- .br
- .B aux/msexceltables
- [
- .B -aDnt
- ] [
- .B -d
- .I delim
- ] [
- .B -w
- .I worksheet-range
- ]
- .IB mtpt /Workbook
- .SH DESCRIPTION
- .I Doc2txt
- is an
- .IR rc (1)
- script that uses
- .I olefs
- and
- .I mswordstrings
- to extract the printable text from the body of a Microsoft Word document
- and write it on the standard output.
- .I Doc2ps
- is similar, but emits PostScript corresponding to the document.
- .I Wdoc2txt
- is similar to
- .IR doc2txt ,
- but uses
- .IR plumb (1)
- to send the output to a new
- .IR acme (1)
- window instead.
- .I Xls2txt
- performs a similar function for Microsoft Excel documents.
- .PP
- Microsoft Office documents are stored in OLE (Object Linking and Embedding)
- format, which is a scaled-down version of Microsoft's FAT file system.
- .I Olefs
- presents the contents of an MS Office document as a file system
- on
- .IR mtpt ,
- which defaults to
- .BR /mnt/doc .
- .I Mswordstrings
- or
- .I msexceltables
- may then be used to parse the files inside, extracting
- a text stream.
- .I Msexceltables
- may be given options to control the formatting of its output.
- .TF "\fL-d \fIdelim"
- .TP
- .B -a
- Attempts conversion of non-tabular sheets in the workbook (charts).
- .TP
- .BI -d " delim
- Sets the inter-field delimiter to the string
- .IR delim ,
- by default a single space.
- .TP
- .B -D
- Enables debugging output.
- .TP
- .B -n
- Disables field padding to column width.
- .TP
- .B -t
- Truncates fields to the column width.
- .TP
- .BI -w " pages
- .I Pages
- is a comma-separated list of page numbers and ranges.
- Ranges are separated by dashes.
- Limit processing to just those pages named;
- by default all tabular sheets are output.
- Suppressed chart pages are always included in the sheet count.
- .SH EXAMPLE
- Extract pieces of an MS Excel spreadsheet.
- .PD 0
- .IP
- .EX
- aux/olefs report.xls
- aux/msexceltables -w 1,7,9-14,3-4 -n -d '@' /mnt/doc/Workbook
- unmount /mnt/doc
- .EE
- .PD
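- .PP
- Extract the text of an MS Word document by hand, roughly as
- .I doc2txt
- is described as doing.
- This is a sketch only: the file name
- .B memo.doc
- is illustrative, and the default mount point
- .B /mnt/doc
- is assumed.
- .PD 0
- .IP
- .EX
- aux/olefs memo.doc
- aux/mswordstrings /mnt/doc/WordDocument
- unmount /mnt/doc
- .EE
- .PD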
- .SH SOURCE
- .TF "\fL/sys/src/cmd/aux "
- .TP
- .B /rc/bin
- .BR doc2txt ,
- .BR doc2ps ,
- .BR wdoc2txt ,
- and
- .BR xls2txt
- .TP
- .B /sys/src/cmd/aux
- the others
- .fi
- .PD
- .SH SEE ALSO
- .IR strings (1)
- .br
- ``Microsoft Word 97 Binary File Format'',
- at Microsoft's developer (MSDN) home page.
- .br
- ``LAOLA Binary Structures'',
- .B http://user.cs.tu-berlin.de/~schwartz/pmh
- .br
- ``OpenOffice.Org's Excel Documentation'',
- .br
- .B http://sc.openoffice.org/excelfileformat.pdf