- .TH DOC2TXT 1
- .SH NAME
- doc2txt, xls2txt, olefs, mswordstrings, msexceltables \- extract printable strings from Microsoft Office documents
- .SH SYNOPSIS
- .B doc2txt
- [
- .I file.doc
- ]
- .br
- .B xls2txt
- [
- .I file.xls
- ]
- .br
- .B aux/olefs
- [
- .B -m
- .I mtpt
- ]
- .I file.doc
- .br
- .B aux/mswordstrings
- .I /mnt/doc/WordDocument
- .br
- .B aux/msexceltables
- [
- .B -aDnt
- ] [
- .B -d
- .I delim
- ] [
- .B -w
- .I worksheet-spec
- ]
- .I /mnt/doc/Workbook
- .SH DESCRIPTION
- .I Doc2txt
- is a shell script that uses
- .I olefs
- and
- .I mswordstrings
- to extract the printable text from the body of a Microsoft Word document.
- .I Xls2txt
- performs a similar function for Microsoft Excel documents.
- .PP
- Microsoft Office documents are stored in OLE (Object Linking and Embedding)
- format, which is a scaled-down version of Microsoft's FAT file system.
- .I Olefs
- presents the contents of an Office document as a file system
- on
- .IR mtpt ,
- which defaults to
- .BR /mnt/doc .
- .I Mswordstrings
- or
- .I msexceltables
- may then be used to parse the files inside, extracting
- a text stream.
- .I Msexceltables
- may be given options to control the formatting of its output.
- .TP
- .B -n
- Disables field padding to column width.
- .TP
- .B -t
- Truncates fields to the column width.
- .TP
- .B -a
- Attempts conversion of non-tabular sheets (charts) in the workbook.
- .TP
- .BI -d " delim
- Sets the interfield delimiter to the string
- .IR delim ,
- by default a single space.
- .TP
- .B -D
- Enables debugging output.
- .TP
- .BI -w " worksheet-spec
- Specifies which worksheets to process; by default all tabular sheets are
- output. Suppressed chart pages are always included in the sheet count.
- Arbitrary lists of pages or page ranges may be given: individual pages
- are separated by commas, and the two ends of a range are separated by a minus sign.
- .SH EXAMPLE
- .EX
- aux/olefs report.xls
- aux/msexceltables -w 1,7,9-14,3-4 -n -d '@' /mnt/doc/Workbook
- unmount /mnt/doc
- .EE
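- .PP
- A similar sketch for a Word document, assuming a hypothetical file
- .I report.doc
- and the default mount point; these are roughly the steps that
- .I doc2txt
- is described as combining, not necessarily the exact contents of the script:
- .EX
- aux/olefs report.doc
- aux/mswordstrings /mnt/doc/WordDocument
- unmount /mnt/doc
- .EE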
- .SH SOURCE
- .B /sys/src/cmd/aux/mswordstrings.c
- .br
- .B /sys/src/cmd/aux/msexceltables.c
- .br
- .B /sys/src/cmd/aux/olefs.c
- .br
- .B /rc/bin/xls2txt
- .br
- .B /rc/bin/doc2txt
- .SH SEE ALSO
- .IR strings (1)
- .br
- ``Microsoft Word 97 Binary File Format'',
- available online at Microsoft's developer home page.
- .br
- ``LAOLA Binary Structures'',
- .I http://snake.cs.tu-berlin.de:8081/~schwartz/pmh
- .br
- ``OpenOffice.Org's Excel Documentation'',
- .I http://sc.openoffice.org/excelfileformat.pdf