.TH DOC2TXT 1
.SH NAME
doc2txt, olefs, mswordstrings \- extract printable strings from Microsoft Word documents
.SH SYNOPSIS
.B doc2txt
[
.I file.doc
]
.br
.B aux/olefs
[
.B -m
.I mtpt
]
.I file.doc
.br
.B aux/mswordstrings
.I /mnt/doc/WordDocument
.SH DESCRIPTION
.I Doc2txt
is a shell script that uses
.I olefs
and
.I mswordstrings
to extract the printable text from the body of a Microsoft Word document.
.PP
Microsoft Office documents are stored in OLE (Object Linking and Embedding)
format, which is a scaled-down version of Microsoft's FAT file system.
.I Olefs
presents the contents of an Office document as a file system
on
.IR mtpt ,
which defaults to
.BR /mnt/doc .
.I Mswordstrings
parses the
.I WordDocument
file inside an Office document, extracting
the text stream.
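.SH EXAMPLE
A minimal sketch of running by hand the two steps that
.I doc2txt
is described as combining.
The file name
.B report.doc
is hypothetical, and the default mount point is assumed.
.IP
.EX
aux/olefs report.doc
aux/mswordstrings /mnt/doc/WordDocument
.EE
.PP
The same steps with a different mount point.
The directory
.B /n/doc
is an illustrative assumption and must already exist.
.IP
.EX
aux/olefs -m /n/doc report.doc
aux/mswordstrings /n/doc/WordDocument
.EE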
.SH SOURCE
.B /sys/src/cmd/aux/mswordstrings.c
.br
.B /sys/src/cmd/aux/olefs.c
.br
.B /rc/bin/doc2txt
.SH SEE ALSO
.IR strings (1)
.br
``Microsoft Word 97 Binary File Format'',
available on line at Microsoft's developer home page.
.br
``LAOLA Binary Structures'',
.IR snake.cs.tu-berlin.de:8081/~schwartz/pmh .