ebookimp.ms 12 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389
  1. .TL
  2. Navigating Large XML Documents on Small Devices
  3. .AU
  4. Roger Peppe
  5. .AI
  6. Vita Nuova
  7. .br
  8. April 2002
  9. .AB
  10. Browsing eBooks on platforms with limited memory presents an
  11. interesting problem: how can memory usage be bounded despite
  12. the need to view documents that may be much larger than the
  13. available memory. A simple interface to an XML parser enables
  14. this whilst retaining much of the ease of access afforded
  15. by XML parsers that read all of a document into memory at once.
  16. .AE
  17. .SH
  18. Introduction
  19. .LP
  20. The Open Ebook Publication Structure was devised by the Open Ebook Forum
  21. in order to ``provide a specification for representing the content of electronic
  22. books''. It is based on many existing standards, notably XML and HTML.
  23. An Open eBook publication consists of a set of documents bound together
  24. with an Open eBook package file which enumerates all the documents,
  25. pictures and other items that make up the book
  26. .LP
  27. The underlying document format is essentially HTML compatible,
  28. which is where the first problem arises: HTML was not designed to
  29. make it easy to view partial sections of a document. Conventionally
  30. an entire HTML document is read in at once and rendered onto
  31. the device. When viewing an eBook on a limited-memory device,
  32. however, this may not be possible; books tend to be fairly large.
  33. For such a device, the ideal format would keep the book itself
  34. in non-volatile storage (e.g. flash or disk) and make it possible
  35. for reader to seek to an arbitrary position in the book and render
  36. what it finds there.
  37. .LP
  38. This is not possible in an HTML or XML document, as the
  39. arbitrarily nested nature of the format means that every
  40. position in the document has some unknown surrounding context,
  41. which cannot be discovered without reading sequentially through
  42. the document from the beginning.
  43. .SH
  44. SAX and DOM
  45. .LP
  46. There are two conventional programming interfaces to an XML
  47. parser. A SAX parser provides a stream of XML entities, leaving
  48. it up to the application to maintain the context. It is not possible
  49. to rewind the stream, except, perhaps, to the beginning.
  50. Using a SAX parser is
  51. fairly straightforward, but awkward: the stream-like nature
  52. of the interface does not map well to the tree-like structure
  53. that is XML. A DOM parser reads a whole document into an internal
  54. data structure representation, so a program can treat it exactly
  55. as a tree. This also enables a program to access parts of the
  56. document in an arbitrary order.
  57. The DOM approach is all very well for small documents, but for large
  58. documents the memory usage can rapidly grow to exceed
  59. the available memory capacity. For eBook documents, this is unacceptable.
  60. .SH
  61. A different approach
  62. .LP
  63. The XML parser used in the eBook browser is akin to a SAX parser,
  64. in that only a little of the XML structure is held in memory at one time.
  65. The first significant difference is that the XML entities returned are
  66. taken from one level of the tree - if the program does not wish to
  67. see the contents of a particular XML tag, it is trivial to skip over.
  68. The second significant difference is that random access is possible.
  69. This possibility comes from the observation that if we have visited
  70. a part of the document we can record the context that we found there
  71. and restore it later if necessary. In this scheme, if we wish to return later to
  72. a part of a document that we are currently at, we can create a ``mark'',
  73. a token that holds the current context; at some later time we can use
  74. that mark to return to this position.
  75. .LP
  76. The eBook browser uses this technique to enable random access
  77. to the document on a page-by-page basis. Moreover a mark
  78. can be written to external storage, thus allowing an external
  79. ``index'' into the document so it is not always necessary to
  80. read the entire document from the start in order to jump to a particular
  81. page in that document.
  82. .SH
  83. The programming interface
  84. .LP
  85. The interface is implemented by a module named
  86. .CW Xml ,
  87. which provides a
  88. .CW Parser
  89. adt which gives access to the contents of an XML document.
  90. Xml items are represented by an
  91. .CW Item
  92. pick adt with one branch of the pick corresponding to each
  93. type of item that might be encountered.
  94. .LP
  95. The interface to the parser looks like this:
  96. .P1
  97. open: fn(f: string, warning: chan of (Locator, string)): (ref Parser, string);
  98. Parser: adt {
  99. next: fn(p: self ref Parser): ref Item;
  100. down: fn(p: self ref Parser);
  101. up: fn(p: self ref Parser);
  102. mark: fn(p: self ref Parser): ref Mark;
  103. atmark: fn(p: self ref Parser, m: ref Mark): int;
  104. goto: fn(p: self ref Parser, m: ref Mark);
  105. str2mark: fn(p: self ref Parser, s: string): ref Mark;
  106. };
  107. .P2
  108. To start parsing an XML document, it must first be
  109. .CW open ed;
  110. .CW warning
  111. is a channel on which non-fatal error messages will be sent
  112. if they are encountered during the parsing of the document.
  113. It can be nil, in which case warnings are ignored.
  114. If the document is opened successfully, a new
  115. .CW Parser
  116. adt, say
  117. .I p ,
  118. is returned.
  119. Calling
  120. .CW \fIp\fP.next
  121. returns the next XML item at the current level of the tree. If there
  122. are no more items in the current branch at the current level, it
  123. returns
  124. .CW nil .
  125. When a
  126. .CW Tag
  127. item is returned,
  128. .CW \fIp\fP.down
  129. can be used to descend ``into'' that tag; subsequent calls of
  130. .CW \fIp\fP.next
  131. will return XML items contained within the tag,
  132. and
  133. .CW \fIp\fP.up
  134. returns to the previous level.
  135. .LP
  136. An
  137. .CW Item
  138. is a pick adt:
  139. .P1
  140. Item: adt {
  141. fileoffset: int;
  142. pick {
  143. Tag =>
  144. name: string;
  145. attrs: Attributes;
  146. Text =>
  147. ch: string;
  148. ws1, ws2: int;
  149. Process =>
  150. target: string;
  151. data: string;
  152. Doctype =>
  153. name: string;
  154. public: int;
  155. params: list of string;
  156. Stylesheet =>
  157. attrs: Attributes;
  158. Error =>
  159. loc: Locator;
  160. msg: string;
  161. }
  162. };
  163. .P2
  164. .CW Item.Tag
  165. represents a XML tag, empty or not. The XML
  166. fragments
  167. .CW "<tag></tag>" '' ``
  168. and
  169. .CW "<tag />" '' ``
  170. look identical from the point of view of this interface.
  171. A
  172. .CW Text
  173. item holds text found in between tags, with adjacent whitespaces merged
  174. and whitespace at the beginning and end of the text elided.
  175. .CW Ws1
  176. and
  177. .CW ws2
  178. are non-zero if there was originally whitespace at the beginning
  179. or end of the text respectively.
  180. .CW Process
  181. represents an XML processing request, as found between
  182. .CW "<?....?>" '' ``
  183. delimiters.
  184. .CW Doctype
  185. and
  186. .CW Stylesheet
  187. are items found in an XML document's prolog, the
  188. former representing a
  189. .CW "<!DOCTYPE...>" '' ``
  190. document type declaration, and the latter an XML
  191. stylesheet processing request.
  192. .LP
  193. When most applications are processing documents, they
  194. will wish to ignore all items other than
  195. .CW Tag
  196. and
  197. .CW Text .
  198. To this end, it is conventional to define a ``front-end'' function
  199. to return desired items, discard others, and take an appropriate
  200. action when an error is encountered. Here's an example:
  201. .P1
  202. nextitem(p: ref Parser): ref Item
  203. {
  204. while ((gi := p.next()) != nil) {
  205. pick i := gi {
  206. Error =>
  207. sys->print("error at %s:%d: %s\n",
  208. i.loc.systemid, i.loc.line, i.msg);
  209. exit;
  210. Process =>
  211. ; # ignore
  212. Stylesheet =>
  213. ; # ignore
  214. Doctype =>
  215. ; # ignore
  216. * =>
  217. return gi;
  218. }
  219. }
  220. return nil;
  221. }
  222. .P2
  223. When
  224. .CW nextitem
  225. encounters an error, it exits; it might instead handle the
  226. error another way, say by raising an exception to be caught at the
  227. outermost level of the parsing code.
  228. .SH
  229. A small example
  230. .LP
  231. Suppose we have an XML document that contains some data that we would
  232. like to extract, ignoring the rest of the document. For this example we will
  233. assume that the data is held within
  234. .CW <data>
  235. tags, which contain zero or more
  236. .CW <item>
  237. tags, holding the actual data as text within them.
  238. Tags that we do not recognize we choose to ignore.
  239. So for example, given the following XML document:
  240. .P1
  241. <metadata>
  242. <a>hello</a>
  243. <b>goodbye</b>
  244. </metadata>
  245. <data>
  246. <item>one</item>
  247. <item>two</item>
  248. <item>three</item>
  249. </data>
  250. <data>
  251. <item>four</item>
  252. </data>
  253. .P2
  254. we wish to extract all the data items, but ignore everything inside
  255. the
  256. .CW <metadata>
  257. tag. First, let us define another little convenience function to get
  258. the next XML tag, ignoring extraneous items:
  259. .P1
  260. nexttag(p: ref Parser): ref Item.Tag
  261. {
  262. while ((gi := nextitem(p)) != nil) {
  263. pick i := gi {
  264. Tag =>
  265. return i;
  266. }
  267. }
  268. return nil;
  269. }
  270. .P2
  271. Assuming that the document has already been opened,
  272. the following function scans through the document, looking
  273. for top level
  274. .CW <data>
  275. tags, and ignoring others:
  276. .P1
  277. document(p: ref Parser)
  278. {
  279. while ((i := nexttag(p)) != nil) {
  280. if (i.name == "data") {
  281. p.down();
  282. data(p);
  283. p.up();
  284. }
  285. }
  286. }
  287. .P2
  288. The function to parse a
  289. .CW <data>
  290. tag is almost as straightforward; it scans for
  291. .CW <item>
  292. tags and extracts any textual data contained therein:
  293. .P1
  294. data(p: ref Parser)
  295. {
  296. while ((i := nexttag(p)) != nil) {
  297. if (i.name == "item") {
  298. p.down();
  299. if ((gni := p.next()) != nil) {
  300. pick ni := gni {
  301. Text =>
  302. sys->print("item data: %s\n", ni.ch);
  303. }
  304. }
  305. p.up();
  306. }
  307. }
  308. }
  309. .P2
  310. The above program is all very well and works fine, but
  311. suppose that the document that we are parsing is very
  312. large, with data items scattered through its length, and that
  313. we wish to access those items in an order that is not necessarily
  314. that in which they appear in the document.
  315. This is quite straightforward; every time we see a
  316. data item, we record the current position with a mark.
  317. Assuming the global declaration:
  318. .P1
  319. marks: list of ref Mark;
  320. .P2
  321. the
  322. .CW document
  323. function might become:
  324. .P1
  325. document(p: ref Parser)
  326. {
  327. while ((i := nexttag(p)) != nil) {
  328. if (i.name == "data") {
  329. p.down();
  330. marks = p.mark() :: marks;
  331. p.up();
  332. }
  333. }
  334. }
  335. .P2
  336. At some later time, we can access the data items arbitrarily,
  337. for instance:
  338. .P1
  339. for (m := marks; m != nil; m = tl m) {
  340. p.goto(hd m);
  341. data(p);
  342. }
  343. .P2
  344. If we wish to store the data item marks in some external index
  345. (in a file, perhaps), the
  346. .CW Mark
  347. adt provides a
  348. .CW str
  349. function which returns a string representation of the mark.
  350. .CW Parser 's
  351. .CW str2mark
  352. function can later be used to recover the mark. Care must
  353. be taken that the document it refers to has not been changed,
  354. otherwise it is likely that the mark will be invalid.
  355. .SH
  356. The eBook implementation
  357. .LP
  358. The Open eBook reader software uses the primitives described above
  359. to maintain display-page-based access to arbitrarily large documents
  360. while trying to bound memory usage.
  361. Unfortunately it is difficult to unconditionally bound memory usage,
  362. given that any element in an XML document may be arbitrarily
  363. large. For instance a perfectly legal document might have 100MB
  364. of continuous text containing no tags whatsoever. The described
  365. interface would attempt to put all this text in one single item, rapidly
  366. running out of memory! Similar types of problems can occur when
  367. gathering the items necessary to format a particular tag.
  368. For instance, to format the first row of a table, it is necessary to lay out
  369. the entire table to determine the column widths.
  370. .LP
  371. I chose to make the simplifying assumption that top-level items within
  372. the document would be small enough to fit into memory.
  373. From the point of view of the display module, the document
  374. looks like a simple sequence of items, one after another.
  375. One item might cover more than one page, in which case a different
  376. part of it will be displayed on each of those pages.
  377. .LP
  378. One difficulty is that the displayed size of an item depends on many
  379. factors, such as stylesheet parameters, size of installed fonts, etc.
  380. When a document is read, the page index must have been created
  381. from the same document with the same parameters. It is difficult in
  382. general to enumerate all the relevant parameters; they would need
  383. to be stored inside, or alongside the index; any change would invalidate
  384. the index. Instead of doing this, as the document is being displayed,
  385. the eBook display program constantly checks to see if the results
  386. it is getting from the index match with the results it is getting
  387. when actually laying out the document. If the results differ, the
  388. index is remade; the discrepancy will hopefully not be noticed by
  389. the user!