fossil.ms 31 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539540541542543544545546547548549550551552553554555556557558559560561562563564565566567568569570571572573574575576577578579580581582583584585586587588589590591592593594595596597598599600601602603604605606607608609610611612613614615616617618619620621622623624625626627628629630631632633634635636637638639640641642643644645646647648649650651652653654655656657658659660661662663664665666667668669670671672673674675676677678679680681682683684685686687688689690691692693694695696697698699700701702703704705706707708709710711712713714715716717718719720721722723724725726727728729730731732733734735736737738739740741742743744745746747748749750751752753754755756757758759760761762763764765766767768769770771772773774775776777778779780781782783784785786787788789790791792793794795796797798799800801802803804805806807808809810811812813814815816817818819820821822823824825826827828829830831832833834835836837838839840841842843844845846847848849850851852853854855856857858859860861862863864865866867868869870871872873874875876877878879880881882883884885886887888889890891892893894895896897898899900901902903904905906907908909910911912913914915916917918919920921922923924925926927928929930931932933934935936937938939940941942943944945946947948949950951952953954955956957958959960961962963964965966967968969970971972973974975976977978979980981982983984985986987988989990991992993994995996997998999100010011002100310041005100610071008100910101011101210131014101510161017101810191020102110221023102410251026102710281029103010311032103310341035103610371038103910401041104210431044104510461047104810491050105110521053105410551056105710581059106010611062106310641065106610671068106910701071107210731074107510761077107810791080108110821083108410851086108710881089109010911092109310941095109610971098109911001101110211031104110511061107110811091110111111121113111411151116111711181119112011211122112311241125112611271128112911301131113211331134113511361137113811391140114111421143114411451146114711481149115011511152115311541155115611571158115911601161116211631164
  1. ... .FP times
  2. ... .fp 1 R R.nomath
  3. ... .fp 5 CW LucidaSansCW83
  4. .TL
  5. Fossil, an Archival File Server
  6. .AU
  7. Sean Quinlan
  8. Jim McKie
  9. Russ Cox
  10. jmk,rsc@plan9.bell-labs.com
  11. .AB
  12. This paper describes the internals and
  13. operation of Fossil, an archival file server built for Plan 9.
  14. Fossil has not yet replaced the current Plan 9 file server
  15. and
  16. .CW kfs ,
  17. but that is our eventual intent.
  18. Both fossil and this documentation are
  19. works in progress. Comments on either are most welcome.
  20. .AE
  21. .de HP
  22. .LP
  23. ..
  24. .NH 1
  25. Introduction
  26. .HP
  27. Fossil is an archival file server built for Plan 9.
  28. In a typical configuration, it maintains a traditional file system
  29. in a local disk partition and periodically archives snapshots of the file system
  30. to a Venti server. These archives are made available through
  31. a file system interface.
  32. Fossil can also be run without a Venti server, in which case the
  33. snapshots (if any) occupy local disk space.
  34. .PP
  35. The bulk of this paper explains the underlying data structures:
  36. Venti trees, the Venti archival file system format, and finally Fossil's
  37. file system format.
  38. The end of the paper discusses the architecture of the Fossil server.
  39. .PP
  40. The presentation of the data structures is very detailed, perhaps
  41. too detailed for most readers.
  42. The intent is to record all the details necessary to make structural
  43. changes to the file system format.
  44. Feel free to jump ahead when boredom strikes.
  45. .NH 1
  46. Venti trees and directory hierarchies
  47. .HP
  48. Venti [3] is an archival block storage server.
  49. Once a block is stored, it can be retrieved by presenting the 20-byte
  50. SHA1 hash of its contents, called a
  51. .I score .
  52. Blocks on Venti have a maximum length of about 56 kilobytes,
  53. though in practice smaller blocks are used.
  54. To store a byte stream of arbitrary length, Venti uses a hash tree.
  55. Conceptually, the data stream is broken into fixed-size (say,
  56. .I dsize -byte)
  57. chunks, which are stored on the Venti server.
  58. The resulting scores are concatenated into a new pointer stream, which is
  59. broken into fixed size (say,
  60. .I psize -byte)
  61. chunks, which are stored on the Venti server.
  62. .I Psize "" (
  63. is different from
  64. .I dsize
  65. so that we can ensure that each pointer block holds an
  66. integral number of pointers.)
  67. This yields a new pointer stream, and so on, until there is a single block
  68. and finally a single score describing the entire tree.
  69. The resulting structure looks like:
  70. .PS
  71. .ps 8
  72. .vs 10
  73. boxht=0.1
  74. boxwid=0.1
  75. B0: box invis wid 1 "\f(CWVtDataType\fP"
  76. move right 0.1
  77. L0a: box wid 0.2
  78. move right 0.1
  79. L0b: box wid 0.2
  80. move right 0.1
  81. L0c: box invis wid 0.2 "..."
  82. move right 0.1
  83. L0d: box wid 0.2
  84. move right 0.1
  85. L0e: box wid 0.2
  86. move right 0.2
  87. L0f: box invis wid 0.2 "..."
  88. move right 0.2
  89. L0g: box wid 0.2
  90. move right 0.1
  91. L0h: box wid 0.2
  92. move right 0.1
  93. L0i: box invis wid 0.2 "..."
  94. move right 0.1
  95. L0j: box wid 0.2
  96. move right 0.1
  97. L0k: box wid 0.2
  98. move right 0.1
  99. L0l: box invis wid 0.2 "..."
  100. move right 0.1
  101. L0m: box wid 0.2
  102. define boxppddd {
  103. line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
  104. line from 0.4<$1.nw,$1.ne> to 0.4<$1.sw,$1.se>
  105. X: box invis at 0.1<$1.nw,$1.ne>
  106. Y: box invis at 0.1<$1.sw,$1.se>
  107. line -> from 0.5<X,Y> to $2.nw
  108. X: box invis at 0.3<$1.nw,$1.ne>
  109. Y: box invis at 0.3<$1.sw,$1.se>
  110. line -> from 0.5<X,Y> to $3.nw
  111. "..." at 0.7<$1.w,$1.e>
  112. }
  113. define boxppdddp {
  114. line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
  115. line from 0.4<$1.nw,$1.ne> to 0.4<$1.sw,$1.se>
  116. line from 0.8<$1.nw,$1.ne> to 0.8<$1.sw,$1.se>
  117. X: box invis at 0.1<$1.nw,$1.ne>
  118. Y: box invis at 0.1<$1.sw,$1.se>
  119. line -> from 0.5<X,Y> to $2.nw
  120. X: box invis at 0.3<$1.nw,$1.ne>
  121. Y: box invis at 0.3<$1.sw,$1.se>
  122. line -> from 0.5<X,Y> to $3.nw
  123. "..." at 0.6<$1.w,$1.e>
  124. X: box invis at 0.9<$1.nw,$1.ne>
  125. Y: box invis at 0.9<$1.sw,$1.se>
  126. line -> from 0.5<X,Y> to $4.nw
  127. }
  128. define boxpdddp {
  129. line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
  130. line from 0.8<$1.nw,$1.ne> to 0.8<$1.sw,$1.se>
  131. X: box invis at 0.1<$1.nw,$1.ne>
  132. Y: box invis at 0.1<$1.sw,$1.se>
  133. line -> from 0.5<X,Y> to $2.nw
  134. "..." at 0.5<$1.w,$1.e>
  135. X: box invis at 0.9<$1.nw,$1.ne>
  136. Y: box invis at 0.9<$1.sw,$1.se>
  137. line -> from 0.5<X,Y> to $3.nw
  138. }
  139. bhd=0.4
  140. L1abc: box wid 0.5 at 0.5<L0a, L0b>+(0,bhd)
  141. boxppddd(L1abc, L0a, L0b)
  142. L1def: box wid 0.5 at 0.5<L0d, L0e>+(0,bhd)
  143. boxppddd(L1def, L0d, L0e)
  144. L1ghi: box wid 0.5 at 0.5<L0g, L0h>+(0,bhd)
  145. boxppddd(L1ghi, L0g, L0h)
  146. L1jklm: box wid 0.5 at 0.5<L0j, L0k>+(0,bhd)
  147. boxppdddp(L1jklm, L0j, L0k, L0m)
  148. B1: box invis wid 1 "\f(CWVtPointerType0\fP" at B0+(0,bhd)
  149. L2abcdef: box wid 0.5 at 0.5<L1abc,L1def>+(0,bhd)
  150. boxppddd(L2abcdef, L1abc, L1def)
  151. L2ghijklm: box wid 0.5 at 0.5<L1ghi,L1jklm>+(0,bhd)
  152. boxpdddp(L2ghijklm, L1ghi, L1jklm)
  153. B2: box invis wid 1 "\f(CWVtPointerType1\fP" at B1+(0,bhd)
  154. L3atom: box wid 0.5 at 0.5<L2abcdef, L2ghijklm>+(0,bhd)
  155. boxpdddp(L3atom, L2abcdef, L2ghijklm)
  156. B3: box invis wid 1 "\f(CWVtPointerType2\fP" at B2+(0,bhd)
  157. .PE
  158. .LP
  159. The leaves are the original data stream. Those blocks have type
  160. .CW VtDataType .
  161. The first pointer stream has type
  162. .CW VtPointerType0 ,
  163. the next has type
  164. .CW VtPointerType1 ,
  165. and so on.
  166. The figure ends with a single block of type
  167. .CW VtPointerType2 ,
  168. but in general trees can have height up to
  169. .CW VtPointerType6 .
  170. For a
  171. .I dsize
  172. of 8192 bytes
  173. and
  174. .I psize
  175. of 8180 bytes (409 pointers),
  176. this gives a maximum stream size of approximately 10 zettabytes
  177. (2\s-2\u73\d\s+2 or 10\s-2\u22\d\s+2 bytes).
  178. .PP
  179. Data block are truncated to remove trailing runs of zeros before
  180. storage to Venti; they are zero-filled back to
  181. .I dsize
  182. bytes after retrieval from Venti.
  183. SImilarly, trailing runs of pointers to zero-length blocks are
  184. removed from and added back to pointer blocks.
  185. These simple rules happen to make it particularly efficient to store
  186. large runs of zeros, as might occur in a data stream with ``holes:''
  187. the zero-length block itself can be interpreted as a tree of any
  188. depth encoding an all-zero data stream.
  189. .PP
  190. Reconstructing the data stream requires the score and type of the
  191. topmost block in the tree, the data chunk size, the pointer chunk size,
  192. and the data stream size.
  193. (From the data stream size and the chunk sizes we could derive the
  194. depth of the tree and thus the type of the topmost block, but it is convenient
  195. to allow trees that are deeper than necessary.)
  196. This information is kept in a 40-byte structure called a
  197. .CW VtEntry :
  198. .P1
  199. VtEntry:
  200. .ta +\w' 'u +\w' 'u
  201. gen[4] \fRgeneration number\fP
  202. psize[2] \fRsize of pointer blocks\fP
  203. dsize[2] \fRsize of data blocks\fP
  204. flags[1]
  205. zero[5]
  206. size[6] \fRlength of file\fP
  207. score[20] \fRscore of root block in tree\fP
  208. .P2
  209. (In this notation,
  210. .CW name[sz]
  211. indicates a
  212. .CW sz -byte
  213. field called
  214. .CW name .
  215. Integers are stored in big-endian order.
  216. .CW Size
  217. really is a 48-bit field.)
  218. .CW Flags
  219. is made up of the following bit fields.
  220. .I or ' `
  221. of the following flags:
  222. .P1
  223. .ta +\w' 'u +\w' 'u
  224. 0x01 VtEntryActive \fRentry is allocated\fP
  225. 0x02 VtEntryDir \fRentry describes a Venti directory (q.v.)\fP
  226. 0x1C VtEntryDepthMask \fRmask for tree depth\fP
  227. 0x20 VtEntryLocal \fRreserved (q.v.)\fP
  228. .P2
  229. .LP
  230. The depth of the described tree is stored in the 5 bits indicated:
  231. a tree with a topmost node of type
  232. .CW VtPointerType3
  233. has depth 4.
  234. .PP
  235. With
  236. .CW VtEntry
  237. we can build more complicated data structures,
  238. ones with multiple or nested data streams.
  239. A data stream consisting of
  240. .CW VtEntry
  241. structures is called a Venti directory.
  242. It is identical in structure to the Venti data stream
  243. we described earlier except that the bottom-level type is
  244. .CW VtDirType ,
  245. and
  246. the
  247. .CW VtEntry
  248. describing a Venti directory has the
  249. .CW VtEntryDir
  250. flag bit set.
  251. The
  252. .I dsize
  253. for a Venti directory
  254. is a multiple of 40 so that each data chunk holds
  255. an integer number of
  256. .CW VtEntry
  257. structures.
  258. By analogy with Venti directories,
  259. we call the original data stream a
  260. Venti file.
  261. Note that Venti files are assumed
  262. .I not
  263. to contain pointers to other Venti blocks.
  264. The only pointers to Venti blocks occur in
  265. .CW VtEntry
  266. structures in
  267. Venti directories
  268. (and in the internal hash tree structure of the
  269. individual files and directories).
  270. Note also that these directories are nothing more than pointer lists.
  271. In particular, there are no names or metadata like in a file system.
  272. .PP
  273. To make it easier to pass hierarchies between applications,
  274. the root of a hierarchy is described in a 300-byte structure
  275. called a
  276. .CW VtRoot :
  277. .P1
  278. VtRoot:
  279. .ta +\w' 'u +\w' 'u
  280. version[2] \f(CW2\fP
  281. name[128] \fRname of structure (just a comment)\fP
  282. type[128] \fRstring describing structure (\f(CWvac\fR)\f(CW
  283. score[20] \fRpointer to \f(CWVtDirType\fP block\f(CW
  284. blockSize[2] \fRmaximum block size in structure\fP
  285. prev[20] \fRprevious \f(CWVtRoot\fP in chain, if any\fP
  286. .P2
  287. .LP
  288. This structure is stored to Venti and its score is passed
  289. between applications, typically in the form
  290. ``\fItype\f(CW:\fIrootscore\fR,''
  291. where
  292. .I type
  293. is the type field from the
  294. .CW VtRoot
  295. structure, and
  296. .I rootscore
  297. is the score of the
  298. .CW VtRoot
  299. block.
  300. .CW VtRoot
  301. structures can be chained together using the
  302. .I prev
  303. field to encode an archival history
  304. of the data structure.
  305. .PP
  306. For example, a small Venti hierarchy might look like:
  307. .PS
  308. .ps 8
  309. .vs 10
  310. boxwid=0.1
  311. boxht=0.1
  312. f=0.9
  313. mb=0.16
  314. VtRoot: [
  315. right
  316. B1: box
  317. move right 0.1
  318. "\f(CWVtRoot\fP" ljust
  319. ]
  320. Root: [
  321. right
  322. B1: box fill f
  323. B2: box fill f
  324. B3: box fill f
  325. move right 0.1
  326. ] with .nw at VtRoot.sw+(0.2,-.1)
  327. Level1: [
  328. RootMeta: [
  329. box wid mb
  330. ]
  331. MetaSource: [
  332. right
  333. B1: box wid 5*mb
  334. ] with .nw at RootMeta.sw+(0,-.1)
  335. Source: [
  336. right
  337. B1: box fill f
  338. B2: box fill f
  339. B3: box fill f
  340. B4: box fill f
  341. B5: box fill f
  342. B6: box fill f
  343. B7: box fill f
  344. B8: box fill f
  345. ] with .nw at MetaSource.sw+(0,-.1)
  346. SB1: box invis at Source.B1
  347. SB2: box invis at Source.B2
  348. SB3: box invis at Source.B3
  349. ] with .nw at Root.sw+(0.4,-.1)
  350. Level2: [
  351. MetaSource: [
  352. right
  353. B1: box wid 5*mb
  354. ]
  355. Source: [
  356. right
  357. B1: box fill f
  358. B2: box fill f
  359. B3: box fill f
  360. B4: box fill f
  361. B5: box fill f
  362. B6: box fill f
  363. B7: box fill f
  364. B8: box fill f
  365. ] with .nw at MetaSource.sw+(0,-.1)
  366. File: box wid 0.8 with .nw at Source.sw+(0,-.1)
  367. ] with .nw at Level1.sw+(0.6,-.1)
  368. line -> from VtRoot.B1 down boxwid/2+0.1+boxwid/2 then to Root.w
  369. line -> from Root.B3 down boxwid/2+0.1+boxwid/2 then to Level1.RootMeta.w
  370. line -> from Root.B2 down boxwid/2+0.1+boxwid+0.1+boxwid/2 then to Level1.MetaSource.w
  371. line -> from Root.B1 down boxwid/2+0.1+boxwid+0.1+boxwid+0.1+boxwid/2 then to Level1.Source.w
  372. line -> from Level1.SB3 down boxwid/2+0.1+boxwid/2 then to Level2.MetaSource.w
  373. line -> from Level1.SB2 down boxwid/2+0.1+boxwid+0.1+boxwid/2 then to Level2.Source.w
  374. line -> from Level1.SB1 down boxwid/2+0.1+boxwid+0.1+boxwid+0.1+boxwid/2 then to Level2.File.w
  375. [
  376. KEY: box wid 1.5 invis "Key"
  377. line from KEY.sw to KEY.se
  378. k = -.1
  379. kk=0.5
  380. A: [
  381. box wid 4*boxwid
  382. "Venti file" ljust with .w at last box .w+(kk,0)
  383. ] with .nw at KEY.sw+(0,2*k)
  384. B: [
  385. box fill f
  386. "Venti entry (\f(CWVtEntry\fP)" ljust with .w at last box .w+(kk,0)
  387. ] with .nw at A.sw+(0,k)
  388. C: [
  389. right
  390. CC: box fill f
  391. box fill f
  392. box fill f
  393. box fill f
  394. "Venti directory" ljust with .w at CC.w+(kk,0)
  395. ] with .nw at B.sw+(0,k)
  396. D: [
  397. line -> right 3*boxwid
  398. "Venti pointer (score)" ljust with .w at last line .w+(kk, 0)
  399. ] with .nw at C.sw+(0,k)
  400. ] with .nw at VtRoot.nw+(3,0)
  401. .PE
  402. .LP
  403. Venti files are shown as white boxes, while directories are shown
  404. as shaded boxes. Each shaded square represents a
  405. .CW VtEntry .
  406. Arrows represent pointers from
  407. .CW VtEntry
  408. structures to other
  409. Venti files or directories.
  410. .PP
  411. The hierarchical structure provided by Venti files and directories
  412. can be used as the base for more complicated data structures.
  413. Because this structure captures all the information
  414. about pointers to other blocks, tools written to traverse
  415. Venti hierarchies can traverse the more complicated
  416. data structures as well.
  417. For example,
  418. .I venti/copy
  419. (see
  420. .I ventiaux (8))
  421. copies a Venti hierarchy from one Venti server to another,
  422. given the root
  423. .CW VtEntry .
  424. Because the traditional file system described in later sections is
  425. layered on a Venti hierarchy,
  426. .I venti/copy
  427. can copy it without fully understanding its structure.
  428. .NH 1
  429. Vac file system format
  430. .HP
  431. The Venti archive format
  432. .I vac
  433. builds a traditional file system using a Venti hierarchy.
  434. Each vac file is implemented as a Venti file;
  435. each vac directory is implemented as a Venti
  436. directory and a Venti file to provide traditional file system metadata.
  437. The metadata is stored in a structure called a
  438. .CW DirEntry :
  439. .P1
  440. DirEntry:
  441. .ta +\w' 'u +\w' 'u
  442. magic[4] \f(CW0x1c4d9072\fP (DirMagic)\fP
  443. version[2] \f(CW9\fP
  444. elem[s] \fRname (final path element only)\fP
  445. entry[4] \fRentry number for Venti file or directory\fP
  446. gen[4] \fRgeneration number\fP
  447. mentry[4] \fRentry number for Venti file holding metadata\fP
  448. mgen[4] \fRgeneration number\fP
  449. qid[8] \fRunique file serial number\fP
  450. uid[s] \fRowner\fP
  451. gid[s] \fRgroup\fP
  452. mid[s] \fRlast modified by\fP
  453. mtime[4] \fRlast modification time\fP
  454. ctime[4] \fRcreation time\fP
  455. atime[4] \fRlast access time\fP
  456. mode[4] \fRmode bits\fP
  457. .P2
  458. The notation
  459. .CW name[s]
  460. denotes a string stored as a two-byte length
  461. and then that many bytes.
  462. The above describes Version 9 of the
  463. .CW DirEntry
  464. format. Versions 7 and 8 are very similar; they can be
  465. read by the current
  466. .I vac
  467. source code but are not written.
  468. Earlier versions were not widespread.
  469. A
  470. .CW DirEntry
  471. may be followed by optional extension sections, though none
  472. are currently used.
  473. The
  474. .CW mode
  475. bits include bits commonly used by
  476. Unix and Windows, in addition to those used by Plan 9.
  477. .PP
  478. The
  479. .CW entry
  480. field is an index into the parallel Venti directory.
  481. The
  482. .CW gen
  483. field must match the
  484. .CW gen
  485. field in the corresponding
  486. .CW VtEntry
  487. in the directory;
  488. it is used to detect
  489. stale indices.
  490. Similarly,
  491. .CW mentry
  492. and
  493. .CW mgen
  494. are the index and generation number
  495. for the metadata Venti file,
  496. if the
  497. .CW DirEntry
  498. describes a vac directory.
  499. .PP
  500. The relation between Venti files and directories and
  501. vac files and directories can be seen in this figure:
  502. .PS
  503. .ps 8
  504. .vs 10
  505. boxwid=0.1
  506. boxht=0.1
  507. f=0.9
  508. mb=0.16
  509. VtRoot: [
  510. right
  511. B1: box
  512. move right 0.1
  513. "\f(CWVtRoot\fP" ljust
  514. ]
  515. SuperRoot: [
  516. right
  517. B1: box fill f
  518. move right 0.1
  519. "fs root block" ljust
  520. ] with .nw at VtRoot.sw + (0.2, -.2)
  521. Root: [
  522. right
  523. B1: box fill f
  524. B2: box fill f
  525. B3: box fill f
  526. move right 0.1
  527. "root directory info block" ljust
  528. ] with .nw at SuperRoot.sw+(0.2, -.2)
  529. Level1: [
  530. RootMeta: [
  531. box wid mb
  532. move right 0.1
  533. "root metadata" ljust
  534. ]
  535. MetaSource: [
  536. right
  537. B1: box wid mb
  538. B2: box wid mb
  539. B3: box wid mb
  540. B4: box wid mb
  541. B5: box wid mb
  542. ] with .nw at RootMeta.sw+(0,-.2)
  543. MB1: box wid mb invis at MetaSource.B1
  544. MB2: box wid mb invis at MetaSource.B2
  545. MB3: box wid mb invis at MetaSource.B3
  546. MB4: box wid mb invis at MetaSource.B4
  547. MB5: box wid mb invis at MetaSource.B5
  548. Source: [
  549. right
  550. B1: box fill f
  551. B2: box fill f
  552. B3: box fill f
  553. B4: box fill f
  554. B5: box fill f
  555. B6: box fill f
  556. B7: box fill f
  557. B8: box fill f
  558. ] with .nw at MetaSource.sw+(0,-.1)
  559. SB1: box invis at Source.B1
  560. SB2: box invis at Source.B2
  561. SB3: box invis at Source.B3
  562. SB4: box invis at Source.B4
  563. SB5: box invis at Source.B5
  564. SB6: box invis at Source.B6
  565. SB7: box invis at Source.B7
  566. SB8: box invis at Source.B8
  567. ] with .nw at Root.sw+(0.4,-.2)
  568. Level2: [
  569. MetaSource: [
  570. right
  571. B1: box wid mb
  572. B2: box wid mb
  573. B3: box wid mb
  574. B4: box wid mb
  575. B5: box wid mb
  576. ]
  577. Source: [
  578. right
  579. B1: box fill f
  580. B2: box fill f
  581. B3: box fill f
  582. B4: box fill f
  583. B5: box fill f
  584. B6: box fill f
  585. B7: box fill f
  586. B8: box fill f
  587. ] with .nw at MetaSource.sw+(0,-.1)
  588. File: box wid 0.8 with .nw at Source.sw+(0,-.2)
  589. ] with .nw at Level1.sw+(0.6,-.2)
  590. line -> from VtRoot.B1 down boxwid/2+0.2+boxwid/2 then to SuperRoot.w
  591. line -> from SuperRoot.B1 down boxwid/2+0.2+boxwid/2 then to Root.w
  592. line -> from Root.B3 down boxwid/2+0.2+boxwid/2 then to Level1.RootMeta.w
  593. line -> from Root.B2 down boxwid/2+0.2+boxwid+0.2+boxwid/2 then to Level1.MetaSource.w
  594. line -> from Root.B1 down boxwid/2+0.2+boxwid+0.1+boxwid+0.2+boxwid/2 then to Level1.Source.w
  595. line -> from Level1.SB3 down boxwid/2+0.2+boxwid/2 then to Level2.MetaSource.w
  596. line -> from Level1.SB2 down boxwid/2+0.2+boxwid+0.1+boxwid/2 then to Level2.Source.w
  597. line -> from Level1.SB1 down boxwid/2+0.2+boxwid+0.1+boxwid+0.2+boxwid/2 then to Level2.File.w
  598. arrowwid = arrowwid/2
  599. arrowht = arrowht/2
  600. line -> from Level1.MB1 to Level1.SB1.n
  601. line -> from Level1.MB2 to Level1.SB2.n
  602. line -> from Level1.MB2 to Level1.SB3.n
  603. line -> from Level1.MB4 to Level1.SB7.n
  604. line -> from Level1.MB5 to Level1.SB5.n
  605. arrowwid = arrowwid * 2
  606. arrowht = arrowht * 2
  607. box dashed with .nw at Level1.MetaSource.nw+(-.05,.05) wid 0.8+.05*2 ht .3+.05*2
  608. box dashed with .nw at Level2.MetaSource.nw+(-.05,.05) wid 0.8+.05*2 ht .3+.05*2
  609. box dotted with .nw at Level2.File.nw+(-.05,.05) wid 0.8+0.05*2 ht .1+.05*2
  610. [
  611. KEY: box wid 1.5 invis "Key"
  612. line from KEY.sw to KEY.se
  613. k = -.1
  614. kk=0.5
  615. A: [
  616. box wid 4*boxwid
  617. "Venti file" ljust with .w at last box .w+(kk,0)
  618. ] with .nw at KEY.sw+(0,2*k)
  619. B: [
  620. box fill f
  621. "Venti entry (\f(CWEntry\fP)" ljust with .w at last box .w+(kk,0)
  622. ] with .nw at A.sw+(0,k)
  623. C: [
  624. right
  625. CC: box fill f
  626. box fill f
  627. box fill f
  628. box fill f
  629. "Venti directory" ljust with .w at CC.w+(kk,0)
  630. ] with .nw at B.sw+(0,k)
  631. D: [
  632. line -> right 3*boxwid
  633. "Venti pointer (score)" ljust with .w at last line .w+(kk, 0)
  634. ] with .nw at C.sw+(0,k)
  635. DD: [
  636. box dotted wid 4*boxwid
  637. "Vac file" ljust with .w at last box .w+(kk,0)
  638. ] with .nw at D.sw+(0,k)
  639. E: [
  640. box wid mb
  641. "Vac entry (\f(CWDirEntry\fP)" ljust with .w at last box .w+(kk,0)
  642. ] with .nw at DD.sw+(0,k)
  643. G: [
  644. box dashed wid 4*boxwid
  645. "Vac directory" ljust with .w at last box .w+(kk,0)
  646. ] with .nw at E.sw+(0,k)
  647. H: [
  648. arrowwid = arrowwid/2
  649. arrowht = arrowht/2
  650. line -> right 1.5*boxwid
  651. "Vac pointer (integer index)" ljust with .w at last line .w+(kk, 0)
  652. arrowwid = arrowwid * 2
  653. arrowht = arrowht * 2
  654. ] with .nw at G.sw+(0,k)
  655. ] with .nw at VtRoot.nw+(3,0)
  656. .PE
  657. .LP
  658. In reality, the story is slightly more complicated.
  659. The metadata file in a Vac directory
  660. is not just the concatenation of
  661. .CW DirEntry
  662. structures.
  663. Instead, it is the concatenation of
  664. .CW MetaBlocks .
  665. A
  666. .CW MetaBlock
  667. contains some number of
  668. .CW DirEntry
  669. structures along with a sorted index to make it easy
  670. to look for a particular
  671. .CW DirEntry
  672. by its
  673. .CW elem
  674. field.
  675. The details are in the source code.
  676. .PP
  677. As shown in the diagram,
  678. the root directory of the file system is summarized by
  679. three
  680. .CW VtEntry
  681. structures describing
  682. the Venti directory for the children of the root,
  683. the Venti file for the metadata describing the children of the root,
  684. and a Venti file holding metadata for the root directory itself.
  685. These
  686. .CW VtEntry
  687. structures are placed in a Venti directory of their own,
  688. described by the single
  689. .CW VtEntry
  690. in the
  691. root block.
  692. .NH 1
  693. Fossil file system format
  694. .HP
  695. Fossil uses the vac format, with some small changes.
  696. The changes only affect the data on the local disk; the data
  697. archived to Venti is exactly in vac format.
  698. .PP
  699. Blocks stored on local disk may contain scores pointing at local disk
  700. blocks or at Venti blocks.
  701. Local block addresses are stored as 20-byte scores in which the first 16 bytes
  702. are all zero and the last 4 bytes specify a block number in the disk.
  703. Before a block is archived, all the
  704. blocks it points to must be archived, and the local scores in the block
  705. must be changed to Venti scores.
  706. Using block addresses rather than content hashes for local data
  707. makes the local file system easier to manage: if a local block's contents
  708. change, the pointer to the block does not need to change.
  709. .NH 2
  710. Snapshots
  711. .HP
  712. Fossil is an archival file server.
  713. It takes periodic snapshots of the file system,
  714. which are made accessible through the file system.
  715. Specifically, the active file system is presented in
  716. .CW /active .
  717. Ephemeral snapshots (those that are kept on local disk and eventually deleted)
  718. are presented in
  719. \f(CW/snapshot/\fIyyyy\f(CW/\fImmdd\f(CW/\fIhhmm\fR,
  720. where
  721. .I yyyy
  722. is the full year,
  723. .I mm
  724. is the month number,
  725. .I dd
  726. is the day number,
  727. .I hh
  728. is the hour,
  729. and
  730. .I mm
  731. is the minute.
  732. Archival snapshots (those that are archived to Venti and persist forever)
  733. are presented in
  734. \f(CW/archive/\fIyyyy\f(CW/\fImmdds\fR,
  735. where
  736. .I yyyy ,
  737. .I mm ,
  738. and
  739. .I dd
  740. are year, month, and day as before,
  741. and
  742. .I s
  743. is a sequence number if more than one
  744. archival snapshot is done in a day.
  745. For the first snapshot,
  746. .I s
  747. is null.
  748. For the subsequent snapshots,
  749. .I s
  750. is
  751. .CW .1 ,
  752. .CW .2 ,
  753. .CW .3 ,
  754. etc.
  755. .PP
  756. To implement the snapshots, the file server maintains a
  757. current
  758. .I epoch
  759. for the active file system.
  760. Each local block has a label that records, among other things,
  761. the epoch in which the block was allocated.
  762. If a block was allocated in an epoch earlier than the current one,
  763. it is immutable and treated as copy-on-write.
  764. Taking a snapshot can be accomplished by
  765. recording the address of the current root block and then
  766. incrementing the epoch number.
  767. Notice that the copy-on-write method makes
  768. snapshots both time efficient and space efficient.
  769. The only time cost is waiting for all current file system
  770. requests to finish and then incrementing a counter.
  771. After a snapshot, blocks only get copied when they are
  772. next modified, so the per-snapshot
  773. space requirement is proportional
  774. to the amount of new data rather than the total
  775. size of the file system.
  776. .PP
  777. The blocks in the archival snapshots are moved to Venti,
  778. but the blocks in the ephemeral snapshots take up space
  779. in the local disk file.
  780. To allow reclamation of this disk space, the file system
  781. maintains a
  782. .I low
  783. .I epoch ,
  784. which is the epoch of the earliest ephemeral snapshot
  785. still available.
  786. Fossil only allows access to snapshots with epoch numbers
  787. between the
  788. low epoch and the current epoch
  789. (also called the high epoch).
  790. Incrementing the low epoch thus makes old
  791. snapshots inaccessible.
  792. The space required to store those snapshots can then
  793. be reclaimed, as described below.
  794. .NH 2
  795. Local blocks
  796. .HP
  797. The bulk of the local disk file is the local blocks.
  798. Each block has a 14-byte label associated with it, of the format:
  799. .P1
  800. Label:
  801. .ta +\w' 'u +\w' 'u
  802. state[1] \fRblock state\fP
  803. type[1] \fRblock type\fP
  804. epoch[4] \fRallocation epoch\fP
  805. epochClose[4] \fRclose epoch\fP
  806. tag[4] \fRrandom tag\fP
  807. .P2
  808. .LP
  809. The
  810. .CW type
  811. is an analogue of the block types described earlier,
  812. though different names are used, to distinguish between
  813. pointers blocks in a hash tree for a data stream
  814. and pointer blocks for a directory stream.
  815. The
  816. .CW epoch
  817. was mentioned in the last section.
  818. The other fields are explained below.
  819. .PP
  820. There are two distinguished blocks states
  821. .CW BsFree
  822. .CW 0x00 ) (
  823. and
  824. .CW BsBad
  825. .CW 0xFF ), (
  826. which mark blocks that are available for allocation
  827. and blocks that are bad and should be avoided.
  828. If
  829. .CW state
  830. is not one of these values, it is a bitwise
  831. .I or ' `
  832. of the following flags:
  833. .P1
  834. .ta +\w' 'u +\w' 'u
  835. 0x01 BsAlloc \fRblock is in use\fP
  836. 0x02 BsCopied \fRblock has been copied\fP
  837. 0x04 BsVenti \fRblock has been stored on Venti\fP
  838. 0x08 BsClosed \fRblock has been unlinked from active file system\fP
  839. .P2
  840. .LP
  841. The flags are explained as they arise in the discussions below.
  842. .PP
  843. It is convenient to store some extra fields in the
  844. .CW VtEntry
  845. structure when it describes a Venti file or directory
  846. stored on local disk.
  847. Specifically, we set the
  848. .CW VtEntryLocal
  849. flag bit
  850. and then use the bytes 7-16 of the score (which would
  851. otherwise be zero, since it is a local score) to hold these fields:
  852. .P1
  853. .ta +\w' 'u +\w' 'u
  854. archive[1] \fRboolean: this is an archival snapshot\fP
  855. snap[4] \fRepoch number if root of snapshot\fP
  856. tag[4] \fRrandom tag\fP
  857. .P2
  858. .LP
  859. The extended
  860. .CW VtEntry
  861. structure is called an
  862. .CW Entry .
  863. The
  864. .CW tag
  865. field
  866. in the
  867. .CW Label
  868. and the
  869. .CW Entry
  870. is used to identify dangling pointers or other file system corruption:
  871. all the local blocks in a hash tree must
  872. have tags matching the tag in the
  873. .CW Entry .
  874. If this
  875. .CW Entry
  876. points at the root of a snapshot,
  877. the
  878. .CW snap
  879. field is the epoch of the snapshot.
  880. If the snapshot is intended to be archived to Venti,
  881. the
  882. .CW archive
  883. field is non-zero.
  884. .NH 2
  885. Block reclamation
  886. .HP
  887. The blocks in the active file system form a tree: each
  888. block has only one parent.
  889. Once a copy-on-write block
  890. .I b
  891. is replaced by its copy, it is no longer
  892. needed by the active file system.
  893. At this point,
  894. .I b
  895. is unlinked from the active file system.
  896. We say that
  897. .I b
  898. is now
  899. .I closed :
  900. it is needed only for snapshots.
  901. When a block is closed, the
  902. .CW BsClosed
  903. bit is set in its state, and the current epoch (called the block's closing epoch)
  904. is stored in the
  905. .CW epochClose
  906. label field.
  907. (Open blocks have an
  908. .CW epochClose
  909. of
  910. .CW ~0 ).
  911. .PP
  912. A block is referenced by snapshots with epochs
  913. between the block's allocation epoch and its closing epoch.
  914. Once the file system's low epoch grows to be greater than or equal to the block's
  915. closing epoch, the block is no longer needed for any snapshots
  916. and can be reused.
  917. .PP
  918. In a typical configuration, where nightly archival snapshots
  919. are taken and written to Venti, it is desirable to reclaim
  920. the space occupied by now-archived blocks if possible.
  921. To do this, Fossil keeps track of whether the pointers
  922. in each block are unique to that block.
  923. When a block
  924. .I bb
  925. is allocated, a pointer to
  926. .I bb
  927. is written into exactly one active block (say,
  928. .I b ).
  929. In the absence of snapshots, the pointer to
  930. .I bb
  931. will remain unique to
  932. .I b ,
  933. so that if the pointer is zeroed,
  934. .I bb
  935. can be immediately reused.
  936. Snapshots complicate this invariant:
  937. when
  938. .I b
  939. is copied-on-write, all its pointers
  940. are no longer unique to it.
  941. At time of the copy, the
  942. .CW BsCopied
  943. state bit in the block's label
  944. is set to note the duplication of the pointers contained within.
  945. .NH 2
  946. Disk layout
  947. .HP
  948. The file system header describes the file system layout and has this format:
  949. .P1
  950. .ta +\w' 'u +\w' 'u
  951. Header:
  952. magic[4] \fR0x3776AE89 (HeaderMagic)\fP
  953. version[2] \fR1 (HeaderVersion)\fP
  954. blockSize[2] \fIfile system block size\fP
  955. super[4] \fRblock offset of super block\fP
  956. label[4] \fRblock offset of labels\fP
  957. data[4] \fRdata blocks\fP
  958. end[4] \fRend of file system\fP
  959. .P2
  960. .LP
  961. The corresponding file system layout is:
  962. .PS
  963. .ps 8
  964. .vs 9
  965. boxwid=0.75
  966. boxht=0.15
  967. Empty: box "empty" ht 0.25
  968. Header: box "header" with .n at Empty.s
  969. Empty2: box "empty" with .n at Header.s
  970. Super: box "super block" with .n at Empty2.s
  971. Label: box "label" "blocks" with .n at Super.s ht 0.25
  972. Data: box "data" "blocks" with .n at Label.s ht 0.3
  973. " 0" ljust at Empty.ne
  974. " 128kB" ljust at Header.ne
  975. " \f5super\fP \(mu \f(CWblockSize\fP" ljust at Super.ne
  976. " \f5label\fP \(mu \f(CWblockSize\fP" ljust at Label.ne
  977. " \f5data\fP \(mu \f(CWblockSize\fP" ljust at Data.ne
  978. " \f5end\fP \(mu \f(CWblockSize\fP" ljust at Data.se
  979. "" at (-1,0)
  980. "" at (6,0)
  981. .PE
  982. .LP
  983. The numbers to the right of the blocks are byte offsets
  984. of the boundaries.
  985. .LP
  986. The super block describes the file system itself and looks like:
  987. .P1
  988. .ta +\w' 'u +\w' 'u
  989. Super:
  990. magic[4] \fR0x2340A3B1 (SuperMagic)\fP
  991. version[2] \fR1 (SuperVersion)\fP
  992. epochLow[4] \fRfile system low epoch\fP
  993. epochHigh[4] \fRfile system high (active) epoch\fP
  994. qid[8] \fRnext qid to allocate\fP
  995. active[4] \fRdata block number: root of active file system\fP
  996. next[4] \fRdata block number: root of next file system to archive\fP
  997. current[4] \fRdata block number: root of file system currently being archived\fP
  998. last[20] \fRVenti score of last successful archive\fP
  999. name[128] \fRname of file system (just a comment)\fP
  1000. .P2
  1001. .LP
  1002. .NH 1
  1003. Fossil server
  1004. .HP
  1005. The Fossil server is a user-space program that runs on a standard Plan 9 kernel.
  1006. .NH 2
  1007. Process structure
  1008. .PP
  1009. The file server is structured as a set of processes synchronizing
  1010. mostly through message passing along queues.
  1011. The processes are given names, which can be seen in the output of
  1012. .CW ps
  1013. .CW -a .
  1014. .PP
  1015. .CW Listen
  1016. processes announce on various network addresses.
  1017. A
  1018. .CW con
  1019. process handles each incoming connection, reading 9P requests
  1020. and adding them to a central message queue.
  1021. .CW Msg
  1022. processes remove 9P requests from the queue,
  1023. handle them, and write the responses to the appropriate
  1024. file descriptors.
  1025. .PP
  1026. The
  1027. .CW disk
  1028. process handles disk I/O requests made by the other processes.
  1029. The
  1030. .CW flush
  1031. process writes dirty blocks from the in-memory block cache to disk.
  1032. The
  1033. .CW unlink
  1034. process frees previously linked blocks once the blocks that point at them
  1035. have been written to disk.
  1036. .PP
  1037. A
  1038. .CW consI
  1039. reads from each console file (typically a pipe posted in
  1040. .CW /srv ),
  1041. adding the typed characters to the input queue.
  1042. The
  1043. .CW cons
  1044. process echoes input and runs the commands, saving
  1045. output in a ring buffer.
  1046. Because there is only one
  1047. .CW cons
  1048. process, only one console command may be executing at a time.
  1049. A
  1050. .CW consO
  1051. process copies this ring buffer to the each console file.
  1052. .PP
  1053. The
  1054. .CW periodic
  1055. process runs periodic events, like
  1056. flushing the root metadata to disk or
  1057. taking snapshots of the file system.
  1058. .NH 2
  1059. Block cache
  1060. .HP
  1061. Fossil maintains an in-memory block cache which
  1062. holds both local disk blocks and Venti blocks.
  1063. Cache eviction follows a least recently used policy.
  1064. Dirty blocks are restricted to at most half the cache.
  1065. This can be changed by editing
  1066. .CW DirtyPercentage
  1067. in
  1068. .CW dat.h .
  1069. .PP
  1070. The block cache uses soft updates [1] to ensure that the on-disk
  1071. file system is always self-consistent.
  1072. Thus there is no
  1073. .I halt
  1074. console command
  1075. and no need to check a file system
  1076. that was shut down without halting.
  1077. .NH 2
  1078. Archiving
  1079. .HP
  1080. A background process writes blocks in archival snapshots to Venti.
  1081. Although
  1082. .CW /archive/\fIyyyy\fP/\fImmdds\fR
  1083. is a copy of only
  1084. .CW /active
  1085. at the time of the snapshot,
  1086. the archival process archives the
  1087. entire file tree rather than just
  1088. the subtree rooted at
  1089. .CW /active .
  1090. The snapshots
  1091. .CW /snapshot/\fIyyyy\fP/\fImmdd\fP/\fIhhmm
  1092. are stored as empty directories.
  1093. Once all the blocks have been archived,
  1094. a
  1095. .CW VtRoot
  1096. header for the file system is archived.
  1097. The score of that header is recorded in
  1098. .CW super.score
  1099. and also printed on the file server console.
  1100. The score can used by
  1101. .I flfmt
  1102. to restore a file system (see
  1103. .I fossil (4)).
  1104. .NH 2
  1105. Contrast with the old file server
  1106. .HP
  1107. The most obvious difference between Fossil and the
  1108. old Plan 9 file server [2] is that Fossil uses a Venti server as
  1109. its archival storage in place of a WORM juke box.
  1110. There are a few other architectural differences to be
  1111. aware of.
  1112. .PP
  1113. Fossil is a user-level program run on a standard kernel.
  1114. .PP
  1115. Fossil does not have any way to concatenate, stripe, or
  1116. mirror disk files. For functionality similar to the old file server's
  1117. configuration strings, use the experimental file stack device
  1118. (see
  1119. .I fs (3)).
  1120. .PP
  1121. Fossil speaks only 9P2000. Old 9P (aka 9P1) is not supported.
  1122. .PP
  1123. ... XXX words about converting an old file system to fossil?
  1124. .NH 1
  1125. References
  1126. .LP
  1127. [1] Gregory R. Ganger, Marshall Kirk McKusick, Craig A. N. Soules,
  1128. and Yale N. Patt.
  1129. ``Soft Updates: A Solution to the Metadata Update Problem
  1130. in File Systems,''
  1131. .I "ACM Transactions on Computer Systems" ,
  1132. Vol 18., No. 2, May 2000, pp. 127\-153.
  1133. .LP
  1134. [2] Sean Quinlan, ``A Cached WORM File System,''
  1135. .I "Software\(emPractice and Experience" ,
  1136. Vol 21., No 12., December 1991, pp. 1289\-1299.
  1137. .LP
  1138. [3] Sean Quinlan and Sean Dorward, ``Venti: A New Approach to Archival Storage,''
  1139. .I "Usenix Conference on File and Storage Technologies" ,
  1140. 2002.