1
0

fossil.ms 31 KB

1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677787980818283848586878889909192939495969798991001011021031041051061071081091101111121131141151161171181191201211221231241251261271281291301311321331341351361371381391401411421431441451461471481491501511521531541551561571581591601611621631641651661671681691701711721731741751761771781791801811821831841851861871881891901911921931941951961971981992002012022032042052062072082092102112122132142152162172182192202212222232242252262272282292302312322332342352362372382392402412422432442452462472482492502512522532542552562572582592602612622632642652662672682692702712722732742752762772782792802812822832842852862872882892902912922932942952962972982993003013023033043053063073083093103113123133143153163173183193203213223233243253263273283293303313323333343353363373383393403413423433443453463473483493503513523533543553563573583593603613623633643653663673683693703713723733743753763773783793803813823833843853863873883893903913923933943953963973983994004014024034044054064074084094104114124134144154164174184194204214224234244254264274284294304314324334344354364374384394404414424434444454464474484494504514524534544554564574584594604614624634644654664674684694704714724734744754764774784794804814824834844854864874884894904914924934944954964974984995005015025035045055065075085095105115125135145155165175185195205215225235245255265275285295305315325335345355365375385395405415425435445455465475485495505515525535545555565575585595605615625635645655665675685695705715725735745755765775785795805815825835845855865875885895905915925935945955965975985996006016026036046056066076086096106116126136146156166176186196206216226236246256266276286296306316326336346356366376386396406416426436446456466476486496506516526536546556566576586596606616626636646656666676686696706716726736746756766776786796806816826836846856866876886896906916926936946956966976986997007017027037047057067077087097107117127137147157167177187197207217227237247257267277287297307317327337347357367377387397407417427437447457467477487497507517527537547557567577587597607617627637647657667677687697707717727737747757767777787797807817827837847857867877887897907917927937947957967977987998008018028038048058068078088098108118128138148158168178188198208218228238248258268278288298308318328338348358368378388398408418428438448458468478488498508518528538548558568578588598608618628638648658668678688698708718728738748758768778788798808818828838848858868878888898908918928938948958968978988999009019029039049059069079089099109119129139149159169179189199209219229239249259269279289299309319329339349359369379389399409419429439449459469479489499509519529539549559569579589599609619629639649659669679689699709719729739749759769779789799809819829839849859869879889899909919929939949959969979989991000100110021003100410051006100710081009101010111012101310141015101610171018101910201021102210231024102510261027102810291030103110321033103410351036103710381039104010411042104310441045104610471048104910501051105210531054105510561057105810591060106110621063106410651066106710681069107010711072107310741075107610771078107910801081108210831084108510861087108810891090109110921093109410951096109710981099110011011102110311041105110611071108110911101111111211131114111511161117111811191120112111221123112411251126112711281129113011311132113311341135113611371138113911401141114211431144114511461147114811491150115111521153115411551156115711581159116011611162
  1. ... .FP times
  2. ... .fp 1 R R.nomath
  3. ... .fp 5 CW LucidaSansCW83
  4. .TL
  5. Fossil, an Archival File Server
  6. .AU
  7. Sean Quinlan
  8. Jim McKie
  9. Russ Cox
  10. jmk,rsc@plan9.bell-labs.com
  11. .AB
  12. This paper describes the internals and
  13. operation of Fossil, an archival file server built for Plan 9.
  14. Fossil has not yet replaced the current Plan 9 file server
  15. and
  16. .CW kfs ,
  17. but that is our eventual intent.
  18. Both fossil and this documentation are
  19. works in progress. Comments on either are most welcome.
  20. .AE
  21. .de HP
  22. .LP
  23. ..
  24. .NH 1
  25. Introduction
  26. .HP
  27. Fossil is an archival file server built for Plan 9.
  28. In a typical configuration, it maintains a traditional file system
  29. in a local disk partition and periodically archives snapshots of the file system
  30. to a Venti server. These archives are made available through
  31. a file system interface.
  32. Fossil can also be run without a Venti server, in which case the
  33. snapshots (if any) occupy local disk space.
  34. .PP
  35. The bulk of this paper explains the underlying data structures:
  36. Venti trees, the Venti archival file system format, and finally Fossil's
  37. file system format.
  38. The end of the paper discusses the architecture of the Fossil server.
  39. .PP
  40. The presentation of the data structures is very detailed, perhaps
  41. too detailed for most readers.
  42. The intent is to record all the details necessary to make structural
  43. changes to the file system format.
  44. Feel free to jump ahead when boredom strikes.
  45. .NH 1
  46. Venti trees and directory hierarchies
  47. .HP
  48. Venti [3] is an archival block storage server.
  49. Once a block is stored, it can be retrieved by presenting the 20-byte
  50. SHA1 hash of its contents, called a
  51. .I score .
  52. Blocks on Venti have a maximum length of about 56 kilobytes,
  53. though in practice smaller blocks are used.
  54. To store a byte stream of arbitrary length, Venti uses a hash tree.
  55. Conceptually, the data stream is broken into fixed-size (say,
  56. .I dsize -byte)
  57. chunks, which are stored on the Venti server.
  58. The resulting scores are concatenated into a new pointer stream, which is
  59. broken into fixed size (say,
  60. .I psize -byte)
  61. chunks, which are stored on the Venti server.
  62. .I Psize "" (
  63. is different from
  64. .I dsize
  65. so that we can ensure that each pointer block holds an
  66. integral number of pointers.)
  67. This yields a new pointer stream, and so on, until there is a single block
  68. and finally a single score describing the entire tree.
  69. The resulting structure looks like:
  70. .PS
  71. .ps 8
  72. .vs 10
  73. boxht=0.1
  74. boxwid=0.1
  75. B0: box invis wid 1 "\f(CWVtDataType\fP"
  76. move right 0.1
  77. L0a: box wid 0.2
  78. move right 0.1
  79. L0b: box wid 0.2
  80. move right 0.1
  81. L0c: box invis wid 0.2 "..."
  82. move right 0.1
  83. L0d: box wid 0.2
  84. move right 0.1
  85. L0e: box wid 0.2
  86. move right 0.2
  87. L0f: box invis wid 0.2 "..."
  88. move right 0.2
  89. L0g: box wid 0.2
  90. move right 0.1
  91. L0h: box wid 0.2
  92. move right 0.1
  93. L0i: box invis wid 0.2 "..."
  94. move right 0.1
  95. L0j: box wid 0.2
  96. move right 0.1
  97. L0k: box wid 0.2
  98. move right 0.1
  99. L0l: box invis wid 0.2 "..."
  100. move right 0.1
  101. L0m: box wid 0.2
  102. define boxppddd {
  103. line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
  104. line from 0.4<$1.nw,$1.ne> to 0.4<$1.sw,$1.se>
  105. X: box invis at 0.1<$1.nw,$1.ne>
  106. Y: box invis at 0.1<$1.sw,$1.se>
  107. line -> from 0.5<X,Y> to $2.nw
  108. X: box invis at 0.3<$1.nw,$1.ne>
  109. Y: box invis at 0.3<$1.sw,$1.se>
  110. line -> from 0.5<X,Y> to $3.nw
  111. "..." at 0.7<$1.w,$1.e>
  112. }
  113. define boxppdddp {
  114. line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
  115. line from 0.4<$1.nw,$1.ne> to 0.4<$1.sw,$1.se>
  116. line from 0.8<$1.nw,$1.ne> to 0.8<$1.sw,$1.se>
  117. X: box invis at 0.1<$1.nw,$1.ne>
  118. Y: box invis at 0.1<$1.sw,$1.se>
  119. line -> from 0.5<X,Y> to $2.nw
  120. X: box invis at 0.3<$1.nw,$1.ne>
  121. Y: box invis at 0.3<$1.sw,$1.se>
  122. line -> from 0.5<X,Y> to $3.nw
  123. "..." at 0.6<$1.w,$1.e>
  124. X: box invis at 0.9<$1.nw,$1.ne>
  125. Y: box invis at 0.9<$1.sw,$1.se>
  126. line -> from 0.5<X,Y> to $4.nw
  127. }
  128. define boxpdddp {
  129. line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
  130. line from 0.8<$1.nw,$1.ne> to 0.8<$1.sw,$1.se>
  131. X: box invis at 0.1<$1.nw,$1.ne>
  132. Y: box invis at 0.1<$1.sw,$1.se>
  133. line -> from 0.5<X,Y> to $2.nw
  134. "..." at 0.5<$1.w,$1.e>
  135. X: box invis at 0.9<$1.nw,$1.ne>
  136. Y: box invis at 0.9<$1.sw,$1.se>
  137. line -> from 0.5<X,Y> to $3.nw
  138. }
  139. bhd=0.4
  140. L1abc: box wid 0.5 at 0.5<L0a, L0b>+(0,bhd)
  141. boxppddd(L1abc, L0a, L0b)
  142. L1def: box wid 0.5 at 0.5<L0d, L0e>+(0,bhd)
  143. boxppddd(L1def, L0d, L0e)
  144. L1ghi: box wid 0.5 at 0.5<L0g, L0h>+(0,bhd)
  145. boxppddd(L1ghi, L0g, L0h)
  146. L1jklm: box wid 0.5 at 0.5<L0j, L0k>+(0,bhd)
  147. boxppdddp(L1jklm, L0j, L0k, L0m)
  148. B1: box invis wid 1 "\f(CWVtPointerType0\fP" at B0+(0,bhd)
  149. L2abcdef: box wid 0.5 at 0.5<L1abc,L1def>+(0,bhd)
  150. boxppddd(L2abcdef, L1abc, L1def)
  151. L2ghijklm: box wid 0.5 at 0.5<L1ghi,L1jklm>+(0,bhd)
  152. boxpdddp(L2ghijklm, L1ghi, L1jklm)
  153. B2: box invis wid 1 "\f(CWVtPointerType1\fP" at B1+(0,bhd)
  154. L3atom: box wid 0.5 at 0.5<L2abcdef, L2ghijklm>+(0,bhd)
  155. boxpdddp(L3atom, L2abcdef, L2ghijklm)
  156. B3: box invis wid 1 "\f(CWVtPointerType2\fP" at B2+(0,bhd)
  157. .PE
  158. .LP
  159. The leaves are the original data stream. Those blocks have type
  160. .CW VtDataType .
  161. The first pointer stream has type
  162. .CW VtPointerType0 ,
  163. the next has type
  164. .CW VtPointerType1 ,
  165. and so on.
  166. The figure ends with a single block of type
  167. .CW VtPointerType2 ,
  168. but in general trees can have height up to
  169. .CW VtPointerType6 .
  170. For a
  171. .I dsize
  172. of 8192 bytes
  173. and
  174. .I psize
  175. of 8180 bytes (409 pointers),
  176. this gives a maximum stream size of approximately 10 zettabytes
  177. (2\s-2\u73\d\s+2 or 10\s-2\u22\d\s+2 bytes).
  178. .PP
  179. Data blocks are truncated to remove trailing runs of zeros before
  180. storage to Venti; they are zero-filled back to
  181. .I dsize
  182. bytes after retrieval from Venti.
  183. Similarly, trailing runs of pointers to zero-length blocks are
  184. removed from and added back to pointer blocks.
  185. These simple rules happen to make it particularly efficient to store
  186. large runs of zeros, as might occur in a data stream with ``holes:''
  187. the zero-length block itself can be interpreted as a tree of any
  188. depth encoding an all-zero data stream.
  189. .PP
  190. Reconstructing the data stream requires the score and type of the
  191. topmost block in the tree, the data chunk size, the pointer chunk size,
  192. and the data stream size.
  193. (From the data stream size and the chunk sizes we could derive the
  194. depth of the tree and thus the type of the topmost block, but it is convenient
  195. to allow trees that are deeper than necessary.)
  196. This information is kept in a 40-byte structure called a
  197. .CW VtEntry :
  198. .P1
  199. VtEntry:
  200. .ta +\w' 'u +\w' 'u
  201. gen[4] \fRgeneration number\fP
  202. psize[2] \fRsize of pointer blocks\fP
  203. dsize[2] \fRsize of data blocks\fP
  204. flags[1]
  205. zero[5]
  206. size[6] \fRlength of file\fP
  207. score[20] \fRscore of root block in tree\fP
  208. .P2
  209. (In this notation,
  210. .CW name[sz]
  211. indicates a
  212. .CW sz -byte
  213. field called
  214. .CW name .
  215. Integers are stored in big-endian order.
  216. .CW Size
  217. really is a 48-bit field.)
  218. .CW Flags
  219. is made up of the following bit fields.
  220. .P1
  221. .ta +\w' 'u +\w' 'u
  222. 0x01 VtEntryActive \fRentry is allocated\fP
  223. 0x02 VtEntryDir \fRentry describes a Venti directory (q.v.)\fP
  224. 0x1C VtEntryDepthMask \fRmask for tree depth\fP
  225. 0x20 VtEntryLocal \fRreserved (q.v.)\fP
  226. .P2
  227. .LP
  228. The depth of the described tree is stored in the 3 bits indicated:
  229. a tree with a topmost node of type
  230. .CW VtPointerType3
  231. has depth 4.
  232. .PP
  233. With
  234. .CW VtEntry
  235. we can build more complicated data structures,
  236. ones with multiple or nested data streams.
  237. A data stream consisting of
  238. .CW VtEntry
  239. structures is called a Venti directory.
  240. It is identical in structure to the Venti data stream
  241. we described earlier except that the bottom-level type is
  242. .CW VtDirType ,
  243. and
  244. the
  245. .CW VtEntry
  246. describing a Venti directory has the
  247. .CW VtEntryDir
  248. flag bit set.
  249. The
  250. .I dsize
  251. for a Venti directory
  252. is a multiple of 40 so that each data chunk holds
  253. an integer number of
  254. .CW VtEntry
  255. structures.
  256. By analogy with Venti directories,
  257. we call the original data stream a
  258. Venti file.
  259. Note that Venti files are assumed
  260. .I not
  261. to contain pointers to other Venti blocks.
  262. The only pointers to Venti blocks occur in
  263. .CW VtEntry
  264. structures in
  265. Venti directories
  266. (and in the internal hash tree structure of the
  267. individual files and directories).
  268. Note also that these directories are nothing more than pointer lists.
  269. In particular, there are no names or metadata like in a file system.
  270. .PP
  271. To make it easier to pass hierarchies between applications,
  272. the root of a hierarchy is described in a 300-byte structure
  273. called a
  274. .CW VtRoot :
  275. .P1
  276. VtRoot:
  277. .ta +\w' 'u +\w' 'u
  278. version[2] \f(CW2\fP
  279. name[128] \fRname of structure (just a comment)\fP
  280. type[128] \fRstring describing structure (\f(CWvac\fR)\f(CW
  281. score[20] \fRpointer to \f(CWVtDirType\fP block\f(CW
  282. blockSize[2] \fRmaximum block size in structure\fP
  283. prev[20] \fRprevious \f(CWVtRoot\fP in chain, if any\fP
  284. .P2
  285. .LP
  286. This structure is stored to Venti and its score is passed
  287. between applications, typically in the form
  288. ``\fItype\f(CW:\fIrootscore\fR,''
  289. where
  290. .I type
  291. is the type field from the
  292. .CW VtRoot
  293. structure, and
  294. .I rootscore
  295. is the score of the
  296. .CW VtRoot
  297. block.
  298. .CW VtRoot
  299. structures can be chained together using the
  300. .I prev
  301. field to encode an archival history
  302. of the data structure.
  303. .PP
  304. For example, a small Venti hierarchy might look like:
  305. .PS
  306. .ps 8
  307. .vs 10
  308. boxwid=0.1
  309. boxht=0.1
  310. f=0.9
  311. mb=0.16
  312. VtRoot: [
  313. right
  314. B1: box
  315. move right 0.1
  316. "\f(CWVtRoot\fP" ljust
  317. ]
  318. Root: [
  319. right
  320. B1: box fill f
  321. B2: box fill f
  322. B3: box fill f
  323. move right 0.1
  324. ] with .nw at VtRoot.sw+(0.2,-.1)
  325. Level1: [
  326. RootMeta: [
  327. box wid mb
  328. ]
  329. MetaSource: [
  330. right
  331. B1: box wid 5*mb
  332. ] with .nw at RootMeta.sw+(0,-.1)
  333. Source: [
  334. right
  335. B1: box fill f
  336. B2: box fill f
  337. B3: box fill f
  338. B4: box fill f
  339. B5: box fill f
  340. B6: box fill f
  341. B7: box fill f
  342. B8: box fill f
  343. ] with .nw at MetaSource.sw+(0,-.1)
  344. SB1: box invis at Source.B1
  345. SB2: box invis at Source.B2
  346. SB3: box invis at Source.B3
  347. ] with .nw at Root.sw+(0.4,-.1)
  348. Level2: [
  349. MetaSource: [
  350. right
  351. B1: box wid 5*mb
  352. ]
  353. Source: [
  354. right
  355. B1: box fill f
  356. B2: box fill f
  357. B3: box fill f
  358. B4: box fill f
  359. B5: box fill f
  360. B6: box fill f
  361. B7: box fill f
  362. B8: box fill f
  363. ] with .nw at MetaSource.sw+(0,-.1)
  364. File: box wid 0.8 with .nw at Source.sw+(0,-.1)
  365. ] with .nw at Level1.sw+(0.6,-.1)
  366. line -> from VtRoot.B1 down boxwid/2+0.1+boxwid/2 then to Root.w
  367. line -> from Root.B3 down boxwid/2+0.1+boxwid/2 then to Level1.RootMeta.w
  368. line -> from Root.B2 down boxwid/2+0.1+boxwid+0.1+boxwid/2 then to Level1.MetaSource.w
  369. line -> from Root.B1 down boxwid/2+0.1+boxwid+0.1+boxwid+0.1+boxwid/2 then to Level1.Source.w
  370. line -> from Level1.SB3 down boxwid/2+0.1+boxwid/2 then to Level2.MetaSource.w
  371. line -> from Level1.SB2 down boxwid/2+0.1+boxwid+0.1+boxwid/2 then to Level2.Source.w
  372. line -> from Level1.SB1 down boxwid/2+0.1+boxwid+0.1+boxwid+0.1+boxwid/2 then to Level2.File.w
  373. [
  374. KEY: box wid 1.5 invis "Key"
  375. line from KEY.sw to KEY.se
  376. k = -.1
  377. kk=0.5
  378. A: [
  379. box wid 4*boxwid
  380. "Venti file" ljust with .w at last box .w+(kk,0)
  381. ] with .nw at KEY.sw+(0,2*k)
  382. B: [
  383. box fill f
  384. "Venti entry (\f(CWVtEntry\fP)" ljust with .w at last box .w+(kk,0)
  385. ] with .nw at A.sw+(0,k)
  386. C: [
  387. right
  388. CC: box fill f
  389. box fill f
  390. box fill f
  391. box fill f
  392. "Venti directory" ljust with .w at CC.w+(kk,0)
  393. ] with .nw at B.sw+(0,k)
  394. D: [
  395. line -> right 3*boxwid
  396. "Venti pointer (score)" ljust with .w at last line .w+(kk, 0)
  397. ] with .nw at C.sw+(0,k)
  398. ] with .nw at VtRoot.nw+(3,0)
  399. .PE
  400. .LP
  401. Venti files are shown as white boxes, while directories are shown
  402. as shaded boxes. Each shaded square represents a
  403. .CW VtEntry .
  404. Arrows represent pointers from
  405. .CW VtEntry
  406. structures to other
  407. Venti files or directories.
  408. .PP
  409. The hierarchical structure provided by Venti files and directories
  410. can be used as the base for more complicated data structures.
  411. Because this structure captures all the information
  412. about pointers to other blocks, tools written to traverse
  413. Venti hierarchies can traverse the more complicated
  414. data structures as well.
  415. For example,
  416. .I venti/copy
  417. (see
  418. .I ventiaux (8))
  419. copies a Venti hierarchy from one Venti server to another,
  420. given the root
  421. .CW VtEntry .
  422. Because the traditional file system described in later sections is
  423. layered on a Venti hierarchy,
  424. .I venti/copy
  425. can copy it without fully understanding its structure.
  426. .NH 1
  427. Vac file system format
  428. .HP
  429. The Venti archive format
  430. .I vac
  431. builds a traditional file system using a Venti hierarchy.
  432. Each vac file is implemented as a Venti file;
  433. each vac directory is implemented as a Venti
  434. directory and a Venti file to provide traditional file system metadata.
  435. The metadata is stored in a structure called a
  436. .CW DirEntry :
  437. .P1
  438. DirEntry:
  439. .ta +\w' 'u +\w' 'u
  440. magic[4] \f(CW0x1c4d9072\fP (DirMagic)\fP
  441. version[2] \f(CW9\fP
  442. elem[s] \fRname (final path element only)\fP
  443. entry[4] \fRentry number for Venti file or directory\fP
  444. gen[4] \fRgeneration number\fP
  445. mentry[4] \fRentry number for Venti file holding metadata\fP
  446. mgen[4] \fRgeneration number\fP
  447. qid[8] \fRunique file serial number\fP
  448. uid[s] \fRowner\fP
  449. gid[s] \fRgroup\fP
  450. mid[s] \fRlast modified by\fP
  451. mtime[4] \fRlast modification time\fP
  452. ctime[4] \fRcreation time\fP
  453. atime[4] \fRlast access time\fP
  454. mode[4] \fRmode bits\fP
  455. .P2
  456. The notation
  457. .CW name[s]
  458. denotes a string stored as a two-byte length
  459. and then that many bytes.
  460. The above describes Version 9 of the
  461. .CW DirEntry
  462. format. Versions 7 and 8 are very similar; they can be
  463. read by the current
  464. .I vac
  465. source code but are not written.
  466. Earlier versions were not widespread.
  467. A
  468. .CW DirEntry
  469. may be followed by optional extension sections, though none
  470. are currently used.
  471. The
  472. .CW mode
  473. bits include bits commonly used by
  474. Unix and Windows, in addition to those used by Plan 9.
  475. .PP
  476. The
  477. .CW entry
  478. field is an index into the parallel Venti directory.
  479. The
  480. .CW gen
  481. field must match the
  482. .CW gen
  483. field in the corresponding
  484. .CW VtEntry
  485. in the directory;
  486. it is used to detect
  487. stale indices.
  488. Similarly,
  489. .CW mentry
  490. and
  491. .CW mgen
  492. are the index and generation number
  493. for the metadata Venti file,
  494. if the
  495. .CW DirEntry
  496. describes a vac directory.
  497. .PP
  498. The relation between Venti files and directories and
  499. vac files and directories can be seen in this figure:
  500. .PS
  501. .ps 8
  502. .vs 10
  503. boxwid=0.1
  504. boxht=0.1
  505. f=0.9
  506. mb=0.16
  507. VtRoot: [
  508. right
  509. B1: box
  510. move right 0.1
  511. "\f(CWVtRoot\fP" ljust
  512. ]
  513. SuperRoot: [
  514. right
  515. B1: box fill f
  516. move right 0.1
  517. "fs root block" ljust
  518. ] with .nw at VtRoot.sw + (0.2, -.2)
  519. Root: [
  520. right
  521. B1: box fill f
  522. B2: box fill f
  523. B3: box fill f
  524. move right 0.1
  525. "root directory info block" ljust
  526. ] with .nw at SuperRoot.sw+(0.2, -.2)
  527. Level1: [
  528. RootMeta: [
  529. box wid mb
  530. move right 0.1
  531. "root metadata" ljust
  532. ]
  533. MetaSource: [
  534. right
  535. B1: box wid mb
  536. B2: box wid mb
  537. B3: box wid mb
  538. B4: box wid mb
  539. B5: box wid mb
  540. ] with .nw at RootMeta.sw+(0,-.2)
  541. MB1: box wid mb invis at MetaSource.B1
  542. MB2: box wid mb invis at MetaSource.B2
  543. MB3: box wid mb invis at MetaSource.B3
  544. MB4: box wid mb invis at MetaSource.B4
  545. MB5: box wid mb invis at MetaSource.B5
  546. Source: [
  547. right
  548. B1: box fill f
  549. B2: box fill f
  550. B3: box fill f
  551. B4: box fill f
  552. B5: box fill f
  553. B6: box fill f
  554. B7: box fill f
  555. B8: box fill f
  556. ] with .nw at MetaSource.sw+(0,-.1)
  557. SB1: box invis at Source.B1
  558. SB2: box invis at Source.B2
  559. SB3: box invis at Source.B3
  560. SB4: box invis at Source.B4
  561. SB5: box invis at Source.B5
  562. SB6: box invis at Source.B6
  563. SB7: box invis at Source.B7
  564. SB8: box invis at Source.B8
  565. ] with .nw at Root.sw+(0.4,-.2)
  566. Level2: [
  567. MetaSource: [
  568. right
  569. B1: box wid mb
  570. B2: box wid mb
  571. B3: box wid mb
  572. B4: box wid mb
  573. B5: box wid mb
  574. ]
  575. Source: [
  576. right
  577. B1: box fill f
  578. B2: box fill f
  579. B3: box fill f
  580. B4: box fill f
  581. B5: box fill f
  582. B6: box fill f
  583. B7: box fill f
  584. B8: box fill f
  585. ] with .nw at MetaSource.sw+(0,-.1)
  586. File: box wid 0.8 with .nw at Source.sw+(0,-.2)
  587. ] with .nw at Level1.sw+(0.6,-.2)
  588. line -> from VtRoot.B1 down boxwid/2+0.2+boxwid/2 then to SuperRoot.w
  589. line -> from SuperRoot.B1 down boxwid/2+0.2+boxwid/2 then to Root.w
  590. line -> from Root.B3 down boxwid/2+0.2+boxwid/2 then to Level1.RootMeta.w
  591. line -> from Root.B2 down boxwid/2+0.2+boxwid+0.2+boxwid/2 then to Level1.MetaSource.w
  592. line -> from Root.B1 down boxwid/2+0.2+boxwid+0.1+boxwid+0.2+boxwid/2 then to Level1.Source.w
  593. line -> from Level1.SB3 down boxwid/2+0.2+boxwid/2 then to Level2.MetaSource.w
  594. line -> from Level1.SB2 down boxwid/2+0.2+boxwid+0.1+boxwid/2 then to Level2.Source.w
  595. line -> from Level1.SB1 down boxwid/2+0.2+boxwid+0.1+boxwid+0.2+boxwid/2 then to Level2.File.w
  596. arrowwid = arrowwid/2
  597. arrowht = arrowht/2
  598. line -> from Level1.MB1 to Level1.SB1.n
  599. line -> from Level1.MB2 to Level1.SB2.n
  600. line -> from Level1.MB2 to Level1.SB3.n
  601. line -> from Level1.MB4 to Level1.SB7.n
  602. line -> from Level1.MB5 to Level1.SB5.n
  603. arrowwid = arrowwid * 2
  604. arrowht = arrowht * 2
  605. box dashed with .nw at Level1.MetaSource.nw+(-.05,.05) wid 0.8+.05*2 ht .3+.05*2
  606. box dashed with .nw at Level2.MetaSource.nw+(-.05,.05) wid 0.8+.05*2 ht .3+.05*2
  607. box dotted with .nw at Level2.File.nw+(-.05,.05) wid 0.8+0.05*2 ht .1+.05*2
  608. [
  609. KEY: box wid 1.5 invis "Key"
  610. line from KEY.sw to KEY.se
  611. k = -.1
  612. kk=0.5
  613. A: [
  614. box wid 4*boxwid
  615. "Venti file" ljust with .w at last box .w+(kk,0)
  616. ] with .nw at KEY.sw+(0,2*k)
  617. B: [
  618. box fill f
  619. "Venti entry (\f(CWEntry\fP)" ljust with .w at last box .w+(kk,0)
  620. ] with .nw at A.sw+(0,k)
  621. C: [
  622. right
  623. CC: box fill f
  624. box fill f
  625. box fill f
  626. box fill f
  627. "Venti directory" ljust with .w at CC.w+(kk,0)
  628. ] with .nw at B.sw+(0,k)
  629. D: [
  630. line -> right 3*boxwid
  631. "Venti pointer (score)" ljust with .w at last line .w+(kk, 0)
  632. ] with .nw at C.sw+(0,k)
  633. DD: [
  634. box dotted wid 4*boxwid
  635. "Vac file" ljust with .w at last box .w+(kk,0)
  636. ] with .nw at D.sw+(0,k)
  637. E: [
  638. box wid mb
  639. "Vac entry (\f(CWDirEntry\fP)" ljust with .w at last box .w+(kk,0)
  640. ] with .nw at DD.sw+(0,k)
  641. G: [
  642. box dashed wid 4*boxwid
  643. "Vac directory" ljust with .w at last box .w+(kk,0)
  644. ] with .nw at E.sw+(0,k)
  645. H: [
  646. arrowwid = arrowwid/2
  647. arrowht = arrowht/2
  648. line -> right 1.5*boxwid
  649. "Vac pointer (integer index)" ljust with .w at last line .w+(kk, 0)
  650. arrowwid = arrowwid * 2
  651. arrowht = arrowht * 2
  652. ] with .nw at G.sw+(0,k)
  653. ] with .nw at VtRoot.nw+(3,0)
  654. .PE
  655. .LP
  656. In reality, the story is slightly more complicated.
  657. The metadata file in a Vac directory
  658. is not just the concatenation of
  659. .CW DirEntry
  660. structures.
  661. Instead, it is the concatenation of
  662. .CW MetaBlocks .
  663. A
  664. .CW MetaBlock
  665. contains some number of
  666. .CW DirEntry
  667. structures along with a sorted index to make it easy
  668. to look for a particular
  669. .CW DirEntry
  670. by its
  671. .CW elem
  672. field.
  673. The details are in the source code.
  674. .PP
  675. As shown in the diagram,
  676. the root directory of the file system is summarized by
  677. three
  678. .CW VtEntry
  679. structures describing
  680. the Venti directory for the children of the root,
  681. the Venti file for the metadata describing the children of the root,
  682. and a Venti file holding metadata for the root directory itself.
  683. These
  684. .CW VtEntry
  685. structures are placed in a Venti directory of their own,
  686. described by the single
  687. .CW VtEntry
  688. in the
  689. root block.
  690. .NH 1
  691. Fossil file system format
  692. .HP
  693. Fossil uses the vac format, with some small changes.
  694. The changes only affect the data on the local disk; the data
  695. archived to Venti is exactly in vac format.
  696. .PP
  697. Blocks stored on local disk may contain scores pointing at local disk
  698. blocks or at Venti blocks.
  699. Local block addresses are stored as 20-byte scores in which the first 16 bytes
  700. are all zero and the last 4 bytes specify a block number in the disk.
  701. Before a block is archived, all the
  702. blocks it points to must be archived, and the local scores in the block
  703. must be changed to Venti scores.
  704. Using block addresses rather than content hashes for local data
  705. makes the local file system easier to manage: if a local block's contents
  706. change, the pointer to the block does not need to change.
  707. .NH 2
  708. Snapshots
  709. .HP
  710. Fossil is an archival file server.
  711. It takes periodic snapshots of the file system,
  712. which are made accessible through the file system.
  713. Specifically, the active file system is presented in
  714. .CW /active .
  715. Ephemeral snapshots (those that are kept on local disk and eventually deleted)
  716. are presented in
  717. \f(CW/snapshot/\fIyyyy\f(CW/\fImmdd\f(CW/\fIhhmm\fR,
  718. where
  719. .I yyyy
  720. is the full year,
  721. .I mm
  722. is the month number,
  723. .I dd
  724. is the day number,
  725. .I hh
  726. is the hour,
  727. and
  728. .I mm
  729. is the minute.
  730. Archival snapshots (those that are archived to Venti and persist forever)
  731. are presented in
  732. \f(CW/archive/\fIyyyy\f(CW/\fImmdds\fR,
  733. where
  734. .I yyyy ,
  735. .I mm ,
  736. and
  737. .I dd
  738. are year, month, and day as before,
  739. and
  740. .I s
  741. is a sequence number if more than one
  742. archival snapshot is done in a day.
  743. For the first snapshot,
  744. .I s
  745. is null.
  746. For the subsequent snapshots,
  747. .I s
  748. is
  749. .CW .1 ,
  750. .CW .2 ,
  751. .CW .3 ,
  752. etc.
  753. .PP
  754. To implement the snapshots, the file server maintains a
  755. current
  756. .I epoch
  757. for the active file system.
  758. Each local block has a label that records, among other things,
  759. the epoch in which the block was allocated.
  760. If a block was allocated in an epoch earlier than the current one,
  761. it is immutable and treated as copy-on-write.
  762. Taking a snapshot can be accomplished by
  763. recording the address of the current root block and then
  764. incrementing the epoch number.
  765. Notice that the copy-on-write method makes
  766. snapshots both time efficient and space efficient.
  767. The only time cost is waiting for all current file system
  768. requests to finish and then incrementing a counter.
  769. After a snapshot, blocks only get copied when they are
  770. next modified, so the per-snapshot
  771. space requirement is proportional
  772. to the amount of new data rather than the total
  773. size of the file system.
  774. .PP
  775. The blocks in the archival snapshots are moved to Venti,
  776. but the blocks in the ephemeral snapshots take up space
  777. in the local disk file.
  778. To allow reclamation of this disk space, the file system
  779. maintains a
  780. .I low
  781. .I epoch ,
  782. which is the epoch of the earliest ephemeral snapshot
  783. still available.
  784. Fossil only allows access to snapshots with epoch numbers
  785. between the
  786. low epoch and the current epoch
  787. (also called the high epoch).
  788. Incrementing the low epoch thus makes old
  789. snapshots inaccessible.
  790. The space required to store those snapshots can then
  791. be reclaimed, as described below.
  792. .NH 2
  793. Local blocks
  794. .HP
  795. The bulk of the local disk file is the local blocks.
  796. Each block has a 14-byte label associated with it, of the format:
  797. .P1
  798. Label:
  799. .ta +\w' 'u +\w' 'u
  800. state[1] \fRblock state\fP
  801. type[1] \fRblock type\fP
  802. epoch[4] \fRallocation epoch\fP
  803. epochClose[4] \fRclose epoch\fP
  804. tag[4] \fRrandom tag\fP
  805. .P2
  806. .LP
  807. The
  808. .CW type
  809. is an analogue of the block types described earlier,
  810. though different names are used, to distinguish between
  811. pointers blocks in a hash tree for a data stream
  812. and pointer blocks for a directory stream.
  813. The
  814. .CW epoch
  815. was mentioned in the last section.
  816. The other fields are explained below.
  817. .PP
  818. There are two distinguished blocks states
  819. .CW BsFree
  820. .CW 0x00 ) (
  821. and
  822. .CW BsBad
  823. .CW 0xFF ), (
  824. which mark blocks that are available for allocation
  825. and blocks that are bad and should be avoided.
  826. If
  827. .CW state
  828. is not one of these values, it is a bitwise
  829. .I or ' `
  830. of the following flags:
  831. .P1
  832. .ta +\w' 'u +\w' 'u
  833. 0x01 BsAlloc \fRblock is in use\fP
  834. 0x02 BsCopied \fRblock has been copied\fP
  835. 0x04 BsVenti \fRblock has been stored on Venti\fP
  836. 0x08 BsClosed \fRblock has been unlinked from active file system\fP
  837. .P2
  838. .LP
  839. The flags are explained as they arise in the discussions below.
  840. .PP
  841. It is convenient to store some extra fields in the
  842. .CW VtEntry
  843. structure when it describes a Venti file or directory
  844. stored on local disk.
  845. Specifically, we set the
  846. .CW VtEntryLocal
  847. flag bit
  848. and then use the bytes 7-16 of the score (which would
  849. otherwise be zero, since it is a local score) to hold these fields:
  850. .P1
  851. .ta +\w' 'u +\w' 'u
  852. archive[1] \fRboolean: this is an archival snapshot\fP
  853. snap[4] \fRepoch number if root of snapshot\fP
  854. tag[4] \fRrandom tag\fP
  855. .P2
  856. .LP
  857. The extended
  858. .CW VtEntry
  859. structure is called an
  860. .CW Entry .
  861. The
  862. .CW tag
  863. field
  864. in the
  865. .CW Label
  866. and the
  867. .CW Entry
  868. is used to identify dangling pointers or other file system corruption:
  869. all the local blocks in a hash tree must
  870. have tags matching the tag in the
  871. .CW Entry .
  872. If this
  873. .CW Entry
  874. points at the root of a snapshot,
  875. the
  876. .CW snap
  877. field is the epoch of the snapshot.
  878. If the snapshot is intended to be archived to Venti,
  879. the
  880. .CW archive
  881. field is non-zero.
  882. .NH 2
  883. Block reclamation
  884. .HP
  885. The blocks in the active file system form a tree: each
  886. block has only one parent.
  887. Once a copy-on-write block
  888. .I b
  889. is replaced by its copy, it is no longer
  890. needed by the active file system.
  891. At this point,
  892. .I b
  893. is unlinked from the active file system.
  894. We say that
  895. .I b
  896. is now
  897. .I closed :
  898. it is needed only for snapshots.
  899. When a block is closed, the
  900. .CW BsClosed
  901. bit is set in its state, and the current epoch (called the block's closing epoch)
  902. is stored in the
  903. .CW epochClose
  904. label field.
  905. (Open blocks have an
  906. .CW epochClose
  907. of
  908. .CW ~0 ).
  909. .PP
  910. A block is referenced by snapshots with epochs
  911. between the block's allocation epoch and its closing epoch.
  912. Once the file system's low epoch grows to be greater than or equal to the block's
  913. closing epoch, the block is no longer needed for any snapshots
  914. and can be reused.
  915. .PP
  916. In a typical configuration, where nightly archival snapshots
  917. are taken and written to Venti, it is desirable to reclaim
  918. the space occupied by now-archived blocks if possible.
  919. To do this, Fossil keeps track of whether the pointers
  920. in each block are unique to that block.
  921. When a block
  922. .I bb
  923. is allocated, a pointer to
  924. .I bb
  925. is written into exactly one active block (say,
  926. .I b ).
  927. In the absence of snapshots, the pointer to
  928. .I bb
  929. will remain unique to
  930. .I b ,
  931. so that if the pointer is zeroed,
  932. .I bb
  933. can be immediately reused.
  934. Snapshots complicate this invariant:
  935. when
  936. .I b
  937. is copied-on-write, all its pointers
  938. are no longer unique to it.
  939. At time of the copy, the
  940. .CW BsCopied
  941. state bit in the block's label
  942. is set to note the duplication of the pointers contained within.
  943. .NH 2
  944. Disk layout
  945. .HP
  946. The file system header describes the file system layout and has this format:
  947. .P1
  948. .ta +\w' 'u +\w' 'u
  949. Header:
  950. magic[4] \fR0x3776AE89 (HeaderMagic)\fP
  951. version[2] \fR1 (HeaderVersion)\fP
  952. blockSize[2] \fIfile system block size\fP
  953. super[4] \fRblock offset of super block\fP
  954. label[4] \fRblock offset of labels\fP
  955. data[4] \fRdata blocks\fP
  956. end[4] \fRend of file system\fP
  957. .P2
  958. .LP
  959. The corresponding file system layout is:
  960. .PS
  961. .ps 8
  962. .vs 9
  963. boxwid=0.75
  964. boxht=0.15
  965. Empty: box "empty" ht 0.25
  966. Header: box "header" with .n at Empty.s
  967. Empty2: box "empty" with .n at Header.s
  968. Super: box "super block" with .n at Empty2.s
  969. Label: box "label" "blocks" with .n at Super.s ht 0.25
  970. Data: box "data" "blocks" with .n at Label.s ht 0.3
  971. " 0" ljust at Empty.ne
  972. " 128kB" ljust at Header.ne
  973. " \f5super\fP \(mu \f(CWblockSize\fP" ljust at Super.ne
  974. " \f5label\fP \(mu \f(CWblockSize\fP" ljust at Label.ne
  975. " \f5data\fP \(mu \f(CWblockSize\fP" ljust at Data.ne
  976. " \f5end\fP \(mu \f(CWblockSize\fP" ljust at Data.se
  977. "" at (-1,0)
  978. "" at (6,0)
  979. .PE
  980. .LP
  981. The numbers to the right of the blocks are byte offsets
  982. of the boundaries.
  983. .LP
  984. The super block describes the file system itself and looks like:
  985. .P1
  986. .ta +\w' 'u +\w' 'u
  987. Super:
  988. magic[4] \fR0x2340A3B1 (SuperMagic)\fP
  989. version[2] \fR1 (SuperVersion)\fP
  990. epochLow[4] \fRfile system low epoch\fP
  991. epochHigh[4] \fRfile system high (active) epoch\fP
  992. qid[8] \fRnext qid to allocate\fP
  993. active[4] \fRdata block number: root of active file system\fP
  994. next[4] \fRdata block number: root of next file system to archive\fP
  995. current[4] \fRdata block number: root of file system currently being archived\fP
  996. last[20] \fRVenti score of last successful archive\fP
  997. name[128] \fRname of file system (just a comment)\fP
  998. .P2
  999. .LP
  1000. .NH 1
  1001. Fossil server
  1002. .HP
  1003. The Fossil server is a user-space program that runs on a standard Plan 9 kernel.
  1004. .NH 2
  1005. Process structure
  1006. .PP
  1007. The file server is structured as a set of processes synchronizing
  1008. mostly through message passing along queues.
  1009. The processes are given names, which can be seen in the output of
  1010. .CW ps
  1011. .CW -a .
  1012. .PP
  1013. .CW Listen
  1014. processes announce on various network addresses.
  1015. A
  1016. .CW con
  1017. process handles each incoming connection, reading 9P requests
  1018. and adding them to a central message queue.
  1019. .CW Msg
  1020. processes remove 9P requests from the queue,
  1021. handle them, and write the responses to the appropriate
  1022. file descriptors.
  1023. .PP
  1024. The
  1025. .CW disk
  1026. process handles disk I/O requests made by the other processes.
  1027. The
  1028. .CW flush
  1029. process writes dirty blocks from the in-memory block cache to disk.
  1030. The
  1031. .CW unlink
  1032. process frees previously linked blocks once the blocks that point at them
  1033. have been written to disk.
  1034. .PP
  1035. A
  1036. .CW consI
  1037. reads from each console file (typically a pipe posted in
  1038. .CW /srv ),
  1039. adding the typed characters to the input queue.
  1040. The
  1041. .CW cons
  1042. process echoes input and runs the commands, saving
  1043. output in a ring buffer.
  1044. Because there is only one
  1045. .CW cons
  1046. process, only one console command may be executing at a time.
  1047. A
  1048. .CW consO
  1049. process copies this ring buffer to each console file.
  1050. .PP
  1051. The
  1052. .CW periodic
  1053. process runs periodic events, like
  1054. flushing the root metadata to disk or
  1055. taking snapshots of the file system.
  1056. .NH 2
  1057. Block cache
  1058. .HP
  1059. Fossil maintains an in-memory block cache which
  1060. holds both local disk blocks and Venti blocks.
  1061. Cache eviction follows a least recently used policy.
  1062. Dirty blocks are restricted to at most half the cache.
  1063. This can be changed by editing
  1064. .CW DirtyPercentage
  1065. in
  1066. .CW dat.h .
  1067. .PP
  1068. The block cache uses soft updates [1] to ensure that the on-disk
  1069. file system is always self-consistent.
  1070. Thus there is no
  1071. .I halt
  1072. console command
  1073. and no need to check a file system
  1074. that was shut down without halting.
  1075. .NH 2
  1076. Archiving
  1077. .HP
  1078. A background process writes blocks in archival snapshots to Venti.
  1079. Although
  1080. .CW /archive/\fIyyyy\fP/\fImmdds\fR
  1081. is a copy of only
  1082. .CW /active
  1083. at the time of the snapshot,
  1084. the archival process archives the
  1085. entire file tree rather than just
  1086. the subtree rooted at
  1087. .CW /active .
  1088. The snapshots
  1089. .CW /snapshot/\fIyyyy\fP/\fImmdd\fP/\fIhhmm
  1090. are stored as empty directories.
  1091. Once all the blocks have been archived,
  1092. a
  1093. .CW VtRoot
  1094. header for the file system is archived.
  1095. The score of that header is recorded in
  1096. .CW super.score
  1097. and also printed on the file server console.
  1098. The score can used by
  1099. .I flfmt
  1100. to restore a file system (see
  1101. .I fossil (4)).
  1102. .NH 2
  1103. Contrast with the old file server
  1104. .HP
  1105. The most obvious difference between Fossil and the
  1106. old Plan 9 file server [2] is that Fossil uses a Venti server as
  1107. its archival storage in place of a WORM juke box.
  1108. There are a few other architectural differences to be
  1109. aware of.
  1110. .PP
  1111. Fossil is a user-level program run on a standard kernel.
  1112. .PP
  1113. Fossil does not have any way to concatenate, stripe, or
  1114. mirror disk files. For functionality similar to the old file server's
  1115. configuration strings, use the experimental file stack device
  1116. (see
  1117. .I fs (3)).
  1118. .PP
  1119. Fossil speaks only 9P2000. Old 9P (aka 9P1) is not supported.
  1120. .PP
  1121. ... XXX words about converting an old file system to fossil?
  1122. .NH 1
  1123. References
  1124. .LP
  1125. [1] Gregory R. Ganger, Marshall Kirk McKusick, Craig A. N. Soules,
  1126. and Yale N. Patt.
  1127. ``Soft Updates: A Solution to the Metadata Update Problem
  1128. in File Systems,''
  1129. .I "ACM Transactions on Computer Systems" ,
  1130. Vol 18., No. 2, May 2000, pp. 127\-153.
  1131. .LP
  1132. [2] Sean Quinlan, ``A Cached WORM File System,''
  1133. .I "Software\(emPractice and Experience" ,
  1134. Vol 21., No 12., December 1991, pp. 1289\-1299.
  1135. .LP
  1136. [3] Sean Quinlan and Sean Dorward, ``Venti: A New Approach to Archival Storage,''
  1137. .I "Usenix Conference on File and Storage Technologies" ,
  1138. 2002.