fossil.ms 31 KB

12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989910010110210310410510610710810911011111211311411511611711811912012112212312412512612712812913013113213313413513613713813914014114214314414514614714814915015115215315415515615715815916016116216316416516616716816917017117217317417517617717817918018118218318418518618718818919019119219319419519619719819920020120220320420520620720820921021121221321421521621721821922022122222322422522622722822923023123223323423523623723823924024124224324424524624724824925025125225325425525625725825926026126226326426526626726826927027127227327427527627727827928028128228328428528628728828929029129229329429529629729829930030130230330430530630730830931031131231331431531631731831932032132232332432532632732832933033133233333433533633733833934034134234334434534634734834935035135235335435535635735835936036136236336436536636736836937037137237337437537637737837938038138238338438538638738838939039139239339439539639739839940040140240340440540640740840941041141241341441541641741841942042142242342442542642742842943043143243343443543643743843944044144244344444544644744844945045145245345445545645745845946046146246346446546646746846947047147247347447547647747847948048148248348448548648748848949049149249349449549649749849950050150250350450550650750850951051151251351451551651751851952052152252352452552652752852953053153253353453553653753853954054154254354454554654754854955055155255355455555655755855956056156256356456556656756856957057157257357457557657757857958058158258358458558658758858959059159259359459559659759859960060160260360460560660760860961061161261361461561661761861962062162262362462562662762862963063163263363463563663763863964064164264364464564664764864965065165265365465565665765865966066166266366466566666766866967067167267367467567667767867968068168268368468568668768868969069169269369469569669769869970070170270370470570670770870971071171271371471571671771871972072172272372472572672772872973073173273373473573673773873974074174274374474574674774874975075175275375475575675775875976076176276376476576676776876977077177277377477577677777877978078178278378478578678778878979079179279379479579679779879980080180280380480580680780880981081181281381481581681781881982082182282382482582682782882983083183283383483583683783883984084184284384484584684784884985085185285385485585685785885986086186286386486586686786886987087187287387487587687787887988088188288388488588688788888989089189289389489589689789889990090190290390490590690790890991091191291391491591691791891992092192292392492592692792892993093193293393493593693793893994094194294394494594694794894995095195295395495595695795895996096196296396496596696796896997097197297397497597697797897998098198298398498598698798898999099199299399499599699799899910001001100210031004100510061007100810091010101110121013101410151016101710181019102010211022102310241025102610271028102910301031103210331034103510361037103810391040104110421043104410451046104710481049105010511052105310541055105610571058105910601061106210631064106510661067106810691070107110721073107410751076107710781079108010811082108310841085108610871088108910901091109210931094109510961097109810991100110111021103110411051106110711081109111011111112111311141115111611171118111911201121112211231124112511261127112811291130113111321133113411351136113711381139114011411142114311441145114611471148114911501151115211531154115511561157115811591160116111621163
  1. .HTML "Fossil, an Archival File Server
  2. ... .FP times
  3. ... .fp 1 R R.nomath
  4. ... .fp 5 CW LucidaSansCW83
  5. .TL
  6. Fossil, an Archival File Server
  7. .AU
  8. Sean Quinlan
  9. Jim McKie
  10. Russ Cox
  11. jmk,rsc@plan9.bell-labs.com
  12. .AB
  13. This paper describes the internals and
  14. operation of Fossil, an archival file server built for Plan 9.
  15. Fossil has not yet replaced the current Plan 9 file server
  16. and
  17. .CW kfs ,
  18. but that is our eventual intent.
  19. Both fossil and this documentation are
  20. works in progress. Comments on either are most welcome.
  21. .AE
  22. .de HP
  23. .LP
  24. ..
  25. .NH 1
  26. Introduction
  27. .HP
  28. Fossil is an archival file server built for Plan 9.
  29. In a typical configuration, it maintains a traditional file system
  30. in a local disk partition and periodically archives snapshots of the file system
  31. to a Venti server. These archives are made available through
  32. a file system interface.
  33. Fossil can also be run without a Venti server, in which case the
  34. snapshots (if any) occupy local disk space.
  35. .PP
  36. The bulk of this paper explains the underlying data structures:
  37. Venti trees, the Venti archival file system format, and finally Fossil's
  38. file system format.
  39. The end of the paper discusses the architecture of the Fossil server.
  40. .PP
  41. The presentation of the data structures is very detailed, perhaps
  42. too detailed for most readers.
  43. The intent is to record all the details necessary to make structural
  44. changes to the file system format.
  45. Feel free to jump ahead when boredom strikes.
  46. .NH 1
  47. Venti trees and directory hierarchies
  48. .HP
  49. Venti [3] is an archival block storage server.
  50. Once a block is stored, it can be retrieved by presenting the 20-byte
  51. SHA1 hash of its contents, called a
  52. .I score .
  53. Blocks on Venti have a maximum length of about 56 kilobytes,
  54. though in practice smaller blocks are used.
  55. To store a byte stream of arbitrary length, Venti uses a hash tree.
  56. Conceptually, the data stream is broken into fixed-size (say,
  57. .I dsize -byte)
  58. chunks, which are stored on the Venti server.
  59. The resulting scores are concatenated into a new pointer stream, which is
  60. broken into fixed size (say,
  61. .I psize -byte)
  62. chunks, which are stored on the Venti server.
  63. .I Psize "" (
  64. is different from
  65. .I dsize
  66. so that we can ensure that each pointer block holds an
  67. integral number of pointers.)
  68. This yields a new pointer stream, and so on, until there is a single block
  69. and finally a single score describing the entire tree.
  70. The resulting structure looks like:
  71. .PS
  72. .ps 8
  73. .vs 10
  74. boxht=0.1
  75. boxwid=0.1
  76. B0: box invis wid 1 "\f(CWVtDataType\fP"
  77. move right 0.1
  78. L0a: box wid 0.2
  79. move right 0.1
  80. L0b: box wid 0.2
  81. move right 0.1
  82. L0c: box invis wid 0.2 "..."
  83. move right 0.1
  84. L0d: box wid 0.2
  85. move right 0.1
  86. L0e: box wid 0.2
  87. move right 0.2
  88. L0f: box invis wid 0.2 "..."
  89. move right 0.2
  90. L0g: box wid 0.2
  91. move right 0.1
  92. L0h: box wid 0.2
  93. move right 0.1
  94. L0i: box invis wid 0.2 "..."
  95. move right 0.1
  96. L0j: box wid 0.2
  97. move right 0.1
  98. L0k: box wid 0.2
  99. move right 0.1
  100. L0l: box invis wid 0.2 "..."
  101. move right 0.1
  102. L0m: box wid 0.2
  103. define boxppddd {
  104. line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
  105. line from 0.4<$1.nw,$1.ne> to 0.4<$1.sw,$1.se>
  106. X: box invis at 0.1<$1.nw,$1.ne>
  107. Y: box invis at 0.1<$1.sw,$1.se>
  108. line -> from 0.5<X,Y> to $2.nw
  109. X: box invis at 0.3<$1.nw,$1.ne>
  110. Y: box invis at 0.3<$1.sw,$1.se>
  111. line -> from 0.5<X,Y> to $3.nw
  112. "..." at 0.7<$1.w,$1.e>
  113. }
  114. define boxppdddp {
  115. line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
  116. line from 0.4<$1.nw,$1.ne> to 0.4<$1.sw,$1.se>
  117. line from 0.8<$1.nw,$1.ne> to 0.8<$1.sw,$1.se>
  118. X: box invis at 0.1<$1.nw,$1.ne>
  119. Y: box invis at 0.1<$1.sw,$1.se>
  120. line -> from 0.5<X,Y> to $2.nw
  121. X: box invis at 0.3<$1.nw,$1.ne>
  122. Y: box invis at 0.3<$1.sw,$1.se>
  123. line -> from 0.5<X,Y> to $3.nw
  124. "..." at 0.6<$1.w,$1.e>
  125. X: box invis at 0.9<$1.nw,$1.ne>
  126. Y: box invis at 0.9<$1.sw,$1.se>
  127. line -> from 0.5<X,Y> to $4.nw
  128. }
  129. define boxpdddp {
  130. line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
  131. line from 0.8<$1.nw,$1.ne> to 0.8<$1.sw,$1.se>
  132. X: box invis at 0.1<$1.nw,$1.ne>
  133. Y: box invis at 0.1<$1.sw,$1.se>
  134. line -> from 0.5<X,Y> to $2.nw
  135. "..." at 0.5<$1.w,$1.e>
  136. X: box invis at 0.9<$1.nw,$1.ne>
  137. Y: box invis at 0.9<$1.sw,$1.se>
  138. line -> from 0.5<X,Y> to $3.nw
  139. }
  140. bhd=0.4
  141. L1abc: box wid 0.5 at 0.5<L0a, L0b>+(0,bhd)
  142. boxppddd(L1abc, L0a, L0b)
  143. L1def: box wid 0.5 at 0.5<L0d, L0e>+(0,bhd)
  144. boxppddd(L1def, L0d, L0e)
  145. L1ghi: box wid 0.5 at 0.5<L0g, L0h>+(0,bhd)
  146. boxppddd(L1ghi, L0g, L0h)
  147. L1jklm: box wid 0.5 at 0.5<L0j, L0k>+(0,bhd)
  148. boxppdddp(L1jklm, L0j, L0k, L0m)
  149. B1: box invis wid 1 "\f(CWVtPointerType0\fP" at B0+(0,bhd)
  150. L2abcdef: box wid 0.5 at 0.5<L1abc,L1def>+(0,bhd)
  151. boxppddd(L2abcdef, L1abc, L1def)
  152. L2ghijklm: box wid 0.5 at 0.5<L1ghi,L1jklm>+(0,bhd)
  153. boxpdddp(L2ghijklm, L1ghi, L1jklm)
  154. B2: box invis wid 1 "\f(CWVtPointerType1\fP" at B1+(0,bhd)
  155. L3atom: box wid 0.5 at 0.5<L2abcdef, L2ghijklm>+(0,bhd)
  156. boxpdddp(L3atom, L2abcdef, L2ghijklm)
  157. B3: box invis wid 1 "\f(CWVtPointerType2\fP" at B2+(0,bhd)
  158. .PE
  159. .LP
  160. The leaves are the original data stream. Those blocks have type
  161. .CW VtDataType .
  162. The first pointer stream has type
  163. .CW VtPointerType0 ,
  164. the next has type
  165. .CW VtPointerType1 ,
  166. and so on.
  167. The figure ends with a single block of type
  168. .CW VtPointerType2 ,
  169. but in general trees can have height up to
  170. .CW VtPointerType6 .
  171. For a
  172. .I dsize
  173. of 8192 bytes
  174. and
  175. .I psize
  176. of 8180 bytes (409 pointers),
  177. this gives a maximum stream size of approximately 10 zettabytes
  178. (2\s-2\u73\d\s+2 or 10\s-2\u22\d\s+2 bytes).
  179. .PP
  180. Data blocks are truncated to remove trailing runs of zeros before
  181. storage to Venti; they are zero-filled back to
  182. .I dsize
  183. bytes after retrieval from Venti.
  184. Similarly, trailing runs of pointers to zero-length blocks are
  185. removed from and added back to pointer blocks.
  186. These simple rules happen to make it particularly efficient to store
  187. large runs of zeros, as might occur in a data stream with ``holes:''
  188. the zero-length block itself can be interpreted as a tree of any
  189. depth encoding an all-zero data stream.
  190. .PP
  191. Reconstructing the data stream requires the score and type of the
  192. topmost block in the tree, the data chunk size, the pointer chunk size,
  193. and the data stream size.
  194. (From the data stream size and the chunk sizes we could derive the
  195. depth of the tree and thus the type of the topmost block, but it is convenient
  196. to allow trees that are deeper than necessary.)
  197. This information is kept in a 40-byte structure called a
  198. .CW VtEntry :
  199. .P1
  200. VtEntry:
  201. .ta +\w' 'u +\w' 'u
  202. gen[4] \fRgeneration number\fP
  203. psize[2] \fRsize of pointer blocks\fP
  204. dsize[2] \fRsize of data blocks\fP
  205. flags[1]
  206. zero[5]
  207. size[6] \fRlength of file\fP
  208. score[20] \fRscore of root block in tree\fP
  209. .P2
  210. (In this notation,
  211. .CW name[sz]
  212. indicates a
  213. .CW sz -byte
  214. field called
  215. .CW name .
  216. Integers are stored in big-endian order.
  217. .CW Size
  218. really is a 48-bit field.)
  219. .CW Flags
  220. is made up of the following bit fields.
  221. .P1
  222. .ta +\w' 'u +\w' 'u
  223. 0x01 VtEntryActive \fRentry is allocated\fP
  224. 0x02 VtEntryDir \fRentry describes a Venti directory (q.v.)\fP
  225. 0x1C VtEntryDepthMask \fRmask for tree depth\fP
  226. 0x20 VtEntryLocal \fRreserved (q.v.)\fP
  227. .P2
  228. .LP
  229. The depth of the described tree is stored in the 3 bits indicated:
  230. a tree with a topmost node of type
  231. .CW VtPointerType3
  232. has depth 4.
  233. .PP
  234. With
  235. .CW VtEntry
  236. we can build more complicated data structures,
  237. ones with multiple or nested data streams.
  238. A data stream consisting of
  239. .CW VtEntry
  240. structures is called a Venti directory.
  241. It is identical in structure to the Venti data stream
  242. we described earlier except that the bottom-level type is
  243. .CW VtDirType ,
  244. and
  245. the
  246. .CW VtEntry
  247. describing a Venti directory has the
  248. .CW VtEntryDir
  249. flag bit set.
  250. The
  251. .I dsize
  252. for a Venti directory
  253. is a multiple of 40 so that each data chunk holds
  254. an integer number of
  255. .CW VtEntry
  256. structures.
  257. By analogy with Venti directories,
  258. we call the original data stream a
  259. Venti file.
  260. Note that Venti files are assumed
  261. .I not
  262. to contain pointers to other Venti blocks.
  263. The only pointers to Venti blocks occur in
  264. .CW VtEntry
  265. structures in
  266. Venti directories
  267. (and in the internal hash tree structure of the
  268. individual files and directories).
  269. Note also that these directories are nothing more than pointer lists.
  270. In particular, there are no names or metadata like in a file system.
  271. .PP
  272. To make it easier to pass hierarchies between applications,
  273. the root of a hierarchy is described in a 300-byte structure
  274. called a
  275. .CW VtRoot :
  276. .P1
  277. VtRoot:
  278. .ta +\w' 'u +\w' 'u
  279. version[2] \f(CW2\fP
  280. name[128] \fRname of structure (just a comment)\fP
  281. type[128] \fRstring describing structure (\f(CWvac\fR)\f(CW
  282. score[20] \fRpointer to \f(CWVtDirType\fP block\f(CW
  283. blockSize[2] \fRmaximum block size in structure\fP
  284. prev[20] \fRprevious \f(CWVtRoot\fP in chain, if any\fP
  285. .P2
  286. .LP
  287. This structure is stored to Venti and its score is passed
  288. between applications, typically in the form
  289. ``\fItype\f(CW:\fIrootscore\fR,''
  290. where
  291. .I type
  292. is the type field from the
  293. .CW VtRoot
  294. structure, and
  295. .I rootscore
  296. is the score of the
  297. .CW VtRoot
  298. block.
  299. .CW VtRoot
  300. structures can be chained together using the
  301. .I prev
  302. field to encode an archival history
  303. of the data structure.
  304. .PP
  305. For example, a small Venti hierarchy might look like:
  306. .PS
  307. .ps 8
  308. .vs 10
  309. boxwid=0.1
  310. boxht=0.1
  311. f=0.9
  312. mb=0.16
  313. VtRoot: [
  314. right
  315. B1: box
  316. move right 0.1
  317. "\f(CWVtRoot\fP" ljust
  318. ]
  319. Root: [
  320. right
  321. B1: box fill f
  322. B2: box fill f
  323. B3: box fill f
  324. move right 0.1
  325. ] with .nw at VtRoot.sw+(0.2,-.1)
  326. Level1: [
  327. RootMeta: [
  328. box wid mb
  329. ]
  330. MetaSource: [
  331. right
  332. B1: box wid 5*mb
  333. ] with .nw at RootMeta.sw+(0,-.1)
  334. Source: [
  335. right
  336. B1: box fill f
  337. B2: box fill f
  338. B3: box fill f
  339. B4: box fill f
  340. B5: box fill f
  341. B6: box fill f
  342. B7: box fill f
  343. B8: box fill f
  344. ] with .nw at MetaSource.sw+(0,-.1)
  345. SB1: box invis at Source.B1
  346. SB2: box invis at Source.B2
  347. SB3: box invis at Source.B3
  348. ] with .nw at Root.sw+(0.4,-.1)
  349. Level2: [
  350. MetaSource: [
  351. right
  352. B1: box wid 5*mb
  353. ]
  354. Source: [
  355. right
  356. B1: box fill f
  357. B2: box fill f
  358. B3: box fill f
  359. B4: box fill f
  360. B5: box fill f
  361. B6: box fill f
  362. B7: box fill f
  363. B8: box fill f
  364. ] with .nw at MetaSource.sw+(0,-.1)
  365. File: box wid 0.8 with .nw at Source.sw+(0,-.1)
  366. ] with .nw at Level1.sw+(0.6,-.1)
  367. line -> from VtRoot.B1 down boxwid/2+0.1+boxwid/2 then to Root.w
  368. line -> from Root.B3 down boxwid/2+0.1+boxwid/2 then to Level1.RootMeta.w
  369. line -> from Root.B2 down boxwid/2+0.1+boxwid+0.1+boxwid/2 then to Level1.MetaSource.w
  370. line -> from Root.B1 down boxwid/2+0.1+boxwid+0.1+boxwid+0.1+boxwid/2 then to Level1.Source.w
  371. line -> from Level1.SB3 down boxwid/2+0.1+boxwid/2 then to Level2.MetaSource.w
  372. line -> from Level1.SB2 down boxwid/2+0.1+boxwid+0.1+boxwid/2 then to Level2.Source.w
  373. line -> from Level1.SB1 down boxwid/2+0.1+boxwid+0.1+boxwid+0.1+boxwid/2 then to Level2.File.w
  374. [
  375. KEY: box wid 1.5 invis "Key"
  376. line from KEY.sw to KEY.se
  377. k = -.1
  378. kk=0.5
  379. A: [
  380. box wid 4*boxwid
  381. "Venti file" ljust with .w at last box .w+(kk,0)
  382. ] with .nw at KEY.sw+(0,2*k)
  383. B: [
  384. box fill f
  385. "Venti entry (\f(CWVtEntry\fP)" ljust with .w at last box .w+(kk,0)
  386. ] with .nw at A.sw+(0,k)
  387. C: [
  388. right
  389. CC: box fill f
  390. box fill f
  391. box fill f
  392. box fill f
  393. "Venti directory" ljust with .w at CC.w+(kk,0)
  394. ] with .nw at B.sw+(0,k)
  395. D: [
  396. line -> right 3*boxwid
  397. "Venti pointer (score)" ljust with .w at last line .w+(kk, 0)
  398. ] with .nw at C.sw+(0,k)
  399. ] with .nw at VtRoot.nw+(3,0)
  400. .PE
  401. .LP
  402. Venti files are shown as white boxes, while directories are shown
  403. as shaded boxes. Each shaded square represents a
  404. .CW VtEntry .
  405. Arrows represent pointers from
  406. .CW VtEntry
  407. structures to other
  408. Venti files or directories.
  409. .PP
  410. The hierarchical structure provided by Venti files and directories
  411. can be used as the base for more complicated data structures.
  412. Because this structure captures all the information
  413. about pointers to other blocks, tools written to traverse
  414. Venti hierarchies can traverse the more complicated
  415. data structures as well.
  416. For example,
  417. .I venti/copy
  418. (see
  419. .I ventiaux (8))
  420. copies a Venti hierarchy from one Venti server to another,
  421. given the root
  422. .CW VtEntry .
  423. Because the traditional file system described in later sections is
  424. layered on a Venti hierarchy,
  425. .I venti/copy
  426. can copy it without fully understanding its structure.
  427. .NH 1
  428. Vac file system format
  429. .HP
  430. The Venti archive format
  431. .I vac
  432. builds a traditional file system using a Venti hierarchy.
  433. Each vac file is implemented as a Venti file;
  434. each vac directory is implemented as a Venti
  435. directory and a Venti file to provide traditional file system metadata.
  436. The metadata is stored in a structure called a
  437. .CW DirEntry :
  438. .P1
  439. DirEntry:
  440. .ta +\w' 'u +\w' 'u
  441. magic[4] \f(CW0x1c4d9072\fP (DirMagic)\fP
  442. version[2] \f(CW9\fP
  443. elem[s] \fRname (final path element only)\fP
  444. entry[4] \fRentry number for Venti file or directory\fP
  445. gen[4] \fRgeneration number\fP
  446. mentry[4] \fRentry number for Venti file holding metadata\fP
  447. mgen[4] \fRgeneration number\fP
  448. qid[8] \fRunique file serial number\fP
  449. uid[s] \fRowner\fP
  450. gid[s] \fRgroup\fP
  451. mid[s] \fRlast modified by\fP
  452. mtime[4] \fRlast modification time\fP
  453. ctime[4] \fRcreation time\fP
  454. atime[4] \fRlast access time\fP
  455. mode[4] \fRmode bits\fP
  456. .P2
  457. The notation
  458. .CW name[s]
  459. denotes a string stored as a two-byte length
  460. and then that many bytes.
  461. The above describes Version 9 of the
  462. .CW DirEntry
  463. format. Versions 7 and 8 are very similar; they can be
  464. read by the current
  465. .I vac
  466. source code but are not written.
  467. Earlier versions were not widespread.
  468. A
  469. .CW DirEntry
  470. may be followed by optional extension sections, though none
  471. are currently used.
  472. The
  473. .CW mode
  474. bits include bits commonly used by
  475. Unix and Windows, in addition to those used by Plan 9.
  476. .PP
  477. The
  478. .CW entry
  479. field is an index into the parallel Venti directory.
  480. The
  481. .CW gen
  482. field must match the
  483. .CW gen
  484. field in the corresponding
  485. .CW VtEntry
  486. in the directory;
  487. it is used to detect
  488. stale indices.
  489. Similarly,
  490. .CW mentry
  491. and
  492. .CW mgen
  493. are the index and generation number
  494. for the metadata Venti file,
  495. if the
  496. .CW DirEntry
  497. describes a vac directory.
  498. .PP
  499. The relation between Venti files and directories and
  500. vac files and directories can be seen in this figure:
  501. .PS
  502. .ps 8
  503. .vs 10
  504. boxwid=0.1
  505. boxht=0.1
  506. f=0.9
  507. mb=0.16
  508. VtRoot: [
  509. right
  510. B1: box
  511. move right 0.1
  512. "\f(CWVtRoot\fP" ljust
  513. ]
  514. SuperRoot: [
  515. right
  516. B1: box fill f
  517. move right 0.1
  518. "fs root block" ljust
  519. ] with .nw at VtRoot.sw + (0.2, -.2)
  520. Root: [
  521. right
  522. B1: box fill f
  523. B2: box fill f
  524. B3: box fill f
  525. move right 0.1
  526. "root directory info block" ljust
  527. ] with .nw at SuperRoot.sw+(0.2, -.2)
  528. Level1: [
  529. RootMeta: [
  530. box wid mb
  531. move right 0.1
  532. "root metadata" ljust
  533. ]
  534. MetaSource: [
  535. right
  536. B1: box wid mb
  537. B2: box wid mb
  538. B3: box wid mb
  539. B4: box wid mb
  540. B5: box wid mb
  541. ] with .nw at RootMeta.sw+(0,-.2)
  542. MB1: box wid mb invis at MetaSource.B1
  543. MB2: box wid mb invis at MetaSource.B2
  544. MB3: box wid mb invis at MetaSource.B3
  545. MB4: box wid mb invis at MetaSource.B4
  546. MB5: box wid mb invis at MetaSource.B5
  547. Source: [
  548. right
  549. B1: box fill f
  550. B2: box fill f
  551. B3: box fill f
  552. B4: box fill f
  553. B5: box fill f
  554. B6: box fill f
  555. B7: box fill f
  556. B8: box fill f
  557. ] with .nw at MetaSource.sw+(0,-.1)
  558. SB1: box invis at Source.B1
  559. SB2: box invis at Source.B2
  560. SB3: box invis at Source.B3
  561. SB4: box invis at Source.B4
  562. SB5: box invis at Source.B5
  563. SB6: box invis at Source.B6
  564. SB7: box invis at Source.B7
  565. SB8: box invis at Source.B8
  566. ] with .nw at Root.sw+(0.4,-.2)
  567. Level2: [
  568. MetaSource: [
  569. right
  570. B1: box wid mb
  571. B2: box wid mb
  572. B3: box wid mb
  573. B4: box wid mb
  574. B5: box wid mb
  575. ]
  576. Source: [
  577. right
  578. B1: box fill f
  579. B2: box fill f
  580. B3: box fill f
  581. B4: box fill f
  582. B5: box fill f
  583. B6: box fill f
  584. B7: box fill f
  585. B8: box fill f
  586. ] with .nw at MetaSource.sw+(0,-.1)
  587. File: box wid 0.8 with .nw at Source.sw+(0,-.2)
  588. ] with .nw at Level1.sw+(0.6,-.2)
  589. line -> from VtRoot.B1 down boxwid/2+0.2+boxwid/2 then to SuperRoot.w
  590. line -> from SuperRoot.B1 down boxwid/2+0.2+boxwid/2 then to Root.w
  591. line -> from Root.B3 down boxwid/2+0.2+boxwid/2 then to Level1.RootMeta.w
  592. line -> from Root.B2 down boxwid/2+0.2+boxwid+0.2+boxwid/2 then to Level1.MetaSource.w
  593. line -> from Root.B1 down boxwid/2+0.2+boxwid+0.1+boxwid+0.2+boxwid/2 then to Level1.Source.w
  594. line -> from Level1.SB3 down boxwid/2+0.2+boxwid/2 then to Level2.MetaSource.w
  595. line -> from Level1.SB2 down boxwid/2+0.2+boxwid+0.1+boxwid/2 then to Level2.Source.w
  596. line -> from Level1.SB1 down boxwid/2+0.2+boxwid+0.1+boxwid+0.2+boxwid/2 then to Level2.File.w
  597. arrowwid = arrowwid/2
  598. arrowht = arrowht/2
  599. line -> from Level1.MB1 to Level1.SB1.n
  600. line -> from Level1.MB2 to Level1.SB2.n
  601. line -> from Level1.MB2 to Level1.SB3.n
  602. line -> from Level1.MB4 to Level1.SB7.n
  603. line -> from Level1.MB5 to Level1.SB5.n
  604. arrowwid = arrowwid * 2
  605. arrowht = arrowht * 2
  606. box dashed with .nw at Level1.MetaSource.nw+(-.05,.05) wid 0.8+.05*2 ht .3+.05*2
  607. box dashed with .nw at Level2.MetaSource.nw+(-.05,.05) wid 0.8+.05*2 ht .3+.05*2
  608. box dotted with .nw at Level2.File.nw+(-.05,.05) wid 0.8+0.05*2 ht .1+.05*2
  609. [
  610. KEY: box wid 1.5 invis "Key"
  611. line from KEY.sw to KEY.se
  612. k = -.1
  613. kk=0.5
  614. A: [
  615. box wid 4*boxwid
  616. "Venti file" ljust with .w at last box .w+(kk,0)
  617. ] with .nw at KEY.sw+(0,2*k)
  618. B: [
  619. box fill f
  620. "Venti entry (\f(CWEntry\fP)" ljust with .w at last box .w+(kk,0)
  621. ] with .nw at A.sw+(0,k)
  622. C: [
  623. right
  624. CC: box fill f
  625. box fill f
  626. box fill f
  627. box fill f
  628. "Venti directory" ljust with .w at CC.w+(kk,0)
  629. ] with .nw at B.sw+(0,k)
  630. D: [
  631. line -> right 3*boxwid
  632. "Venti pointer (score)" ljust with .w at last line .w+(kk, 0)
  633. ] with .nw at C.sw+(0,k)
  634. DD: [
  635. box dotted wid 4*boxwid
  636. "Vac file" ljust with .w at last box .w+(kk,0)
  637. ] with .nw at D.sw+(0,k)
  638. E: [
  639. box wid mb
  640. "Vac entry (\f(CWDirEntry\fP)" ljust with .w at last box .w+(kk,0)
  641. ] with .nw at DD.sw+(0,k)
  642. G: [
  643. box dashed wid 4*boxwid
  644. "Vac directory" ljust with .w at last box .w+(kk,0)
  645. ] with .nw at E.sw+(0,k)
  646. H: [
  647. arrowwid = arrowwid/2
  648. arrowht = arrowht/2
  649. line -> right 1.5*boxwid
  650. "Vac pointer (integer index)" ljust with .w at last line .w+(kk, 0)
  651. arrowwid = arrowwid * 2
  652. arrowht = arrowht * 2
  653. ] with .nw at G.sw+(0,k)
  654. ] with .nw at VtRoot.nw+(3,0)
  655. .PE
  656. .LP
  657. In reality, the story is slightly more complicated.
  658. The metadata file in a Vac directory
  659. is not just the concatenation of
  660. .CW DirEntry
  661. structures.
  662. Instead, it is the concatenation of
  663. .CW MetaBlocks .
  664. A
  665. .CW MetaBlock
  666. contains some number of
  667. .CW DirEntry
  668. structures along with a sorted index to make it easy
  669. to look for a particular
  670. .CW DirEntry
  671. by its
  672. .CW elem
  673. field.
  674. The details are in the source code.
  675. .PP
  676. As shown in the diagram,
  677. the root directory of the file system is summarized by
  678. three
  679. .CW VtEntry
  680. structures describing
  681. the Venti directory for the children of the root,
  682. the Venti file for the metadata describing the children of the root,
  683. and a Venti file holding metadata for the root directory itself.
  684. These
  685. .CW VtEntry
  686. structures are placed in a Venti directory of their own,
  687. described by the single
  688. .CW VtEntry
  689. in the
  690. root block.
  691. .NH 1
  692. Fossil file system format
  693. .HP
  694. Fossil uses the vac format, with some small changes.
  695. The changes only affect the data on the local disk; the data
  696. archived to Venti is exactly in vac format.
  697. .PP
  698. Blocks stored on local disk may contain scores pointing at local disk
  699. blocks or at Venti blocks.
  700. Local block addresses are stored as 20-byte scores in which the first 16 bytes
  701. are all zero and the last 4 bytes specify a block number in the disk.
  702. Before a block is archived, all the
  703. blocks it points to must be archived, and the local scores in the block
  704. must be changed to Venti scores.
  705. Using block addresses rather than content hashes for local data
  706. makes the local file system easier to manage: if a local block's contents
  707. change, the pointer to the block does not need to change.
  708. .NH 2
  709. Snapshots
  710. .HP
  711. Fossil is an archival file server.
  712. It takes periodic snapshots of the file system,
  713. which are made accessible through the file system.
  714. Specifically, the active file system is presented in
  715. .CW /active .
  716. Ephemeral snapshots (those that are kept on local disk and eventually deleted)
  717. are presented in
  718. \f(CW/snapshot/\fIyyyy\f(CW/\fImmdd\f(CW/\fIhhmm\fR,
  719. where
  720. .I yyyy
  721. is the full year,
  722. .I mm
  723. is the month number,
  724. .I dd
  725. is the day number,
  726. .I hh
  727. is the hour,
  728. and
  729. .I mm
  730. is the minute.
  731. Archival snapshots (those that are archived to Venti and persist forever)
  732. are presented in
  733. \f(CW/archive/\fIyyyy\f(CW/\fImmdds\fR,
  734. where
  735. .I yyyy ,
  736. .I mm ,
  737. and
  738. .I dd
  739. are year, month, and day as before,
  740. and
  741. .I s
  742. is a sequence number if more than one
  743. archival snapshot is done in a day.
  744. For the first snapshot,
  745. .I s
  746. is null.
  747. For the subsequent snapshots,
  748. .I s
  749. is
  750. .CW .1 ,
  751. .CW .2 ,
  752. .CW .3 ,
  753. etc.
  754. .PP
  755. To implement the snapshots, the file server maintains a
  756. current
  757. .I epoch
  758. for the active file system.
  759. Each local block has a label that records, among other things,
  760. the epoch in which the block was allocated.
  761. If a block was allocated in an epoch earlier than the current one,
  762. it is immutable and treated as copy-on-write.
  763. Taking a snapshot can be accomplished by
  764. recording the address of the current root block and then
  765. incrementing the epoch number.
  766. Notice that the copy-on-write method makes
  767. snapshots both time efficient and space efficient.
  768. The only time cost is waiting for all current file system
  769. requests to finish and then incrementing a counter.
  770. After a snapshot, blocks only get copied when they are
  771. next modified, so the per-snapshot
  772. space requirement is proportional
  773. to the amount of new data rather than the total
  774. size of the file system.
  775. .PP
  776. The blocks in the archival snapshots are moved to Venti,
  777. but the blocks in the ephemeral snapshots take up space
  778. in the local disk file.
  779. To allow reclamation of this disk space, the file system
  780. maintains a
  781. .I low
  782. .I epoch ,
  783. which is the epoch of the earliest ephemeral snapshot
  784. still available.
  785. Fossil only allows access to snapshots with epoch numbers
  786. between the
  787. low epoch and the current epoch
  788. (also called the high epoch).
  789. Incrementing the low epoch thus makes old
  790. snapshots inaccessible.
  791. The space required to store those snapshots can then
  792. be reclaimed, as described below.
  793. .NH 2
  794. Local blocks
  795. .HP
  796. The bulk of the local disk file is the local blocks.
  797. Each block has a 14-byte label associated with it, of the format:
  798. .P1
  799. Label:
  800. .ta +\w' 'u +\w' 'u
  801. state[1] \fRblock state\fP
  802. type[1] \fRblock type\fP
  803. epoch[4] \fRallocation epoch\fP
  804. epochClose[4] \fRclose epoch\fP
  805. tag[4] \fRrandom tag\fP
  806. .P2
  807. .LP
  808. The
  809. .CW type
  810. is an analogue of the block types described earlier,
  811. though different names are used, to distinguish between
  812. pointers blocks in a hash tree for a data stream
  813. and pointer blocks for a directory stream.
  814. The
  815. .CW epoch
  816. was mentioned in the last section.
  817. The other fields are explained below.
  818. .PP
  819. There are two distinguished blocks states
  820. .CW BsFree
  821. .CW 0x00 ) (
  822. and
  823. .CW BsBad
  824. .CW 0xFF ), (
  825. which mark blocks that are available for allocation
  826. and blocks that are bad and should be avoided.
  827. If
  828. .CW state
  829. is not one of these values, it is a bitwise
  830. .I or ' `
  831. of the following flags:
  832. .P1
  833. .ta +\w' 'u +\w' 'u
  834. 0x01 BsAlloc \fRblock is in use\fP
  835. 0x02 BsCopied \fRblock has been copied\fP
  836. 0x04 BsVenti \fRblock has been stored on Venti\fP
  837. 0x08 BsClosed \fRblock has been unlinked from active file system\fP
  838. .P2
  839. .LP
  840. The flags are explained as they arise in the discussions below.
  841. .PP
  842. It is convenient to store some extra fields in the
  843. .CW VtEntry
  844. structure when it describes a Venti file or directory
  845. stored on local disk.
  846. Specifically, we set the
  847. .CW VtEntryLocal
  848. flag bit
  849. and then use the bytes 7-16 of the score (which would
  850. otherwise be zero, since it is a local score) to hold these fields:
  851. .P1
  852. .ta +\w' 'u +\w' 'u
  853. archive[1] \fRboolean: this is an archival snapshot\fP
  854. snap[4] \fRepoch number if root of snapshot\fP
  855. tag[4] \fRrandom tag\fP
  856. .P2
  857. .LP
  858. The extended
  859. .CW VtEntry
  860. structure is called an
  861. .CW Entry .
  862. The
  863. .CW tag
  864. field
  865. in the
  866. .CW Label
  867. and the
  868. .CW Entry
  869. is used to identify dangling pointers or other file system corruption:
  870. all the local blocks in a hash tree must
  871. have tags matching the tag in the
  872. .CW Entry .
  873. If this
  874. .CW Entry
  875. points at the root of a snapshot,
  876. the
  877. .CW snap
  878. field is the epoch of the snapshot.
  879. If the snapshot is intended to be archived to Venti,
  880. the
  881. .CW archive
  882. field is non-zero.
  883. .NH 2
  884. Block reclamation
  885. .HP
  886. The blocks in the active file system form a tree: each
  887. block has only one parent.
  888. Once a copy-on-write block
  889. .I b
  890. is replaced by its copy, it is no longer
  891. needed by the active file system.
  892. At this point,
  893. .I b
  894. is unlinked from the active file system.
  895. We say that
  896. .I b
  897. is now
  898. .I closed :
  899. it is needed only for snapshots.
  900. When a block is closed, the
  901. .CW BsClosed
  902. bit is set in its state, and the current epoch (called the block's closing epoch)
  903. is stored in the
  904. .CW epochClose
  905. label field.
  906. (Open blocks have an
  907. .CW epochClose
  908. of
  909. .CW ~0 ).
  910. .PP
  911. A block is referenced by snapshots with epochs
  912. between the block's allocation epoch and its closing epoch.
  913. Once the file system's low epoch grows to be greater than or equal to the block's
  914. closing epoch, the block is no longer needed for any snapshots
  915. and can be reused.
  916. .PP
  917. In a typical configuration, where nightly archival snapshots
  918. are taken and written to Venti, it is desirable to reclaim
  919. the space occupied by now-archived blocks if possible.
  920. To do this, Fossil keeps track of whether the pointers
  921. in each block are unique to that block.
  922. When a block
  923. .I bb
  924. is allocated, a pointer to
  925. .I bb
  926. is written into exactly one active block (say,
  927. .I b ).
  928. In the absence of snapshots, the pointer to
  929. .I bb
  930. will remain unique to
  931. .I b ,
  932. so that if the pointer is zeroed,
  933. .I bb
  934. can be immediately reused.
  935. Snapshots complicate this invariant:
  936. when
  937. .I b
  938. is copied-on-write, all its pointers
  939. are no longer unique to it.
  940. At time of the copy, the
  941. .CW BsCopied
  942. state bit in the block's label
  943. is set to note the duplication of the pointers contained within.
  944. .NH 2
  945. Disk layout
  946. .HP
  947. The file system header describes the file system layout and has this format:
  948. .P1
  949. .ta +\w' 'u +\w' 'u
  950. Header:
  951. magic[4] \fR0x3776AE89 (HeaderMagic)\fP
  952. version[2] \fR1 (HeaderVersion)\fP
  953. blockSize[2] \fIfile system block size\fP
  954. super[4] \fRblock offset of super block\fP
  955. label[4] \fRblock offset of labels\fP
  956. data[4] \fRdata blocks\fP
  957. end[4] \fRend of file system\fP
  958. .P2
  959. .LP
  960. The corresponding file system layout is:
  961. .PS
  962. .ps 8
  963. .vs 9
  964. boxwid=0.75
  965. boxht=0.15
  966. Empty: box "empty" ht 0.25
  967. Header: box "header" with .n at Empty.s
  968. Empty2: box "empty" with .n at Header.s
  969. Super: box "super block" with .n at Empty2.s
  970. Label: box "label" "blocks" with .n at Super.s ht 0.25
  971. Data: box "data" "blocks" with .n at Label.s ht 0.3
  972. " 0" ljust at Empty.ne
  973. " 128kB" ljust at Header.ne
  974. " \f5super\fP \(mu \f(CWblockSize\fP" ljust at Super.ne
  975. " \f5label\fP \(mu \f(CWblockSize\fP" ljust at Label.ne
  976. " \f5data\fP \(mu \f(CWblockSize\fP" ljust at Data.ne
  977. " \f5end\fP \(mu \f(CWblockSize\fP" ljust at Data.se
  978. "" at (-1,0)
  979. "" at (6,0)
  980. .PE
  981. .LP
  982. The numbers to the right of the blocks are byte offsets
  983. of the boundaries.
  984. .LP
  985. The super block describes the file system itself and looks like:
  986. .P1
  987. .ta +\w' 'u +\w' 'u
  988. Super:
  989. magic[4] \fR0x2340A3B1 (SuperMagic)\fP
  990. version[2] \fR1 (SuperVersion)\fP
  991. epochLow[4] \fRfile system low epoch\fP
  992. epochHigh[4] \fRfile system high (active) epoch\fP
  993. qid[8] \fRnext qid to allocate\fP
  994. active[4] \fRdata block number: root of active file system\fP
  995. next[4] \fRdata block number: root of next file system to archive\fP
  996. current[4] \fRdata block number: root of file system currently being archived\fP
  997. last[20] \fRVenti score of last successful archive\fP
  998. name[128] \fRname of file system (just a comment)\fP
  999. .P2
  1000. .LP
  1001. .NH 1
  1002. Fossil server
  1003. .HP
  1004. The Fossil server is a user-space program that runs on a standard Plan 9 kernel.
  1005. .NH 2
  1006. Process structure
  1007. .PP
  1008. The file server is structured as a set of processes synchronizing
  1009. mostly through message passing along queues.
  1010. The processes are given names, which can be seen in the output of
  1011. .CW ps
  1012. .CW -a .
  1013. .PP
  1014. .CW Listen
  1015. processes announce on various network addresses.
  1016. A
  1017. .CW con
  1018. process handles each incoming connection, reading 9P requests
  1019. and adding them to a central message queue.
  1020. .CW Msg
  1021. processes remove 9P requests from the queue,
  1022. handle them, and write the responses to the appropriate
  1023. file descriptors.
  1024. .PP
  1025. The
  1026. .CW disk
  1027. process handles disk I/O requests made by the other processes.
  1028. The
  1029. .CW flush
  1030. process writes dirty blocks from the in-memory block cache to disk.
  1031. The
  1032. .CW unlink
  1033. process frees previously linked blocks once the blocks that point at them
  1034. have been written to disk.
  1035. .PP
  1036. A
  1037. .CW consI
  1038. reads from each console file (typically a pipe posted in
  1039. .CW /srv ),
  1040. adding the typed characters to the input queue.
  1041. The
  1042. .CW cons
  1043. process echoes input and runs the commands, saving
  1044. output in a ring buffer.
  1045. Because there is only one
  1046. .CW cons
  1047. process, only one console command may be executing at a time.
  1048. A
  1049. .CW consO
  1050. process copies this ring buffer to each console file.
  1051. .PP
  1052. The
  1053. .CW periodic
  1054. process runs periodic events, like
  1055. flushing the root metadata to disk or
  1056. taking snapshots of the file system.
  1057. .NH 2
  1058. Block cache
  1059. .HP
  1060. Fossil maintains an in-memory block cache which
  1061. holds both local disk blocks and Venti blocks.
  1062. Cache eviction follows a least recently used policy.
  1063. Dirty blocks are restricted to at most half the cache.
  1064. This can be changed by editing
  1065. .CW DirtyPercentage
  1066. in
  1067. .CW dat.h .
  1068. .PP
  1069. The block cache uses soft updates [1] to ensure that the on-disk
  1070. file system is always self-consistent.
  1071. Thus there is no
  1072. .I halt
  1073. console command
  1074. and no need to check a file system
  1075. that was shut down without halting.
  1076. .NH 2
  1077. Archiving
  1078. .HP
  1079. A background process writes blocks in archival snapshots to Venti.
  1080. Although
  1081. .CW /archive/\fIyyyy\fP/\fImmdds\fR
  1082. is a copy of only
  1083. .CW /active
  1084. at the time of the snapshot,
  1085. the archival process archives the
  1086. entire file tree rather than just
  1087. the subtree rooted at
  1088. .CW /active .
  1089. The snapshots
  1090. .CW /snapshot/\fIyyyy\fP/\fImmdd\fP/\fIhhmm
  1091. are stored as empty directories.
  1092. Once all the blocks have been archived,
  1093. a
  1094. .CW VtRoot
  1095. header for the file system is archived.
  1096. The score of that header is recorded in
  1097. .CW super.score
  1098. and also printed on the file server console.
  1099. The score can used by
  1100. .I flfmt
  1101. to restore a file system (see
  1102. .I fossil (4)).
  1103. .NH 2
  1104. Contrast with the old file server
  1105. .HP
  1106. The most obvious difference between Fossil and the
  1107. old Plan 9 file server [2] is that Fossil uses a Venti server as
  1108. its archival storage in place of a WORM juke box.
  1109. There are a few other architectural differences to be
  1110. aware of.
  1111. .PP
  1112. Fossil is a user-level program run on a standard kernel.
  1113. .PP
  1114. Fossil does not have any way to concatenate, stripe, or
  1115. mirror disk files. For functionality similar to the old file server's
  1116. configuration strings, use the experimental file stack device
  1117. (see
  1118. .I fs (3)).
  1119. .PP
  1120. Fossil speaks only 9P2000. Old 9P (aka 9P1) is not supported.
  1121. .PP
  1122. ... XXX words about converting an old file system to fossil?
  1123. .NH 1
  1124. References
  1125. .LP
  1126. [1] Gregory R. Ganger, Marshall Kirk McKusick, Craig A. N. Soules,
  1127. and Yale N. Patt.
  1128. ``Soft Updates: A Solution to the Metadata Update Problem
  1129. in File Systems,''
  1130. .I "ACM Transactions on Computer Systems" ,
  1131. Vol 18., No. 2, May 2000, pp. 127\-153.
  1132. .LP
  1133. [2] Sean Quinlan, ``A Cached WORM File System,''
  1134. .I "Software\(emPractice and Experience" ,
  1135. Vol 21., No 12., December 1991, pp. 1289\-1299.
  1136. .LP
  1137. [3] Sean Quinlan and Sean Dorward, ``Venti: A New Approach to Archival Storage,''
  1138. .I "Usenix Conference on File and Storage Technologies" ,
  1139. 2002.