12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989910010110210310410510610710810911011111211311411511611711811912012112212312412512612712812913013113213313413513613713813914014114214314414514614714814915015115215315415515615715815916016116216316416516616716816917017117217317417517617717817918018118218318418518618718818919019119219319419519619719819920020120220320420520620720820921021121221321421521621721821922022122222322422522622722822923023123223323423523623723823924024124224324424524624724824925025125225325425525625725825926026126226326426526626726826927027127227327427527627727827928028128228328428528628728828929029129229329429529629729829930030130230330430530630730830931031131231331431531631731831932032132232332432532632732832933033133233333433533633733833934034134234334434534634734834935035135235335435535635735835936036136236336436536636736836937037137237337437537637737837938038138238338438538638738838939039139239339439539639739839940040140240340440540640740840941041141241341441541641741841942042142242342442542642742842943043143243343443543643743843944044144244344444544644744844945045145245345445545645745845946046146246346446546646746846947047147247347447547647747847948048148248348448548648748848949049149249349449549649749849950050150250350450550650750850951051151251351451551651751851952052152252352452552652752852953053153253353453553653753853954054154254354454554654754854955055155255355455555655755855956056156256356456556656756856957057157257357457557657757857958058158258358458558658758858959059159259359459559659759859960060160260360460560660760860961061161261361461561661761861962062162262362462562662762862963063163263363463563663763863964064164264364464564664764864965065165265365465565665765865966066166266366466566666766866967067167267367467567667767867968068168268368468568668768868969069169269369469569669769869970070170270370470570670770870971071171271371471571671771871972072172272372472572672772872973073173273373473573673773873974074174274374474574674774874975075175275375475575675775875976076176276376476576676776876977077177277377477577677777877978078178278378478578678778878979079179279379479579679779879980080180280380480580680780880981081181281381481581681781881982082182282382482582682782882983083183283383483583683783883984084184284384484584684784884985085185285385485585685785885986086186286386486586686786886987087187287387487587687787887988088188288388488588688788888989089189289389489589689789889990090190290390490590690790890991091191291391491591691791891992092192292392492592692792892993093193293393493593693793893994094194294394494594694794894995095195295395495595695795895996096196296396496596696796896997097197297397497597697797897998098198298398498598698798898999099199299399499599699799899910001001100210031004100510061007100810091010101110121013101410151016101710181019102010211022102310241025102610271028102910301031103210331034103510361037103810391040104110421043104410451046104710481049105010511052105310541055105610571058105910601061106210631064106510661067106810691070107110721073107410751076107710781079108010811082108310841085108610871088108910901091109210931094109510961097109810991100110111021103110411051106110711081109111011111112111311141115111611171118111911201121112211231124112511261127112811291130113111321133113411351136113711381139114011411142114311441145114611471148114911501151115211531154115511561157115811591160116111621163 |
- .HTML "Fossil, an Archival File Server
- ... .FP times
- ... .fp 1 R R.nomath
- ... .fp 5 CW LucidaSansCW83
- .TL
- Fossil, an Archival File Server
- .AU
- Sean Quinlan
- Jim McKie
- Russ Cox
- jmk,rsc@plan9.bell-labs.com
- .AB
- This paper describes the internals and
- operation of Fossil, an archival file server built for Plan 9.
- Fossil has not yet replaced the current Plan 9 file server
- and
- .CW kfs ,
- but that is our eventual intent.
- Both fossil and this documentation are
- works in progress. Comments on either are most welcome.
- .AE
- .de HP
- .LP
- ..
- .NH 1
- Introduction
- .HP
- Fossil is an archival file server built for Plan 9.
- In a typical configuration, it maintains a traditional file system
- in a local disk partition and periodically archives snapshots of the file system
- to a Venti server. These archives are made available through
- a file system interface.
- Fossil can also be run without a Venti server, in which case the
- snapshots (if any) occupy local disk space.
- .PP
- The bulk of this paper explains the underlying data structures:
- Venti trees, the Venti archival file system format, and finally Fossil's
- file system format.
- The end of the paper discusses the architecture of the Fossil server.
- .PP
- The presentation of the data structures is very detailed, perhaps
- too detailed for most readers.
- The intent is to record all the details necessary to make structural
- changes to the file system format.
- Feel free to jump ahead when boredom strikes.
- .NH 1
- Venti trees and directory hierarchies
- .HP
- Venti [3] is an archival block storage server.
- Once a block is stored, it can be retrieved by presenting the 20-byte
- SHA1 hash of its contents, called a
- .I score .
- Blocks on Venti have a maximum length of about 56 kilobytes,
- though in practice smaller blocks are used.
- To store a byte stream of arbitrary length, Venti uses a hash tree.
- Conceptually, the data stream is broken into fixed-size (say,
- .I dsize -byte)
- chunks, which are stored on the Venti server.
- The resulting scores are concatenated into a new pointer stream, which is
- broken into fixed size (say,
- .I psize -byte)
- chunks, which are stored on the Venti server.
- .I Psize "" (
- is different from
- .I dsize
- so that we can ensure that each pointer block holds an
- integral number of pointers.)
- This yields a new pointer stream, and so on, until there is a single block
- and finally a single score describing the entire tree.
- The resulting structure looks like:
- .PS
- .ps 8
- .vs 10
- boxht=0.1
- boxwid=0.1
- B0: box invis wid 1 "\f(CWVtDataType\fP"
- move right 0.1
- L0a: box wid 0.2
- move right 0.1
- L0b: box wid 0.2
- move right 0.1
- L0c: box invis wid 0.2 "..."
- move right 0.1
- L0d: box wid 0.2
- move right 0.1
- L0e: box wid 0.2
- move right 0.2
- L0f: box invis wid 0.2 "..."
- move right 0.2
- L0g: box wid 0.2
- move right 0.1
- L0h: box wid 0.2
- move right 0.1
- L0i: box invis wid 0.2 "..."
- move right 0.1
- L0j: box wid 0.2
- move right 0.1
- L0k: box wid 0.2
- move right 0.1
- L0l: box invis wid 0.2 "..."
- move right 0.1
- L0m: box wid 0.2
- define boxppddd {
- line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
- line from 0.4<$1.nw,$1.ne> to 0.4<$1.sw,$1.se>
- X: box invis at 0.1<$1.nw,$1.ne>
- Y: box invis at 0.1<$1.sw,$1.se>
- line -> from 0.5<X,Y> to $2.nw
- X: box invis at 0.3<$1.nw,$1.ne>
- Y: box invis at 0.3<$1.sw,$1.se>
- line -> from 0.5<X,Y> to $3.nw
- "..." at 0.7<$1.w,$1.e>
- }
- define boxppdddp {
- line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
- line from 0.4<$1.nw,$1.ne> to 0.4<$1.sw,$1.se>
- line from 0.8<$1.nw,$1.ne> to 0.8<$1.sw,$1.se>
- X: box invis at 0.1<$1.nw,$1.ne>
- Y: box invis at 0.1<$1.sw,$1.se>
- line -> from 0.5<X,Y> to $2.nw
- X: box invis at 0.3<$1.nw,$1.ne>
- Y: box invis at 0.3<$1.sw,$1.se>
- line -> from 0.5<X,Y> to $3.nw
- "..." at 0.6<$1.w,$1.e>
- X: box invis at 0.9<$1.nw,$1.ne>
- Y: box invis at 0.9<$1.sw,$1.se>
- line -> from 0.5<X,Y> to $4.nw
- }
- define boxpdddp {
- line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
- line from 0.8<$1.nw,$1.ne> to 0.8<$1.sw,$1.se>
- X: box invis at 0.1<$1.nw,$1.ne>
- Y: box invis at 0.1<$1.sw,$1.se>
- line -> from 0.5<X,Y> to $2.nw
- "..." at 0.5<$1.w,$1.e>
- X: box invis at 0.9<$1.nw,$1.ne>
- Y: box invis at 0.9<$1.sw,$1.se>
- line -> from 0.5<X,Y> to $3.nw
- }
- bhd=0.4
- L1abc: box wid 0.5 at 0.5<L0a, L0b>+(0,bhd)
- boxppddd(L1abc, L0a, L0b)
- L1def: box wid 0.5 at 0.5<L0d, L0e>+(0,bhd)
- boxppddd(L1def, L0d, L0e)
- L1ghi: box wid 0.5 at 0.5<L0g, L0h>+(0,bhd)
- boxppddd(L1ghi, L0g, L0h)
- L1jklm: box wid 0.5 at 0.5<L0j, L0k>+(0,bhd)
- boxppdddp(L1jklm, L0j, L0k, L0m)
- B1: box invis wid 1 "\f(CWVtPointerType0\fP" at B0+(0,bhd)
- L2abcdef: box wid 0.5 at 0.5<L1abc,L1def>+(0,bhd)
- boxppddd(L2abcdef, L1abc, L1def)
- L2ghijklm: box wid 0.5 at 0.5<L1ghi,L1jklm>+(0,bhd)
- boxpdddp(L2ghijklm, L1ghi, L1jklm)
- B2: box invis wid 1 "\f(CWVtPointerType1\fP" at B1+(0,bhd)
- L3atom: box wid 0.5 at 0.5<L2abcdef, L2ghijklm>+(0,bhd)
- boxpdddp(L3atom, L2abcdef, L2ghijklm)
- B3: box invis wid 1 "\f(CWVtPointerType2\fP" at B2+(0,bhd)
- .PE
- .LP
- The leaves are the original data stream. Those blocks have type
- .CW VtDataType .
- The first pointer stream has type
- .CW VtPointerType0 ,
- the next has type
- .CW VtPointerType1 ,
- and so on.
- The figure ends with a single block of type
- .CW VtPointerType2 ,
- but in general trees can have height up to
- .CW VtPointerType6 .
- For a
- .I dsize
- of 8192 bytes
- and
- .I psize
- of 8180 bytes (409 pointers),
- this gives a maximum stream size of approximately 10 zettabytes
- (2\s-2\u73\d\s+2 or 10\s-2\u22\d\s+2 bytes).
- .PP
- Data blocks are truncated to remove trailing runs of zeros before
- storage to Venti; they are zero-filled back to
- .I dsize
- bytes after retrieval from Venti.
- Similarly, trailing runs of pointers to zero-length blocks are
- removed from and added back to pointer blocks.
- These simple rules happen to make it particularly efficient to store
- large runs of zeros, as might occur in a data stream with ``holes:''
- the zero-length block itself can be interpreted as a tree of any
- depth encoding an all-zero data stream.
- .PP
- Reconstructing the data stream requires the score and type of the
- topmost block in the tree, the data chunk size, the pointer chunk size,
- and the data stream size.
- (From the data stream size and the chunk sizes we could derive the
- depth of the tree and thus the type of the topmost block, but it is convenient
- to allow trees that are deeper than necessary.)
- This information is kept in a 40-byte structure called a
- .CW VtEntry :
- .P1
- VtEntry:
- .ta +\w' 'u +\w' 'u
- gen[4] \fRgeneration number\fP
- psize[2] \fRsize of pointer blocks\fP
- dsize[2] \fRsize of data blocks\fP
- flags[1]
- zero[5]
- size[6] \fRlength of file\fP
- score[20] \fRscore of root block in tree\fP
- .P2
- (In this notation,
- .CW name[sz]
- indicates a
- .CW sz -byte
- field called
- .CW name .
- Integers are stored in big-endian order.
- .CW Size
- really is a 48-bit field.)
- .CW Flags
- is made up of the following bit fields.
- .P1
- .ta +\w' 'u +\w' 'u
- 0x01 VtEntryActive \fRentry is allocated\fP
- 0x02 VtEntryDir \fRentry describes a Venti directory (q.v.)\fP
- 0x1C VtEntryDepthMask \fRmask for tree depth\fP
- 0x20 VtEntryLocal \fRreserved (q.v.)\fP
- .P2
- .LP
- The depth of the described tree is stored in the 3 bits indicated:
- a tree with a topmost node of type
- .CW VtPointerType3
- has depth 4.
- .PP
- With
- .CW VtEntry
- we can build more complicated data structures,
- ones with multiple or nested data streams.
- A data stream consisting of
- .CW VtEntry
- structures is called a Venti directory.
- It is identical in structure to the Venti data stream
- we described earlier except that the bottom-level type is
- .CW VtDirType ,
- and
- the
- .CW VtEntry
- describing a Venti directory has the
- .CW VtEntryDir
- flag bit set.
- The
- .I dsize
- for a Venti directory
- is a multiple of 40 so that each data chunk holds
- an integer number of
- .CW VtEntry
- structures.
- By analogy with Venti directories,
- we call the original data stream a
- Venti file.
- Note that Venti files are assumed
- .I not
- to contain pointers to other Venti blocks.
- The only pointers to Venti blocks occur in
- .CW VtEntry
- structures in
- Venti directories
- (and in the internal hash tree structure of the
- individual files and directories).
- Note also that these directories are nothing more than pointer lists.
- In particular, there are no names or metadata like in a file system.
- .PP
- To make it easier to pass hierarchies between applications,
- the root of a hierarchy is described in a 300-byte structure
- called a
- .CW VtRoot :
- .P1
- VtRoot:
- .ta +\w' 'u +\w' 'u
- version[2] \f(CW2\fP
- name[128] \fRname of structure (just a comment)\fP
- type[128] \fRstring describing structure (\f(CWvac\fR)\f(CW
- score[20] \fRpointer to \f(CWVtDirType\fP block\f(CW
- blockSize[2] \fRmaximum block size in structure\fP
- prev[20] \fRprevious \f(CWVtRoot\fP in chain, if any\fP
- .P2
- .LP
- This structure is stored to Venti and its score is passed
- between applications, typically in the form
- ``\fItype\f(CW:\fIrootscore\fR,''
- where
- .I type
- is the type field from the
- .CW VtRoot
- structure, and
- .I rootscore
- is the score of the
- .CW VtRoot
- block.
- .CW VtRoot
- structures can be chained together using the
- .I prev
- field to encode an archival history
- of the data structure.
- .PP
- For example, a small Venti hierarchy might look like:
- .PS
- .ps 8
- .vs 10
- boxwid=0.1
- boxht=0.1
- f=0.9
- mb=0.16
- VtRoot: [
- right
- B1: box
- move right 0.1
- "\f(CWVtRoot\fP" ljust
- ]
- Root: [
- right
- B1: box fill f
- B2: box fill f
- B3: box fill f
- move right 0.1
- ] with .nw at VtRoot.sw+(0.2,-.1)
- Level1: [
- RootMeta: [
- box wid mb
- ]
- MetaSource: [
- right
- B1: box wid 5*mb
- ] with .nw at RootMeta.sw+(0,-.1)
- Source: [
- right
- B1: box fill f
- B2: box fill f
- B3: box fill f
- B4: box fill f
- B5: box fill f
- B6: box fill f
- B7: box fill f
- B8: box fill f
- ] with .nw at MetaSource.sw+(0,-.1)
- SB1: box invis at Source.B1
- SB2: box invis at Source.B2
- SB3: box invis at Source.B3
- ] with .nw at Root.sw+(0.4,-.1)
- Level2: [
- MetaSource: [
- right
- B1: box wid 5*mb
- ]
- Source: [
- right
- B1: box fill f
- B2: box fill f
- B3: box fill f
- B4: box fill f
- B5: box fill f
- B6: box fill f
- B7: box fill f
- B8: box fill f
- ] with .nw at MetaSource.sw+(0,-.1)
- File: box wid 0.8 with .nw at Source.sw+(0,-.1)
- ] with .nw at Level1.sw+(0.6,-.1)
- line -> from VtRoot.B1 down boxwid/2+0.1+boxwid/2 then to Root.w
- line -> from Root.B3 down boxwid/2+0.1+boxwid/2 then to Level1.RootMeta.w
- line -> from Root.B2 down boxwid/2+0.1+boxwid+0.1+boxwid/2 then to Level1.MetaSource.w
- line -> from Root.B1 down boxwid/2+0.1+boxwid+0.1+boxwid+0.1+boxwid/2 then to Level1.Source.w
- line -> from Level1.SB3 down boxwid/2+0.1+boxwid/2 then to Level2.MetaSource.w
- line -> from Level1.SB2 down boxwid/2+0.1+boxwid+0.1+boxwid/2 then to Level2.Source.w
- line -> from Level1.SB1 down boxwid/2+0.1+boxwid+0.1+boxwid+0.1+boxwid/2 then to Level2.File.w
- [
- KEY: box wid 1.5 invis "Key"
- line from KEY.sw to KEY.se
- k = -.1
- kk=0.5
- A: [
- box wid 4*boxwid
- "Venti file" ljust with .w at last box .w+(kk,0)
- ] with .nw at KEY.sw+(0,2*k)
- B: [
- box fill f
- "Venti entry (\f(CWVtEntry\fP)" ljust with .w at last box .w+(kk,0)
- ] with .nw at A.sw+(0,k)
- C: [
- right
- CC: box fill f
- box fill f
- box fill f
- box fill f
- "Venti directory" ljust with .w at CC.w+(kk,0)
- ] with .nw at B.sw+(0,k)
- D: [
- line -> right 3*boxwid
- "Venti pointer (score)" ljust with .w at last line .w+(kk, 0)
- ] with .nw at C.sw+(0,k)
- ] with .nw at VtRoot.nw+(3,0)
- .PE
- .LP
- Venti files are shown as white boxes, while directories are shown
- as shaded boxes. Each shaded square represents a
- .CW VtEntry .
- Arrows represent pointers from
- .CW VtEntry
- structures to other
- Venti files or directories.
- .PP
- The hierarchical structure provided by Venti files and directories
- can be used as the base for more complicated data structures.
- Because this structure captures all the information
- about pointers to other blocks, tools written to traverse
- Venti hierarchies can traverse the more complicated
- data structures as well.
- For example,
- .I venti/copy
- (see
- .I venti (1))
- copies a Venti hierarchy from one Venti server to another,
- given the root
- .CW VtEntry .
- Because the traditional file system described in later sections is
- layered on a Venti hierarchy,
- .I venti/copy
- can copy it without fully understanding its structure.
- .NH 1
- Vac file system format
- .HP
- The Venti archive format
- .I vac
- builds a traditional file system using a Venti hierarchy.
- Each vac file is implemented as a Venti file;
- each vac directory is implemented as a Venti
- directory and a Venti file to provide traditional file system metadata.
- The metadata is stored in a structure called a
- .CW DirEntry :
- .P1
- DirEntry:
- .ta +\w' 'u +\w' 'u
- magic[4] \f(CW0x1c4d9072\fP (DirMagic)\fP
- version[2] \f(CW9\fP
- elem[s] \fRname (final path element only)\fP
- entry[4] \fRentry number for Venti file or directory\fP
- gen[4] \fRgeneration number\fP
- mentry[4] \fRentry number for Venti file holding metadata\fP
- mgen[4] \fRgeneration number\fP
- qid[8] \fRunique file serial number\fP
- uid[s] \fRowner\fP
- gid[s] \fRgroup\fP
- mid[s] \fRlast modified by\fP
- mtime[4] \fRlast modification time\fP
- ctime[4] \fRcreation time\fP
- atime[4] \fRlast access time\fP
- mode[4] \fRmode bits\fP
- .P2
- The notation
- .CW name[s]
- denotes a string stored as a two-byte length
- and then that many bytes.
- The above describes Version 9 of the
- .CW DirEntry
- format. Versions 7 and 8 are very similar; they can be
- read by the current
- .I vac
- source code but are not written.
- Earlier versions were not widespread.
- A
- .CW DirEntry
- may be followed by optional extension sections, though none
- are currently used.
- The
- .CW mode
- bits include bits commonly used by
- Unix and Windows, in addition to those used by Plan 9.
- .PP
- The
- .CW entry
- field is an index into the parallel Venti directory.
- The
- .CW gen
- field must match the
- .CW gen
- field in the corresponding
- .CW VtEntry
- in the directory;
- it is used to detect
- stale indices.
- Similarly,
- .CW mentry
- and
- .CW mgen
- are the index and generation number
- for the metadata Venti file,
- if the
- .CW DirEntry
- describes a vac directory.
- .PP
- The relation between Venti files and directories and
- vac files and directories can be seen in this figure:
- .PS
- .ps 8
- .vs 10
- boxwid=0.1
- boxht=0.1
- f=0.9
- mb=0.16
- VtRoot: [
- right
- B1: box
- move right 0.1
- "\f(CWVtRoot\fP" ljust
- ]
- SuperRoot: [
- right
- B1: box fill f
- move right 0.1
- "fs root block" ljust
- ] with .nw at VtRoot.sw + (0.2, -.2)
- Root: [
- right
- B1: box fill f
- B2: box fill f
- B3: box fill f
- move right 0.1
- "root directory info block" ljust
- ] with .nw at SuperRoot.sw+(0.2, -.2)
- Level1: [
- RootMeta: [
- box wid mb
- move right 0.1
- "root metadata" ljust
- ]
- MetaSource: [
- right
- B1: box wid mb
- B2: box wid mb
- B3: box wid mb
- B4: box wid mb
- B5: box wid mb
- ] with .nw at RootMeta.sw+(0,-.2)
- MB1: box wid mb invis at MetaSource.B1
- MB2: box wid mb invis at MetaSource.B2
- MB3: box wid mb invis at MetaSource.B3
- MB4: box wid mb invis at MetaSource.B4
- MB5: box wid mb invis at MetaSource.B5
- Source: [
- right
- B1: box fill f
- B2: box fill f
- B3: box fill f
- B4: box fill f
- B5: box fill f
- B6: box fill f
- B7: box fill f
- B8: box fill f
- ] with .nw at MetaSource.sw+(0,-.1)
- SB1: box invis at Source.B1
- SB2: box invis at Source.B2
- SB3: box invis at Source.B3
- SB4: box invis at Source.B4
- SB5: box invis at Source.B5
- SB6: box invis at Source.B6
- SB7: box invis at Source.B7
- SB8: box invis at Source.B8
- ] with .nw at Root.sw+(0.4,-.2)
- Level2: [
- MetaSource: [
- right
- B1: box wid mb
- B2: box wid mb
- B3: box wid mb
- B4: box wid mb
- B5: box wid mb
- ]
- Source: [
- right
- B1: box fill f
- B2: box fill f
- B3: box fill f
- B4: box fill f
- B5: box fill f
- B6: box fill f
- B7: box fill f
- B8: box fill f
- ] with .nw at MetaSource.sw+(0,-.1)
- File: box wid 0.8 with .nw at Source.sw+(0,-.2)
- ] with .nw at Level1.sw+(0.6,-.2)
- line -> from VtRoot.B1 down boxwid/2+0.2+boxwid/2 then to SuperRoot.w
- line -> from SuperRoot.B1 down boxwid/2+0.2+boxwid/2 then to Root.w
- line -> from Root.B3 down boxwid/2+0.2+boxwid/2 then to Level1.RootMeta.w
- line -> from Root.B2 down boxwid/2+0.2+boxwid+0.2+boxwid/2 then to Level1.MetaSource.w
- line -> from Root.B1 down boxwid/2+0.2+boxwid+0.1+boxwid+0.2+boxwid/2 then to Level1.Source.w
- line -> from Level1.SB3 down boxwid/2+0.2+boxwid/2 then to Level2.MetaSource.w
- line -> from Level1.SB2 down boxwid/2+0.2+boxwid+0.1+boxwid/2 then to Level2.Source.w
- line -> from Level1.SB1 down boxwid/2+0.2+boxwid+0.1+boxwid+0.2+boxwid/2 then to Level2.File.w
- arrowwid = arrowwid/2
- arrowht = arrowht/2
- line -> from Level1.MB1 to Level1.SB1.n
- line -> from Level1.MB2 to Level1.SB2.n
- line -> from Level1.MB2 to Level1.SB3.n
- line -> from Level1.MB4 to Level1.SB7.n
- line -> from Level1.MB5 to Level1.SB5.n
- arrowwid = arrowwid * 2
- arrowht = arrowht * 2
- box dashed with .nw at Level1.MetaSource.nw+(-.05,.05) wid 0.8+.05*2 ht .3+.05*2
- box dashed with .nw at Level2.MetaSource.nw+(-.05,.05) wid 0.8+.05*2 ht .3+.05*2
- box dotted with .nw at Level2.File.nw+(-.05,.05) wid 0.8+0.05*2 ht .1+.05*2
- [
- KEY: box wid 1.5 invis "Key"
- line from KEY.sw to KEY.se
- k = -.1
- kk=0.5
- A: [
- box wid 4*boxwid
- "Venti file" ljust with .w at last box .w+(kk,0)
- ] with .nw at KEY.sw+(0,2*k)
- B: [
- box fill f
- "Venti entry (\f(CWEntry\fP)" ljust with .w at last box .w+(kk,0)
- ] with .nw at A.sw+(0,k)
- C: [
- right
- CC: box fill f
- box fill f
- box fill f
- box fill f
- "Venti directory" ljust with .w at CC.w+(kk,0)
- ] with .nw at B.sw+(0,k)
- D: [
- line -> right 3*boxwid
- "Venti pointer (score)" ljust with .w at last line .w+(kk, 0)
- ] with .nw at C.sw+(0,k)
- DD: [
- box dotted wid 4*boxwid
- "Vac file" ljust with .w at last box .w+(kk,0)
- ] with .nw at D.sw+(0,k)
- E: [
- box wid mb
- "Vac entry (\f(CWDirEntry\fP)" ljust with .w at last box .w+(kk,0)
- ] with .nw at DD.sw+(0,k)
- G: [
- box dashed wid 4*boxwid
- "Vac directory" ljust with .w at last box .w+(kk,0)
- ] with .nw at E.sw+(0,k)
- H: [
- arrowwid = arrowwid/2
- arrowht = arrowht/2
- line -> right 1.5*boxwid
- "Vac pointer (integer index)" ljust with .w at last line .w+(kk, 0)
- arrowwid = arrowwid * 2
- arrowht = arrowht * 2
- ] with .nw at G.sw+(0,k)
- ] with .nw at VtRoot.nw+(3,0)
- .PE
- .LP
- In reality, the story is slightly more complicated.
- The metadata file in a Vac directory
- is not just the concatenation of
- .CW DirEntry
- structures.
- Instead, it is the concatenation of
- .CW MetaBlocks .
- A
- .CW MetaBlock
- contains some number of
- .CW DirEntry
- structures along with a sorted index to make it easy
- to look for a particular
- .CW DirEntry
- by its
- .CW elem
- field.
- The details are in the source code.
- .PP
- As shown in the diagram,
- the root directory of the file system is summarized by
- three
- .CW VtEntry
- structures describing
- the Venti directory for the children of the root,
- the Venti file for the metadata describing the children of the root,
- and a Venti file holding metadata for the root directory itself.
- These
- .CW VtEntry
- structures are placed in a Venti directory of their own,
- described by the single
- .CW VtEntry
- in the
- root block.
- .NH 1
- Fossil file system format
- .HP
- Fossil uses the vac format, with some small changes.
- The changes only affect the data on the local disk; the data
- archived to Venti is exactly in vac format.
- .PP
- Blocks stored on local disk may contain scores pointing at local disk
- blocks or at Venti blocks.
- Local block addresses are stored as 20-byte scores in which the first 16 bytes
- are all zero and the last 4 bytes specify a block number in the disk.
- Before a block is archived, all the
- blocks it points to must be archived, and the local scores in the block
- must be changed to Venti scores.
- Using block addresses rather than content hashes for local data
- makes the local file system easier to manage: if a local block's contents
- change, the pointer to the block does not need to change.
- .NH 2
- Snapshots
- .HP
- Fossil is an archival file server.
- It takes periodic snapshots of the file system,
- which are made accessible through the file system.
- Specifically, the active file system is presented in
- .CW /active .
- Ephemeral snapshots (those that are kept on local disk and eventually deleted)
- are presented in
- \f(CW/snapshot/\fIyyyy\f(CW/\fImmdd\f(CW/\fIhhmm\fR,
- where
- .I yyyy
- is the full year,
- .I mm
- is the month number,
- .I dd
- is the day number,
- .I hh
- is the hour,
- and
- .I mm
- is the minute.
- Archival snapshots (those that are archived to Venti and persist forever)
- are presented in
- \f(CW/archive/\fIyyyy\f(CW/\fImmdds\fR,
- where
- .I yyyy ,
- .I mm ,
- and
- .I dd
- are year, month, and day as before,
- and
- .I s
- is a sequence number if more than one
- archival snapshot is done in a day.
- For the first snapshot,
- .I s
- is null.
- For the subsequent snapshots,
- .I s
- is
- .CW .1 ,
- .CW .2 ,
- .CW .3 ,
- etc.
- .PP
- To implement the snapshots, the file server maintains a
- current
- .I epoch
- for the active file system.
- Each local block has a label that records, among other things,
- the epoch in which the block was allocated.
- If a block was allocated in an epoch earlier than the current one,
- it is immutable and treated as copy-on-write.
- Taking a snapshot can be accomplished by
- recording the address of the current root block and then
- incrementing the epoch number.
- Notice that the copy-on-write method makes
- snapshots both time efficient and space efficient.
- The only time cost is waiting for all current file system
- requests to finish and then incrementing a counter.
- After a snapshot, blocks only get copied when they are
- next modified, so the per-snapshot
- space requirement is proportional
- to the amount of new data rather than the total
- size of the file system.
- .PP
- The blocks in the archival snapshots are moved to Venti,
- but the blocks in the ephemeral snapshots take up space
- in the local disk file.
- To allow reclamation of this disk space, the file system
- maintains a
- .I low
- .I epoch ,
- which is the epoch of the earliest ephemeral snapshot
- still available.
- Fossil only allows access to snapshots with epoch numbers
- between the
- low epoch and the current epoch
- (also called the high epoch).
- Incrementing the low epoch thus makes old
- snapshots inaccessible.
- The space required to store those snapshots can then
- be reclaimed, as described below.
- .NH 2
- Local blocks
- .HP
- The bulk of the local disk file is the local blocks.
- Each block has a 14-byte label associated with it, of the format:
- .P1
- Label:
- .ta +\w' 'u +\w' 'u
- state[1] \fRblock state\fP
- type[1] \fRblock type\fP
- epoch[4] \fRallocation epoch\fP
- epochClose[4] \fRclose epoch\fP
- tag[4] \fRrandom tag\fP
- .P2
- .LP
- The
- .CW type
- is an analogue of the block types described earlier,
- though different names are used, to distinguish between
- pointers blocks in a hash tree for a data stream
- and pointer blocks for a directory stream.
- The
- .CW epoch
- was mentioned in the last section.
- The other fields are explained below.
- .PP
- There are two distinguished blocks states
- .CW BsFree
- .CW 0x00 ) (
- and
- .CW BsBad
- .CW 0xFF ), (
- which mark blocks that are available for allocation
- and blocks that are bad and should be avoided.
- If
- .CW state
- is not one of these values, it is a bitwise
- .I or ' `
- of the following flags:
- .P1
- .ta +\w' 'u +\w' 'u
- 0x01 BsAlloc \fRblock is in use\fP
- 0x02 BsCopied \fRblock has been copied\fP
- 0x04 BsVenti \fRblock has been stored on Venti\fP
- 0x08 BsClosed \fRblock has been unlinked from active file system\fP
- .P2
- .LP
- The flags are explained as they arise in the discussions below.
- .PP
- It is convenient to store some extra fields in the
- .CW VtEntry
- structure when it describes a Venti file or directory
- stored on local disk.
- Specifically, we set the
- .CW VtEntryLocal
- flag bit
- and then use the bytes 7-16 of the score (which would
- otherwise be zero, since it is a local score) to hold these fields:
- .P1
- .ta +\w' 'u +\w' 'u
- archive[1] \fRboolean: this is an archival snapshot\fP
- snap[4] \fRepoch number if root of snapshot\fP
- tag[4] \fRrandom tag\fP
- .P2
- .LP
- The extended
- .CW VtEntry
- structure is called an
- .CW Entry .
- The
- .CW tag
- field
- in the
- .CW Label
- and the
- .CW Entry
- is used to identify dangling pointers or other file system corruption:
- all the local blocks in a hash tree must
- have tags matching the tag in the
- .CW Entry .
- If this
- .CW Entry
- points at the root of a snapshot,
- the
- .CW snap
- field is the epoch of the snapshot.
- If the snapshot is intended to be archived to Venti,
- the
- .CW archive
- field is non-zero.
- .NH 2
- Block reclamation
- .HP
- The blocks in the active file system form a tree: each
- block has only one parent.
- Once a copy-on-write block
- .I b
- is replaced by its copy, it is no longer
- needed by the active file system.
- At this point,
- .I b
- is unlinked from the active file system.
- We say that
- .I b
- is now
- .I closed :
- it is needed only for snapshots.
- When a block is closed, the
- .CW BsClosed
- bit is set in its state, and the current epoch (called the block's closing epoch)
- is stored in the
- .CW epochClose
- label field.
- (Open blocks have an
- .CW epochClose
- of
- .CW ~0 ).
- .PP
- A block is referenced by snapshots with epochs
- between the block's allocation epoch and its closing epoch.
- Once the file system's low epoch grows to be greater than or equal to the block's
- closing epoch, the block is no longer needed for any snapshots
- and can be reused.
- .PP
- In a typical configuration, where nightly archival snapshots
- are taken and written to Venti, it is desirable to reclaim
- the space occupied by now-archived blocks if possible.
- To do this, Fossil keeps track of whether the pointers
- in each block are unique to that block.
- When a block
- .I bb
- is allocated, a pointer to
- .I bb
- is written into exactly one active block (say,
- .I b ).
- In the absence of snapshots, the pointer to
- .I bb
- will remain unique to
- .I b ,
- so that if the pointer is zeroed,
- .I bb
- can be immediately reused.
- Snapshots complicate this invariant:
- when
- .I b
- is copied-on-write, all its pointers
- are no longer unique to it.
- At time of the copy, the
- .CW BsCopied
- state bit in the block's label
- is set to note the duplication of the pointers contained within.
- .NH 2
- Disk layout
- .HP
- The file system header describes the file system layout and has this format:
- .P1
- .ta +\w' 'u +\w' 'u
- Header:
- magic[4] \fR0x3776AE89 (HeaderMagic)\fP
- version[2] \fR1 (HeaderVersion)\fP
- blockSize[2] \fIfile system block size\fP
- super[4] \fRblock offset of super block\fP
- label[4] \fRblock offset of labels\fP
- data[4] \fRdata blocks\fP
- end[4] \fRend of file system\fP
- .P2
- .LP
- The corresponding file system layout is:
- .PS
- .ps 8
- .vs 9
- boxwid=0.75
- boxht=0.15
- Empty: box "empty" ht 0.25
- Header: box "header" with .n at Empty.s
- Empty2: box "empty" with .n at Header.s
- Super: box "super block" with .n at Empty2.s
- Label: box "label" "blocks" with .n at Super.s ht 0.25
- Data: box "data" "blocks" with .n at Label.s ht 0.3
- " 0" ljust at Empty.ne
- " 128kB" ljust at Header.ne
- " \f5super\fP \(mu \f(CWblockSize\fP" ljust at Super.ne
- " \f5label\fP \(mu \f(CWblockSize\fP" ljust at Label.ne
- " \f5data\fP \(mu \f(CWblockSize\fP" ljust at Data.ne
- " \f5end\fP \(mu \f(CWblockSize\fP" ljust at Data.se
- "" at (-1,0)
- "" at (6,0)
- .PE
- .LP
- The numbers to the right of the blocks are byte offsets
- of the boundaries.
- .LP
- The super block describes the file system itself and looks like:
- .P1
- .ta +\w' 'u +\w' 'u
- Super:
- magic[4] \fR0x2340A3B1 (SuperMagic)\fP
- version[2] \fR1 (SuperVersion)\fP
- epochLow[4] \fRfile system low epoch\fP
- epochHigh[4] \fRfile system high (active) epoch\fP
- qid[8] \fRnext qid to allocate\fP
- active[4] \fRdata block number: root of active file system\fP
- next[4] \fRdata block number: root of next file system to archive\fP
- current[4] \fRdata block number: root of file system currently being archived\fP
- last[20] \fRVenti score of last successful archive\fP
- name[128] \fRname of file system (just a comment)\fP
- .P2
- .LP
- .NH 1
- Fossil server
- .HP
- The Fossil server is a user-space program that runs on a standard Plan 9 kernel.
- .NH 2
- Process structure
- .PP
- The file server is structured as a set of processes synchronizing
- mostly through message passing along queues.
- The processes are given names, which can be seen in the output of
- .CW ps
- .CW -a .
- .PP
- .CW Listen
- processes announce on various network addresses.
- A
- .CW con
- process handles each incoming connection, reading 9P requests
- and adding them to a central message queue.
- .CW Msg
- processes remove 9P requests from the queue,
- handle them, and write the responses to the appropriate
- file descriptors.
- .PP
- The
- .CW disk
- process handles disk I/O requests made by the other processes.
- The
- .CW flush
- process writes dirty blocks from the in-memory block cache to disk.
- The
- .CW unlink
- process frees previously linked blocks once the blocks that point at them
- have been written to disk.
- .PP
- A
- .CW consI
- reads from each console file (typically a pipe posted in
- .CW /srv ),
- adding the typed characters to the input queue.
- The
- .CW cons
- process echoes input and runs the commands, saving
- output in a ring buffer.
- Because there is only one
- .CW cons
- process, only one console command may be executing at a time.
- A
- .CW consO
- process copies this ring buffer to each console file.
- .PP
- The
- .CW periodic
- process runs periodic events, like
- flushing the root metadata to disk or
- taking snapshots of the file system.
- .NH 2
- Block cache
- .HP
- Fossil maintains an in-memory block cache which
- holds both local disk blocks and Venti blocks.
- Cache eviction follows a least recently used policy.
- Dirty blocks are restricted to at most half the cache.
- This can be changed by editing
- .CW DirtyPercentage
- in
- .CW dat.h .
- .PP
- The block cache uses soft updates [1] to ensure that the on-disk
- file system is always self-consistent.
- Thus there is no
- .I halt
- console command
- and no need to check a file system
- that was shut down without halting.
- .NH 2
- Archiving
- .HP
- A background process writes blocks in archival snapshots to Venti.
- Although
- .CW /archive/\fIyyyy\fP/\fImmdds\fR
- is a copy of only
- .CW /active
- at the time of the snapshot,
- the archival process archives the
- entire file tree rather than just
- the subtree rooted at
- .CW /active .
- The snapshots
- .CW /snapshot/\fIyyyy\fP/\fImmdd\fP/\fIhhmm
- are stored as empty directories.
- Once all the blocks have been archived,
- a
- .CW VtRoot
- header for the file system is archived.
- The score of that header is recorded in
- .CW super.score
- and also printed on the file server console.
- The score can used by
- .I flfmt
- to restore a file system (see
- .I fossil (4)).
- .NH 2
- Contrast with the old file server
- .HP
- The most obvious difference between Fossil and the
- old Plan 9 file server [2] is that Fossil uses a Venti server as
- its archival storage in place of a WORM juke box.
- There are a few other architectural differences to be
- aware of.
- .PP
- Fossil is a user-level program run on a standard kernel.
- .PP
- Fossil does not have any way to concatenate, stripe, or
- mirror disk files. For functionality similar to the old file server's
- configuration strings, use the experimental file stack device
- (see
- .I fs (3)).
- .PP
- Fossil speaks only 9P2000. Old 9P (aka 9P1) is not supported.
- .PP
- ... XXX words about converting an old file system to fossil?
- .NH 1
- References
- .LP
- [1] Gregory R. Ganger, Marshall Kirk McKusick, Craig A. N. Soules,
- and Yale N. Patt.
- ``Soft Updates: A Solution to the Metadata Update Problem
- in File Systems,''
- .I "ACM Transactions on Computer Systems" ,
- Vol 18., No. 2, May 2000, pp. 127\-153.
- .LP
- [2] Sean Quinlan, ``A Cached WORM File System,''
- .I "Software\(emPractice and Experience" ,
- Vol 21., No 12., December 1991, pp. 1289\-1299.
- .LP
- [3] Sean Quinlan and Sean Dorward, ``Venti: A New Approach to Archival Storage,''
- .I "Usenix Conference on File and Storage Technologies" ,
- 2002.
|