venti.html 54 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539540541542543544545546547548549550551552553554555556557558559560561562563564565566567568569570571572573574575576577578579580581582583584585586587588589590591592593594595596597598599600601602603604605606607608609610611612613614615616617618619620621622623624625626627628629630631632633634635636637638639640641642643644645646647648649650651652653654655656657658659660661662663664665666667668669670671672673674675676677678679680681682683684685686687688689690691692693694695696697698699700701702703704705706707708709710711712713714715716717718719720721722723724725726727728729730731732733734735736737738739740741742743744745746747748749750751752753754755756757758759760761762763764765766767768769770771772773774775776777778779780781782783784785786787788789790791792793794795796797798799800801802803804805806807808809810811812813814815816817818819820821822823824825826827828829830831832833834835836837838839840841842843844845846847848849850851852853854855856857858859860861862863864865866867868869870871872873874875876877878879880881882883884885886887888889890891892893894895896897898899900901902903904905906907908909910911912913914915916917918919920921922923924925926927928929930931932933934935936937938939940941942943944945946947948949950951952953954955956957958959960961962963964965966967968969970971972973974975976977978979980981982983984985986987988989990991992993994995996997998999100010011002100310041005100610071008100910101011101210131014101510161017101810191020102110221023102410251026102710281029103010311032103310341035103610371038103910401041104210431044104510461047104810491050105110521053105410551056105710581059106010611062106310641065106610671068106910701071107210731074107510761077107810791080108110821083108410851086108710881089109010911092109310941095109610971098109911001101110211031104110511061107110811091110111111121113111411151116111711181119112011211122112311241125112611271128112911301131113211331134113511361137113811391140114111421143114411451146114711481149115011511152115311541155115611571158115911601161116211631164116511661167116811691170117111721173117411751176117711781179118011811182118311841185118611871188118911901191119211931194119511961197119811991200120112021203120412051206120712081209121012111212121312141215121612171218121912201221
  1. <!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN">
  2. <html>
  3. <head>
  4. <meta http-equiv="Content-Language" content="en-us">
  5. <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
  6. <title>Venti: a new approach to archival storage</title>
  7. </head>
  8. <body bgcolor="white">
  9. <h1>Venti: a new approach to archival storage</h1>
  10. <p>
  11. Sean Quinlan and Sean Dorward
  12. <br>
  13. Bell Labs, Lucent Technologies
  14. <p>
  15. <h1>Abstract</h1>
  16. <p>
  17. This paper describes a network storage system, called Venti, intended
  18. for archival data. In this system, a unique hash of a block's
  19. contents acts as the block identifier for read and write operations.
  20. This approach enforces a write-once policy, preventing accidental or
  21. malicious destruction of data. In addition, duplicate copies of a
  22. block can be coalesced, reducing the consumption of storage and
  23. simplifying the implementation of clients. Venti is a building block
  24. for constructing a variety of storage applications such as logical
  25. backup, physical backup, and snapshot file systems.
  26. <p>
  27. We have built a prototype of the system and present some preliminary
  28. performance results. The system uses magnetic disks as the storage
  29. technology, resulting in an access time for archival data that is
  30. comparable to non-archival data. The feasibility of the write-once
  31. model for storage is demonstrated using data from over a decade's use
  32. of two Plan 9 file systems.
  33. <p>
  34. <h1>1. Introduction</h1>
  35. <p>
  36. Archival storage is a second class citizen. Many computer
  37. environments provide access to a few recent versions of the
  38. information stored in file systems and databases, though this access
  39. can be tedious and may require the assistance of a system
  40. administrator. Less common is the ability for a user to examine data
  41. from last month or last year or last decade. Such a feature may not
  42. be needed frequently, but when it is needed it is often crucial.
  43. <p>
  44. The growth in capacity of storage technologies exceeds the ability of
  45. many users to generate data, making it practical to archive data in
  46. perpetuity. Plan 9, the computing environment that the authors use,
  47. includes a file system that stores archival data to an optical jukebox
  48. [16, 17]. Ken Thompson observed that, for our usage patterns, the
  49. capacity of the jukebox could be considered infinite. In the time it
  50. took for us to fill the jukebox, the improvement in technology would
  51. allow us to upgrade to a new jukebox with twice the capacity.
  52. <p>
  53. Abundant storage suggests that an archival system impose a write-once
  54. policy. Such a policy prohibits either a user or administrator from
  55. deleting or modifying data once it is stored. This approach greatly
  56. reduces the opportunities for accidental or malicious data loss and
  57. simplifies the system's implementation.
  58. <p>
  59. Moreover, our experience with Plan 9 is that a write-once policy
  60. changes the way one views storage. Obviously, some data is temporary,
  61. derivative, or so large that it is either undesirable or impractical
  62. to retain forever and should not be archived. However, once it is
  63. decided that the data is worth keeping, the resources needed to store
  64. the data have been consumed and cannot be reclaimed. This eliminates
  65. the task of periodically "cleaning up" and deciding whether the data
  66. is still worth keeping. More thought is required before storing the
  67. data to a write-once archive, but as the cost of storage continues to
  68. fall, this becomes an easy decision.
  69. <p>
  70. This paper describes the design and implementation of an archival
  71. server, called Venti. The goal of Venti is to provide a write-once
  72. archival repository that can be shared by multiple client machines and
  73. applications. In addition, by using magnetic disks as the primary
  74. storage technology, the performance of the system approaches that of
  75. non-archival storage.
  76. <p>
  77. <h1>2. Background</h1>
  78. <p>
  79. A prevalent form of archival storage is the regular backup of data to
  80. magnetic tape [15]. A typical scenario is to provide backup as a
  81. central service for a number of client machines. Client software
  82. interfaces with a database or file system and determines what data to
  83. back up. The data is copied from the client to the tape device, often
  84. over a network, and a record of what was copied is stored in a catalog
  85. database.
  86. <p>
  87. Restoring data from a tape backup system can be tedious and error
  88. prone. The backup system violates the access permission of the file
  89. system, requiring a system administrator or privileged software to
  90. perform the task. Since they are tedious, restore operations are
  91. infrequent and problems with the process may go undetected. Potential
  92. sources of error abound: tapes are mislabeled or reused or lost,
  93. drives wander out of alignment and cannot read their old tapes,
  94. technology becomes obsolete.
  95. <p>
  96. For tape backup systems, a tradeoff exists between the performance of
  97. backup and restore operations [1]. A full backup simplifies the
  98. process of restoring data since all the data is copied to a continuous
  99. region on the tape media. For large file systems and databases,
  100. incremental backups are more efficient to generate, but such backups
  101. are not self-contained; the data for a restore operation is scattered
  102. across multiple incremental backups and perhaps multiple tapes. The
  103. conventional solution is to limit the extent of this scattering by
  104. performing a full backup followed by a small number of incremental
  105. backups.
  106. <p>
  107. File systems such as Plan 9 [16, 17], WAFL [5], and AFS [7] provide a
  108. more unified approach to the backup problem by implementing a snapshot
  109. feature. A snapshot is a consistent read-only view of the file system
  110. at some point in the past. The snapshot retains the file system
  111. permissions and can be accessed with standard tools (ls, cat, cp,
  112. grep, diff) without special privileges or assistance from an
  113. administrator. In our experience, snapshots are a relied-upon and
  114. frequently-used resource because they are always available and easy to
  115. access.
  116. <p>
  117. Snapshots avoid the tradeoff between full and incremental backups.
  118. Each snapshot is a complete file system tree, much like a full backup.
  119. The implementation, however, resembles an incremental backup because
  120. the snapshots and the active file system share any blocks that remain
  121. unmodified; a snapshot only requires additional storage for the blocks
  122. that have changed. To achieve reasonable performance, the device that
  123. stores the snapshots must efficiently support random access, limiting
  124. the suitability of tape storage for this approach.
  125. <p>
  126. In the WAFL and AFS systems, snapshots are ephemeral; only a small
  127. number of recent versions of the file system are retained. This
  128. policy is reasonable since the most recent versions of files are the
  129. most useful. For these systems, archival storage requires an
  130. additional mechanism such as tape backup.
  131. <p>
  132. The philosophy of the Plan 9 file system is that random access storage
  133. is sufficiently cheap that it is feasible to retain snapshots
  134. permanently. The storage required to retain all daily snapshots of a
  135. file system is surprisingly modest; later in the paper we present
  136. statistics for two file servers that have been in use over the last 10
  137. years.
  138. <p>
  139. Like Plan 9, the Elephant file system [18] retains many versions of
  140. data. This system allows a variety of storage reclamation policies
  141. that determine when a version of a file should be deleted. In
  142. particular, "landmark" versions of files are retained permanently and
  143. provide an archival record.
  144. <p>
  145. <h1>3. The Venti Archival Server</h1>
  146. <p>
  147. Venti is a block-level network storage system intended for archival
  148. data. The interface to the system is a simple protocol that enables
  149. client applications to read and write variable sized blocks of data.
  150. Venti itself does not provide the services of a file or backup system,
  151. but rather the backend archival storage for these types of
  152. applications.
  153. <p>
  154. Venti identifies data blocks by a hash of their contents. By using a
  155. collision-resistant hash function with a sufficiently large output, it
  156. is possible to consider the hash of a data block as unique. Such a
  157. unique hash is called the fingerprint of a block and can be used as
  158. the address for read and write operations. This approach results in a
  159. storage system with a number of interesting properties.
  160. <p>
  161. As blocks are addressed by the fingerprint of their contents, a block
  162. cannot be modified without changing its address; the behavior is
  163. intrinsically write-once. This property distinguishes Venti from most
  164. other storage systems, in which the address of a block and its
  165. contents are independent.
  166. <p>
  167. Moreover, writes are idempotent. Multiple writes of the same data can
  168. be coalesced and do not require additional storage space. This
  169. property can greatly increase the effective storage capacity of the
  170. server since it does not rely on the behavior of client applications.
  171. For example, an incremental backup application may not be able to
  172. determine exactly which blocks have changed, resulting in unnecessary
  173. duplication of data. On Venti, such duplicate blocks will be
  174. discarded and only one copy of the data will be retained. In fact,
  175. replacing the incremental backup with a full backup will consume the
  176. same amount of storage. Even duplicate data from different
  177. applications and machines can be eliminated if the clients write the
  178. data using the same block size and alignment.
  179. <p>
  180. The hash function can be viewed as generating a universal name space
  181. for data blocks. Without cooperating or coordinating, multiple
  182. clients can share this name space and share a Venti server. Moreover,
  183. the block level interface places few restrictions on the structures
  184. and format that clients use to store their data. In contrast,
  185. traditional backup and archival systems require more centralized
  186. control. For example, backup systems include some form of job
  187. scheduler to serialize access to tape devices and may only support a
  188. small number of predetermined data formats so that the catalog system
  189. can extract pertinent meta-data.
  190. <p>
  191. Venti provides inherent integrity checking of data. When a block is
  192. retrieved, both the client and the server can compute the fingerprint
  193. of the data and compare it to the requested fingerprint. This
  194. operation allows the client to avoid errors from undetected data
  195. corruption and enables the server to identify when error recovery is
  196. necessary.
  197. <p>
  198. Using the fingerprint of a block as its identity facilitates features
  199. such as replication, caching, and load balancing. Since the contents
  200. of a particular block are immutable, the problem of data coherency is
  201. greatly reduced; a cache or a mirror cannot contain a stale or out of
  202. date version of a block.
  203. <p>
  204. <h2>3.1. Choice of Hash Function</h2>
  205. <p>
  206. The design of Venti requires a hash function that generates a unique
  207. fingerprint for every data block that a client may want to store.
  208. Obviously, if the size of the fingerprint is smaller than the size of
  209. the data blocks, such a hash function cannot exist since there are
  210. fewer possible fingerprints than blocks. If the fingerprint is large
  211. enough and randomly distributed, this problem does not arise in
  212. practice. For a server of a given capacity, the likelihood that two
  213. different blocks will have the same hash value, also known as a
  214. collision, can be determined. If the probability of a collision is
  215. vanishingly small, we can be confident that each fingerprint is
  216. unique.
  217. <p>
  218. It is desirable that Venti employ a cryptographic hash function. For
  219. such a function, it is computationally infeasible to find two distinct
  220. inputs that hash to the same value [10]. This property is important
  221. because it prevents a malicious client from intentionally creating
  222. blocks that violate the assumption that each block has a unique
  223. fingerprint. As an additional benefit, using a cryptographic hash
  224. function strengthens a client's integrity check, preventing a
  225. malicious server from fulfilling a read request with fraudulent data.
  226. If the fingerprint of the returned block matches the requested
  227. fingerprint, the client can be confident the server returned the
  228. original data.
  229. <p>
  230. Venti uses the Sha1 hash function [13] developed by the US National
  231. Institute for Standards and Technology (NIST). Sha1 is a popular hash
  232. algorithm for many security systems and, to date, there are no known
  233. collisions. The output of Sha1 is a 160 bit (20 byte) hash value.
  234. Software implementations of Sha1 are relatively efficient; for
  235. example, a 700Mhz Pentium 3 can compute the Sha1 hash of 8 Kbyte data
  236. blocks in about 130 microseconds, a rate of 60 Mbytes per second.
  237. <p>
  238. Are the 160 bit hash values generated by Sha1 large enough to ensure
  239. the fingerprint of every block is unique? Assuming random hash values
  240. with a uniform distribution, a collection of n different data blocks
  241. and a hash function that generates b bits, the probability p that
  242. there will be one or more collisions is bounded by the number of pairs
  243. of blocks multiplied by the probability that a given pair will
  244. collide, i.e.
  245. <p>
  246. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  247. <img src="probablity.gif" ALT="probablity">
  248. <p>
  249. Today, a large storage system may contain a petabyte (10^15 bytes) of data.
  250. Consider an even larger system that contains an exabyte (10^18 bytes)
  251. stored as 8 Kbyte blocks (~10^14 blocks). Using the Sha1 hash function, the
  252. probability of a collision is less than 10^-20. Such a scenario seems
  253. sufficiently unlikely that we ignore it and use the Sha1 hash as a
  254. unique identifier for a block. Obviously, as storage technology
  255. advances, it may become feasible to store much more than an exabyte,
  256. at which point it maybe necessary to move to a larger hash function.
  257. NIST has already proposed variants of Sha1 that produce 256, 384, and
  258. 512 bit results [14]. For the immediate future, however, Sha1 is a
  259. suitable choice for generating the fingerprint of a block.
  260. <p>
  261. <h2>3.2. Choice of Storage Technology</h2>
  262. <p>
  263. When the Plan 9 file system was designed in 1989, optical jukeboxes
  264. offered high capacity with respectable random access performance and
  265. thus were an obvious candidate for archival storage. The last decade,
  266. however, has seen the capacity of magnetic disks increase at a far
  267. faster rate than optical technologies [20]. Today, a disk array costs
  268. less than the equivalent capacity optical jukebox and occupies less
  269. physical space. Disk technology is even approaching tape in cost per
  270. bit.
  271. <p>
  272. Magnetic disk storage is not as stable or permanent as optical media.
  273. Reliability can be improved with technology such as RAID, but unlike
  274. write-once optical disks, there is little protection from erasure due
  275. to failures of the storage server or RAID array firmware. This issue
  276. is discussed in Section 7.
  277. <p>
  278. Using magnetic disks for Venti has the benefit of reducing the
  279. disparity in performance between conventional and archival storage.
  280. Operations that previously required data to be restored to magnetic
  281. disk can be accomplished directly from the archive. Similarly, the
  282. archive can contain the primary copy of often-accessed read-only data.
  283. In effect, archival data need not be further down the storage
  284. hierarchy; it is differentiated by the write-once policy of the
  285. server.
  286. <p>
  287. <h1>4. Applications</h1>
  288. <p>
  289. Venti is a building block on which to construct a variety of storage
  290. applications. Venti provides a large repository for data that can be
  291. shared by many clients, much as tape libraries are currently the
  292. foundation of many centralized backup systems. Applications need to
  293. accommodate the unique properties of Venti, which are different from
  294. traditional block level storage devices, but these properties enable a
  295. number of interesting features.
  296. <p>
  297. Applications use the block level service provided by Venti to store
  298. more complex data structures. Data is divided into blocks and written
  299. to the server. To enable this data to be retrieved, the application
  300. must record the fingerprints of these blocks. One approach is to pack
  301. the fingerprints into additional blocks, called pointer blocks, that
  302. are also written to the server, a process that can be repeated
  303. recursively until a single fingerprint is obtained. This fingerprint
  304. represents the root of a tree of blocks and corresponds to a
  305. hierarchical hash of the original data.
  306. <p>
  307. A simple data structure for storing a linear sequence of data blocks
  308. is shown in Figure 1. The data blocks are located via a fixed depth
  309. tree of pointer blocks which itself is addressed by a root
  310. fingerprint. Applications can use such a structure to store a single
  311. file or to mimic the behavior of a physical device such as a tape or a
  312. disk drive. The write-once nature of Venti does not allow such a tree
  313. to be modified, but new versions of the tree can be generated
  314. efficiently by storing the new or modified data blocks and reusing the
  315. unchanged sections of the tree as depicted in Figure 2.
  316. <p>
  317. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  318. <img src="SimpleTree.gif" ALT="simple tree">
  319. <p>
  320. Figure 1. A tree structure for storing a linear sequence of blocks
  321. <p>
  322. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  323. <img src="ModifiedTree.gif" ALT="modified tree">
  324. <p>
  325. Figure 2. Build a new version of the tree.
  326. <p>
  327. By mixing data and fingerprints in a block, more complex data
  328. structures can be constructed. For example, a structure for storing a
  329. file system may include three types of blocks: directory, pointer, and
  330. data. A directory block combines the meta information for a file and
  331. the fingerprint to a tree of data blocks containing the file's
  332. contents. The depth of the tree can be determined from the size of
  333. the file, assuming the pointer and data blocks have a fixed size.
  334. Other structures are obviously possible. Venti's block-level
  335. interface leaves the choice of format to client applications and
  336. different data structures can coexist on a single server.
  337. <p>
  338. The following sections describes three applications that use Venti as
  339. an archival data repository: a user level archive utility called vac,
  340. a proposal for a physical level backup utility, and our preliminary
  341. work on a new version of the Plan 9 file system.
  342. <p>
  343. <h2>4.1. Vac</h2>
  344. <p>
  345. Vac is an application for storing a collection of files and
  346. directories as a single object, similar in functionality to the
  347. utilities tar and zip. With vac, the contents of the selected files
  348. are stored as a tree of blocks on a Venti server. The root
  349. fingerprint for this tree is written to a vac archive file specified
  350. by the user, which consists of an ASCII representation of the 20 byte
  351. root fingerprint plus a fixed header string, and is always 45 bytes
  352. long. A corresponding program, called unvac, enables the user to
  353. restore files from a vac archive. Naturally, unvac requires access to
  354. the Venti server that contains the actual data, but in most situations
  355. this is transparent. For a user, it appears that vac compresses any
  356. amount of data down to 45 bytes.
  357. <p>
  358. An important attribute of vac is that it writes each file as a
  359. separate collection of Venti blocks, thus ensuring that duplicate
  360. copies of a file will be coalesced on the server. If multiple users
  361. vac the same data, only one copy will be stored on the server.
  362. Similarly, a user may repeatedly vac a directory over time and even if
  363. the contents of the directory change, the additional storage consumed
  364. on the server will be related to the extent of the changes rather than
  365. the total size of the contents. Since Venti coalesces data at the
  366. block level, even files that change may share many blocks with
  367. previous versions and thus require little space on the server; log and
  368. database files are good examples of this scenario.
  369. <p>
  370. On many Unix systems, the dump utility is used to back up file
  371. systems. Dump has the ability to perform incremental backups of data;
  372. a user specifies a dump level, and only files that are new or have
  373. changed since the last dump at this level are written to the archive.
  374. To implement incremental backups, dump examines the modified time
  375. associated with each file, which is an efficient method of filtering
  376. out the unchanged files.
  377. <p>
  378. Vac also implements an incremental option based on the file
  379. modification times. The user specifies an existing vac file and this
  380. archive is used to reduce the number of blocks written to the Venti
  381. server. For each file, vac examines the modified time in both the
  382. file system and the vac archive. If they are the same, vac copies the
  383. fingerprint for the file from the old archive into the new archive.
  384. Copying just the 20-byte fingerprint enables the new archive to
  385. include the entire file without reading the data from the file system
  386. nor writing the data across the network to the Venti server. In
  387. addition, unlike an incremental dump, the resulting archive will be
  388. identical to an archive generated without the incremental option; it
  389. is only a performance improvement. This means there is no need to
  390. have multiple levels of backups, some incremental, some full, and so
  391. restore operations are greatly simplified.
  392. <p>
  393. A variant of the incremental option improves the backup of files
  394. without reference to modification times. As vac reads a file, it
  395. computes the fingerprint for each block. Concurrently, the pointer
  396. blocks of the old archive are examined to determine the fingerprint
  397. for the block at the same offset in the old version of the file. If
  398. the fingerprints are the same, the block does not need to be written
  399. to Venti. Instead, the fingerprint can simply be copied into the
  400. appropriate pointer block. This optimization reduces the number of
  401. writes to the Venti server, saving both network and disk bandwidth.
  402. Like the file level optimization above, the resulting vac file is no
  403. different from the one produced without this optimization. It does,
  404. however, require the data for the file to be read and is only
  405. effective if there are a significant number of unchanged blocks.
  406. <p>
  407. <h2>4.2. Physical backup</h2>
  408. <p>
  409. Utilities such as vac, tar, and dump archive data at the file or
  410. logical level: they walk the file hierarchy converting both data and
  411. meta-data into their own internal format. An alternative approach is
  412. block-level or physical backup, in which the disk blocks that make up
  413. the file system are directly copied without interpretation. Physical
  414. backup has a number of benefits including simplicity and potentially
  415. much higher throughput [8]. A physical backup utility for file
  416. systems that stores the resulting data on Venti appears attractive,
  417. though we have not yet implemented such an application.
  418. <p>
  419. The simplest form of physical backup is to copy the raw contents of
  420. one or mores disk drives to Venti. The backup also includes a tree of
  421. pointer blocks, which enables access to the data blocks. Like vac,
  422. the end result is a single fingerprint representing the root of the
  423. tree; that fingerprint needs to be recorded outside of Venti.
  424. <p>
  425. Coalescing duplicate blocks is the main advantage of making a physical
  426. backup to Venti rather than copying the data to another storage medium
  427. such as tape. Since file systems are inherently block based, we
  428. expect coalescing to be effective. Not only will backups of a file
  429. system over time share many unchanged blocks, but even file systems
  430. for different machines that are running the same operating system may
  431. have many blocks in common. As with vac, the user sees a full backup
  432. of the device, while retaining the storage space advantages of an
  433. incremental backup.
  434. <p>
  435. One enhancement to physical backup is to copy only blocks that are
  436. actively in use in the file system. For most file system formats it
  437. is relatively easy to determine if a block is in use or free without
  438. walking the file system hierarchy. Free blocks generally contain the
  439. remnants of temporary files that were created and removed in the time
  440. between backups and it is advantageous not to store such blocks. This
  441. optimization requires that the backup format be able to represent
  442. missing blocks, which can easily be achieved on Venti by storing a
  443. null value for the appropriate entry in the pointer tree.
  444. <p>
  445. The random access performance of Venti is sufficiently good that it is
  446. possible to use a physical backup without first restoring it to disk.
  447. With operating system support, it is feasible to directly mount a
  448. backup file system image from Venti. Access to this file system is
  449. read only, but it provides a natural method of restoring a subset of
  450. files. For situations where a full restore is required, it might be
  451. possible to do this restore in a lazy fashion, copying blocks from
  452. Venti to the file system as needed, instead of copying the entire
  453. contents of the file system before resuming normal operation.
  454. <p>
  455. The time to perform a physical backup can be reduced using a variety
  456. of incremental techniques. Like vac, the backup utility can compute
  457. the fingerprint of each block and compare this fingerprint with the
  458. appropriate entry in the pointer tree of a previous backup. This
  459. optimization reduces the number of writes to the Venti server. If the
  460. file system provides information about which blocks have changed, as
  461. is the case with WAFL, the backup utility can avoid even reading the
  462. unchanged blocks. Again, a major advantage of using Venti is that the
  463. backup utility can implement these incremental techniques while still
  464. providing the user with a full backup. The backup utility writes the
  465. new blocks to the Venti server and constructs a pointer tree with the
  466. appropriate fingerprint for the unchanged blocks.
  467. <p>
  468. <h2>4.3. Plan 9 File system</h2>
  469. <p>
  470. When combined with a small amount of read/write storage, Venti can be
  471. used as the primary location for data rather than a place to store
  472. backups. A new version of the Plan 9 file system, which we are
  473. developing, exemplifies this approach.
  474. <p>
  475. Previously, the Plan 9 file system was stored on a combination of
  476. magnetic disks and a write-once optical jukebox. The jukebox
  477. furnishes the permanent storage for the system, while the magnetic
  478. disks act as a cache for the jukebox. The cache provides faster file
  479. access and, more importantly, accumulates the changes to the file
  480. system during the period between snapshots. When a snapshot is taken,
  481. new or modified blocks are written from the disk cache to the jukebox.
  482. <p>
  483. The disk cache can be smaller than the active file system, needing
  484. only to be big enough to contain the daily changes to the file system.
  485. However, accesses that miss the cache are significantly slower since
  486. changing platters in the jukebox takes several seconds. This
  487. performance penalty makes certain operations on old snapshots
  488. prohibitively expensive. Also, on the rare occasions when the disk
  489. cache has been reinitialized due to corruption, the file server spends
  490. several days filling the cache before performance returns to normal.
  491. <p>
  492. The new version of the Plan 9 file system uses Venti instead of an
  493. optical jukebox as its storage device. Since the performance of Venti
  494. is comparable to disk, this substitution equalizes access both to the
  495. active and to the archival view of the file system. It also allows
  496. the disk cache to be quite small; the cache accumulates changes to the
  497. file system between snapshots, but does not speed file access.
  498. <p>
  499. <h1>5. Implementation</h1>
  500. <p>
  501. We have implemented a prototype of Venti. The implementation uses an
  502. append-only log of data blocks and an index that maps fingerprints to
  503. locations in this log. It also includes a number of features that
  504. improve robustness and performance. This section gives a brief
  505. overview of the implementation. Figure 3 shows a block diagram of the
  506. server.
  507. <p>
  508. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  509. <img src="Block.gif" ALT="block diagram">
  510. <p>
  511. Figure 3. A block diagram of the Venti prototype.
  512. <p>
  513. Since Venti is intended for archival storage, one goal of our
  514. prototype is robustness. The approach we have taken is to separate
  515. the storage of data blocks from the index used to locate a block. In
  516. particular, blocks are stored in an append-only log on a RAID array of
  517. disk drives. The simplicity of the append-only log structure
  518. eliminates many possible software errors that might cause data
  519. corruption and facilitates a variety of additional integrity
  520. strategies. A separate index structure allows a block to be
  521. efficiently located in the log; however, the index can be regenerated
  522. from the data log if required and thus does not have the same
  523. reliability constraints as the log itself.
  524. <p>
  525. The structure of the data log is illustrated in Figure 4. To ease
  526. maintenance, the log is divided into self-contained sections called
  527. arenas. Each arena contains a large number of data blocks and is
  528. sized to facilitate operations such as copying to removable media.
  529. Within an arena is a section for data bocks that is filled in an
  530. append-only manner. In Venti, data blocks are variable sized, up to a
  531. current limit of 52 Kbytes, but since blocks are immutable they can be
  532. densely packed into an arena without fragmentation.
  533. <p>
  534. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  535. <img src="LogFormat.gif" ALT="log format">
  536. <p>
  537. Figure 4. The format of the data log.
  538. <p>
  539. Each block is prefixed by a header that describes the contents of the
  540. block. The primary purpose of the header is to provide integrity
  541. checking during normal operation and to assist in data recovery. The
  542. header includes a magic number, the fingerprint and size of the block,
  543. the time when the block was first written, and identity of the user
  544. that wrote it. The header also includes a user-supplied type
  545. identifier, which is explained in Section 7. Note, only one copy of a
  546. given block is stored in the log, thus the user and wtime fields
  547. correspond to the first time the block was stored to the server.
  548. <p>
  549. Before storing a block in the log, an attempt is made to compress its
  550. contents. The inclusion of data compression increases the effective
  551. capacity of the archive and is simple to add given the log structure.
  552. Obviously, some blocks are incompressible. The encoding field in the
  553. block header indicates whether the data was compressed and, if so, the
  554. algorithm used. The esize field indicates the size of the data after
  555. compression, enabling the location of the next block in the arena to
  556. be determined. The downside of using compression is the computational
  557. cost, typically resulting in a decrease in the rate that blocks can be
  558. stored and retrieved. Our prototype uses a custom Lempel-Ziv '77 [21]
  559. algorithm that is optimized for speed. Compression is not a
  560. performance bottleneck for our existing server. Future
  561. implementations may benefit from hardware solutions.
  562. <p>
  563. In addition to a log of data blocks, an arena includes a header, a
  564. directory, and a trailer. The header identifies the arena. The
  565. directory contains a copy of the block header and offset for every
  566. block in the arena. By replicating the headers of all the blocks in
  567. one relatively small part of the arena, the server can rapidly check
  568. or rebuild the system's global block index. The directory also
  569. facilitates error recovery if part of the arena is destroyed or
  570. corrupted. The trailer summarizes the current state of the arena
  571. itself, including the number of blocks and the size of the log.
  572. Within the arena, the data log and the directory start at opposite
  573. ends and grow towards each other. When the arena is filled, it is
  574. marked as sealed, and a fingerprint is computed for the contents of
  575. the entire arena. Sealed arenas are never modified.
  576. <p>
  577. The basic operation of Venti is to store and retrieve blocks based on
  578. their fingerprints. A fingerprint is 160 bits long, and the number of
  579. possible fingerprints far exceeds the number of blocks stored on a
  580. server. The disparity between the number of fingerprints and blocks
  581. means it is impractical to map the fingerprint directly to a location
  582. on a storage device. Instead, we use an index to locate a block
  583. within the log.
  584. <p>
  585. We implement the index using a disk-resident hash table as illustrated
  586. in Figure 5. The index is divided into fixed-sized buckets, each of
  587. which is stored as a single disk block. Each bucket contains the
  588. index map for a small section of the fingerprint space. A hash
  589. function is used to map fingerprints to index buckets in a roughly
  590. uniform manner, and then the bucket is examined using binary search.
  591. If provisioned with sufficient buckets, the index hash table will be
  592. relatively empty and bucket overflows will be extremely rare. If a
  593. bucket does overflow, the extra entries are placed in an adjacent
  594. bucket. This structure is simple and efficient, requiring one disk
  595. access to locate a block in almost all cases.
  596. <p>
  597. <p>
  598. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  599. <img src="Index.gif" ALT="index format">
  600. <p>
  601. Figure 5. Format of the index.
  602. <p>
  603. The need to go through an index is the main performance penalty for
  604. Venti compared to a conventional block storage device. Our prototype
  605. uses three techniques to increase the performance: caching, striping,
  606. and write buffering.
  607. <p>
  608. The current implementation has two important caches of approximately
  609. equal size: a block cache and an index cache. A hit in the block
  610. cache returns the data for that fingerprint, bypassing the both the
  611. index lookup and access to the data log. Hits in the index cache
  612. eliminate only the index lookup, but the entries are much smaller and
  613. the hit rate correspondingly higher.
  614. <p>
  615. Unfortunately, these caches do not speed the process of storing a new
  616. block to Venti. The server must check that the block is not a
  617. duplicate by examining the index. If the block is not contained on
  618. the server, it will obviously not be in any cache. Since the
  619. fingerprint of the block contains no internal structure, the location
  620. of a fingerprint in the index is essentially random. Furthermore, the
  621. archival nature of Venti means the entire index will not fit in memory
  622. because of the large number of blocks. Combining these factors means
  623. that the write performance of Venti will be limited to the random IO
  624. performance of the index disk, which for current technology is a few
  625. hundred accesses per second. By striping the index across multiple
  626. disks, however, we get a linear speedup. This requires a sufficient
  627. number of concurrent accesses, which we assure by buffering the writes
  628. before accessing the index.
  629. <p>
  630. The prototype Venti server is implemented for the Plan 9 operating
  631. system in about 10,000 lines of C. The server runs on a dedicated dual
  632. 550Mhz Pentium III processor system with 2 Gbyte of memory and is
  633. accessed over a 100Mbs Ethernet network. The data log is stored on a
  634. 500 Gbyte MaxTronic IDE Raid 5 Array and the index resides on a string
  635. of 8 Seagate Cheetah 18XL 9 Gbyte SCSI drives.
  636. <p>
  637. <h1>6. Performance</h2>
  638. <p>
  639. Table 1 gives the preliminary performance results for read and write
  640. operations in a variety of situations. For comparison, we include the
  641. SCSI performance of the RAID array. Although the performance is still
  642. several times slower than directly accessing the disk, we believe the
  643. results are promising and will improve as the system matures.
  644. <p>
  645. Table 1. The performance of read and write operations in Mbytes/s for 8 Kbyte blocks.
  646. <p>
  647. <p>
  648. <table align=center>
  649. <tr>
  650. <th></th>
  651. <th width=150>sequential reads</th>
  652. <th width=150>random reads</th>
  653. <th width=150>virgin writes</th>
  654. <th width=150>duplicate writes</th>
  655. </tr>
  656. <tr>
  657. <td>uncached</td>
  658. <td align=center>0.9</td>
  659. <td align=center>0.4</td>
  660. <td align=center>3.7</td>
  661. <td align=center>5.6</td>
  662. </tr>
  663. <tr>
  664. <td>index cache</td>
  665. <td align=center>4.2</td>
  666. <td align=center>0.7</td>
  667. <td align=center>-</td>
  668. <td align=center>6.2</td>
  669. </tr>
  670. <tr>
  671. <td>block cache</td>
  672. <td align=center>6.8</td>
  673. <td align=center>-</td>
  674. <td align=center>-</td>
  675. <td align=center>6.5</td>
  676. </tr>
  677. <tr>
  678. <td>raw raid</td>
  679. <td align=center>14.8</td>
  680. <td align=center>1.0</td>
  681. <td align=center>12.4</td>
  682. <td align=center>12.4</td>
  683. </tr>
  684. </table>
  685. <p>
  686. The uncached sequential read performance is particularly bad. The
  687. problem is that these sequential reads require a random read of the
  688. index. Without assistance from the client, the read operations are
  689. not overlapped and do not benefit from the striping of the index. One
  690. possible solution is a form of read-ahead. When reading a block from
  691. the data log, it is feasible to also read several following blocks.
  692. These extra blocks can be added to the caches without referencing the
  693. index. If blocks are read in the same order they were written to the
  694. log, the latency of uncached index lookups will be avoided. This
  695. strategy should work well for streaming data such as multimedia files.
  696. <p>
  697. The basic assumption in Venti is that the growth in capacity of disks
  698. combined with the removal of duplicate blocks and compression of their
  699. contents enables a model in which it is not necessary to reclaim space
  700. by deleting archival data. To demonstrate why we believe this model
  701. is practical, we present some statistics derived from a decade's use
  702. of the Plan 9 file system.
  703. <p>
  704. The computing environment in which we work includes two Plan 9 file
  705. servers named bootes and emelie. Bootes was our primary file
  706. repository from 1990 until 1997 at which point it was superseded by
  707. emelie. Over the life of these two file servers there have been 522
  708. user accounts of which between 50 and 100 were active at any given
  709. time. The file servers have hosted numerous development projects and
  710. also contain several large data sets including chess end games,
  711. astronomical data, satellite imagery, and multimedia files.
  712. <p>
  713. Figure 6 depicts the size of the active file system as measured over
  714. time by du, the space consumed on the jukebox, and the size of the
  715. jukebox's data if it were to be stored on Venti. The ratio of the
  716. size of the archival data and the active file system is also given.
  717. As can be seen, even without using Venti, the storage required to
  718. implement the daily snapshots in Plan 9 is relatively modest, a result
  719. of the block level incremental approach to generating a snapshot.
  720. When the archival data is stored to Venti the cost of retaining the
  721. snapshots is reduced significantly. In the case of the emelie file
  722. system, the size on Venti is only slightly larger than the active file
  723. system; the cost of retaining the daily snapshots is almost zero.
  724. Note that the amount of storage that Venti uses for the snapshots
  725. would be the same even if more conventional methods were used to back
  726. up the file system. The Plan 9 approach to snapshots is not a
  727. necessity, since Venti will remove duplicate blocks.
  728. <p>
  729. <img src="bootes.gif" ALT="storage sizes for bootes">
  730. <img src="emelie.gif" ALT="storage sizes for emelie">
  731. <img src="bootes2.gif" ALT="ratio of sizes for bootes">
  732. <img src="emelie2.gif" ALT="ratio of sizes for emelie">
  733. <p>
  734. Figure 6. Graphs of the various sizes of two Plan 9 file servers.
  735. <p>
  736. When stored on Venti, the size of the jukebox data is reduced by three
  737. factors: elimination of duplicate blocks, elimination of block
  738. fragmentation, and compression of the block contents. Table 2
  739. presents the percent reduction for each of these factors. Note,
  740. bootes uses a 6 Kbyte block size while emelie uses 16 Kbyte, so the
  741. effect of removing fragmentation is more significant on emelie.
  742. <p>
  743. The 10 year history of the two Plan 9 file servers may be of interest
  744. to other researchers. We have made available per-block information
  745. including a hash of each block's contents, all the block pointers, and
  746. most of the directory information. The traces do not include the
  747. actual contents of files nor the file names. There is sufficient
  748. information to reconstruct the structure of the file system and to
  749. track the daily changes to this structure over time. The traces are
  750. available at http://www.cs.bell-labs.com/~seanq/p9trace.html.
  751. <p>
  752. Table 2. The percentage reduction in the size of data stored on
  753. Venti.
  754. <p>
  755. <table align=center>
  756. <tr>
  757. <th></th>
  758. <th width=150>bootes</th>
  759. <th width=150>emelie</th>
  760. </tr>
  761. <tr>
  762. <td>Elimination of duplicates</td>
  763. <td align=center>27.8%</td>
  764. <td align=center>31.3%</td>
  765. </tr>
  766. <tr>
  767. <td>Elimination of fragments</td>
  768. <td align=center>10.2%</td>
  769. <td align=center>25.4%</td>
  770. </tr>
  771. <tr>
  772. <td>Data Compression</td>
  773. <td align=center>33.8%</td>
  774. <td align=center>54.1%</td>
  775. </tr>
  776. <tr>
  777. <td>Total Reduction</td>
  778. <td align=center>59.7%</td>
  779. <td align=center>76.5%</td>
  780. </tr>
  781. </table>
  782. <p>
  783. <p>
  784. <h1>7. Reliability and Recovery</h1>
  785. <p>
  786. In concert with the development of the Venti prototype, we have built
  787. a collection of tools for integrity checking and error recovery.
  788. Example uses of these tools include: verifying the structure of an
  789. arena, checking there is an index entry for every block in the data
  790. log and vice versa, rebuilding the index from the data log, and
  791. copying an arena to removable media. These tools directly access the
  792. storage devices containing the data log and index and are executed on
  793. the server.
  794. <p>
  795. The directory structure at the end of each area enhances the
  796. efficiency of many integrity and recovery operations, since it is
  797. typically two orders of magnitude smaller than the arena, yet contains
  798. most of the needed information. The index checking utility, for
  799. example, is implemented as a disk based sort of all the arena
  800. directories, followed by a comparison between this sorted list and the
  801. index. Our prototype currently contains approximately 150 million
  802. blocks using 250 Gbytes of storage. An index check takes 2.2 hours,
  803. which is significantly less than the 6 hours it takes to read all the
  804. log data.
  805. <p>
  806. An additional integrity and recovery feature is the association of a
  807. type identifier with every block. This 8 bit identifier is included
  808. with all client read and write operations and has the effect of
  809. partitioning the server into multiple independent domains. The idea
  810. is that type indicates the interpretation of the data contained in the
  811. block. A client can use this feature, for example, to indicate that a
  812. block is the root node for a tree of blocks. Currently, the data
  813. format associated with a type is left entirely to the client; the
  814. server does not interpret the type other that to use it in conjunction
  815. with a fingerprint as the key with which to index a block.
  816. <p>
  817. One use of the type identifier is to assist the administrator in
  818. locating blocks for which a user has accidentally lost the
  819. fingerprint. Using a tool on the server, the data log can be scanned
  820. for blocks that match specified criteria, including the block type,
  821. the write time, and user identifier. The type makes it relatively
  822. simple to locate forgotten root blocks. Future uses for the type
  823. might include the ability for the server to determine the location of
  824. fingerprints within a block, enabling the server to traverse the data
  825. structures that have been stored.
  826. <p>
  827. By storing the data log on a RAID 5 disk array, our server is
  828. protected against single drive failures. Obviously, there are many
  829. scenarios where this is not sufficient: multiple drives may fail,
  830. there may be a fire in the machine room, the RAID firmware may contain
  831. bugs, or the device may be stolen.
  832. <p>
  833. Additional protection could be obtained by using one or more off-site
  834. mirrors for the server. We have not yet implemented this strategy,
  835. but the architecture of Venti makes this relatively simple. A
  836. background process on the server copies new blocks from the data log
  837. to the mirrors. This copying can be achieved using the Venti
  838. protocol; the server is simply another client to the mirror.
  839. <p>
  840. Even mirroring may not be sufficient. The implementation of Venti may
  841. contain bugs that can be exploited to compromise the server. An
  842. automated attack may delete data on many servers simultaneously.
  843. Storage devices that provide low level enforcement of a write-once
  844. policy would provide protection for such an attack. Write-once
  845. read-many optical jukeboxes often provide such protection, but this is
  846. not yet common for magnetic disk based storage systems. We have thus
  847. resorted to copying the sealed arenas onto removable media.
  848. <p>
  849. <h1>8. Related Work</h1>
  850. <p>
  851. The Stanford Archival Vault [2] is a prototype archival repository
  852. intended for digital libraries. The archive consists of a write-once
  853. log of digital objects (files) and several auxiliary indexes for
  854. locating objects within the log. Objects are identified by the hash
  855. of their contents using a cyclic redundancy check (CRC). Unlike
  856. Venti, this system has no way to share data between objects that are
  857. partially the same, or to build up complex data structures such as a
  858. file system hierarchy. Rather, the archive consists of a collection
  859. of separate objects with a limited ability to group objects into sets.
  860. <p>
  861. On Venti, blocks are organized into more complex data structures by
  862. creating hash-trees, an idea originally proposed by Merkle [11] for an
  863. efficient digital signature scheme.
  864. <p>
  865. The approach to block retrieval in the Read-Only Secure File System
  866. (SFSRO) [3] is comparable to Venti. Blocks are identified by the Sha1
  867. hash of their contents and this idea is applied recursively to build
  868. up more complex structures. The focus of this system is security, not
  869. archival storage. An administrator creates a digitally signed
  870. database offline. The database contains a public read-only file
  871. system that can be published on multiple servers and efficiently and
  872. securely accessed by clients. SFSRO outperforms traditional methods
  873. for providing data integrity between a client and a file server,
  874. demonstrating an attractive property of hash-based addressing.
  875. <p>
  876. Given their similarities, it would be simple to implement SFSRO on top
  877. of Venti. The goal of Venti is to provide a flexible location for
  878. archival storage and SFSRO is a good example of an application that
  879. could use this capability. In fact, using Venti would provide a
  880. trivial solution to SFSRO's problem with stale NFS handles since data
  881. is never deleted from Venti and thus a stale handle will never be
  882. encountered.
  883. <p>
  884. Content-Derived Names [6] are another example of naming objects based
  885. on a secure hash of its contents. This work addresses the issue of
  886. naming and managing the various binary software components, in
  887. particular shared libraries, that make up an application.
  888. <p>
  889. The philosophy of the Elephant file system [18] is similar to Venti;
  890. large, cheap disks make it feasible to retain many versions of data.
  891. A feature of the Elephant system is the ability to specify a variety
  892. of data retention policies, which can be applied to individual files
  893. or directories. These policies attempt to strike a balance between
  894. the costs and benefits of storing every version of a file. In
  895. contrast, Venti focuses on the problem of how to store information
  896. after deciding that it should be retained in perpetuity. A system
  897. such as the Elephant file system could incorporate Venti as the
  898. storage device for the permanent "landmark" versions of files, much as
  899. the Plan 9 file system will use Venti to archive snapshots.
  900. <p>
  901. Self-Securing Storage [19] retains all versions of file system data in
  902. order to provide diagnosis and recovery from security breaches. The
  903. system is implemented as a self-contained network service that exports
  904. an object-based disk interface, providing protection from compromise
  905. of the client operating system. Old data is retained for a window of
  906. time and then deleted to reclaim storage.
  907. <p>
  908. Venti provides many of the features of self-securing storage: the
  909. server is self-contained and accessed through a simple low-level
  910. protocol, malicious users cannot corrupt or delete existing data on
  911. the server, and old versions of data are available for inspection. It
  912. is unlikely that a system would write every file system operation to
  913. Venti since storage is never reclaimed, but not deleting data removes
  914. the constraint that an intrusion must be detected within a limited
  915. window of time. A hybrid approach might retain every version for some
  916. time and some versions for all time. Venti could provide the
  917. long-term storage for such a hybrid.
  918. <p>
  919. <h1>9. Future Work</h1>
  920. <p>
  921. Venti could be distributed across multiple machines; the approach of
  922. identifying data by a hash of its contents simplifies such an
  923. extension. For example, the IO performance could be improved by
  924. replicating the server and using a simple load balancing algorithm.
  925. When storing or retrieving a block, clients direct the operation to a
  926. server based on a few bits of the fingerprint. Such load balancing
  927. could even be hidden from the client application by interposing a
  928. proxy server that performs this operation on behalf of the client.
  929. <p>
  930. Today, Venti provides little security. After authenticating to the
  931. server, clients can read any block for which they know the
  932. fingerprint. A fingerprint does act as a capability since the space
  933. of fingerprints is large and the Venti protocol does not include a
  934. means of enumerating the blocks on the server. However, this
  935. protection is weak as a single root fingerprint enables access to an
  936. entire file tree and once a fingerprint is known, there is no way to
  937. restrict access to a particular user. We are exploring ways of
  938. providing better access control.
  939. <p>
  940. To date, the structures we have used for storing data on Venti break
  941. files into a series of fixed sized blocks. Identical blocks are
  942. consolidated on Venti, but this consolidation will not occur if the
  943. data is shifted within the file or an application uses a different
  944. block size. This limitation can be overcome using an adaptation of
  945. Manber's algorithm for finding similarities in files [9]. The idea is
  946. to break files into variable sized blocks based on the identification
  947. of anchor or break points, increasing the occurrence of duplicate
  948. blocks [12]. Such a strategy can be implemented in client
  949. applications with no change to the Venti server.
  950. <p>
  951. A more detailed analysis of the decade of daily snapshots of the Plan
  952. 9 file systems might be interesting. The trace data we have made
  953. publicly available contains approximately the same information used
  954. for other studies of long term file activity [4].
  955. <p>
  956. <h1>10. Conclusion</h1>
  957. <p>
  958. The approach of identifying a block by the Sha1 hash of its contents
  959. is well suited to archival storage. The write-once model and the
  960. ability to coalesce duplicate copies of a block makes Venti a useful
  961. building block for a number of interesting storage applications.
  962. <p>
  963. The large capacity of magnetic disks allows archival data to be
  964. retained and available on-line with performance that is comparable to
  965. conventional disks. Stored on our prototype server is over a decade
  966. of daily snapshots of two major departmental file servers. These
  967. snapshots are stored in a little over 200 Gbytes of disk space.
  968. Today, 100 Gbytes drives cost less than $300 and IDE RAID controllers
  969. are included on many motherboards. A scaled down version of our
  970. server could provide archival storage for a home user at an attractive
  971. price. Tomorrow, when terabyte disks can be had for the same price,
  972. it seems unlikely that archival data will be deleted to reclaim space.
  973. Venti provides an attractive approach to storing that data.
  974. <p>
  975. <h1>11. Acknowledgments</h1>
  976. <p>
  977. This paper was improved by comments and suggestions from Peter Bosch,
  978. Eric Grosse, Lorenz Huelsbergen, Rob Pike, Ross Quinlan, and Cliff
  979. Young and six anonymous reviewers. The paper's shepherd was Ethan L.
  980. Miller. We thank them all for their help.
  981. <p>
  982. <h1>12. References</h1>
  983. <p>
  984. [1] Ann Chervenak, Vivekenand Vellanki, and Zachary Kurmas.
  985. Protecting file systems: A survey of backup techniques. In
  986. Proceedings Joint NASA and IEEE Mass Storage Conference, March 1998.
  987. <p>
  988. [2] Arturo Crespo and Hector Garcia-Molina. Archival storage for
  989. digital libraries. In Proceedings of the 3rd ACM International
  990. Conference on Digital Libraries, 1998.
  991. <p>
  992. [3] Kevin Fu, Frans Kaashoek, and David Mazières. Fast and secure
  993. distributed read-only file system. In Proceedings of the 4th
  994. Symposium on Operating Systems Design and Implementation, 2000.
  995. <p>
  996. [4] Timothy J. Gibson, Ethan L. Miller, and Darrell D. E. Long.
  997. Long-term file activity and inter-reference patterns. In Proceedings,
  998. 24th International Conference on Technology Management and Performance
  999. Evaluation of Enterprise-Wide Information Systems, Computer
  1000. Measurement Group, December 1998.
  1001. <p>
  1002. [5] Dave Hitz, James Lau, and Michael Malcolm, File system design for
  1003. an NFS file server appliance, In Proceedings of the Winter 1994 USENIX
  1004. Conference, San Francisco, CA, January 1994.
  1005. <p>
  1006. [6] J. K. Hollingsworth and E. L. Miller. Using content-derived names
  1007. for configuration management. In Proceeding of the 1997 ACM Symposium
  1008. on Software Reusability, Boston, May 1997.
  1009. <p>
  1010. [7] John Howard, Michael Kazar, Sherri Menees, David Nichols, Mahadev
  1011. Satyanarayanan, Robert Sidebotham, and Michael West. Scale and
  1012. performance in a distributed file system. ACM Transactions on
  1013. Computer Systems, 6(1):51-81, February 1988.
  1014. <p>
  1015. [8] Norman C. Hutchinson, Stephen Manley, Mike Federwisch, Guy Harris,
  1016. Dave Hitz, Steven Kleiman, and Sean O'Malley. Logical vs. physical
  1017. file system backup. In Proceedings of the 3rd USENIX Symposium on
  1018. Operating Systems Design and Implementation (OSDI), 1999.
  1019. <p>
  1020. [9] Udi Manber. Finding similar files in a large file system. In
  1021. Proceedings of the Winter 1994 USENIX Conference, San Francisco, CA,
  1022. January 1994.
  1023. <p>
  1024. [10] Alfred J. Menezes, Paul C. van Oorschot, and Scott A. Vanstone.
  1025. Handbook of Applied Cryptography. CRC Press, 1996.
  1026. <p>
  1027. [11] Ralph C. Merkle. Protocols for public-key cryptosystems. In
  1028. Proceedings of the IEEE Symposium on Security and Privacy, pp.
  1029. 122-133, April 1980.
  1030. <p>
  1031. [12] Athicha Muthitacharoen, Benjie Chen, and David Mazières. A
  1032. low-bandwidth network file system. In Proceedings of the 18th
  1033. Symposium on Operating Systems Principles, October 2001.
  1034. <p>
  1035. [13] National Institute of Standards and Technology, FIPS 180-1.
  1036. Secure Hash Standard. US Department of Commerce, April 1995.
  1037. <p>
  1038. [14] National Institute of Standards and Technology, Draft FIPS 180-2.
  1039. Secure Hash Standard. US Department of Commerce, May 2001.
  1040. <p>
  1041. [15] Evi Nemeth, Garth Snyder, Scott Seebass, and Trent R. Hein. UNIX
  1042. System Administration Handbook 3rd Edition, Prentice Hall, 2001.
  1043. <p>
  1044. [16] Rob Pike, Dave Presotto, Sean Dorward, Bob Flandrena, Ken
  1045. Thompson, Howard Trickey, and Phil Winterbottom. Plan 9 from Bell
  1046. Labs, Computing Systems, Vol. 8, 3, pp. 221-254, Summer 1995.
  1047. <p>
  1048. [17] Sean Quinlan. A cache worm file system. Software-Practice and
  1049. Experience, Vol 21, 12, pp 1289-1299, December 1991.
  1050. <p>
  1051. [18] Douglas S. Santry, Michael J. Feeley, Norman C. Hutchinson,
  1052. Alistair C. Veitch, Ross W. Carton and Jacob Ofir. Deciding when to
  1053. forget in the Elephant file system. In Proceedings of the 17th
  1054. Symposium on Operating Systems Principles, December 12-15, 1999.
  1055. <p>
  1056. [19] John. D. Strunk, Garth R. Goodson, Michael L. Scheinholtz, Craig
  1057. A.N. Soules, and Gregory R. Ganger. Self-securing storage: protecting
  1058. data in compromised systems. In Proceedings of the 4th Symposium on
  1059. Operating Systems Design and Implementation, October 2000.
  1060. <p>
  1061. [20] D. A. Thompson and J. S. Best. The future of magnetic data
  1062. storage technology, IBM Journal of Research and Development, Vol 44,
  1063. 3, pp. 311-322, May 2000.
  1064. <p>
  1065. [21] J. Ziv and A. Lempel. A universal algorithm for sequential data
  1066. compression, IEEE Trans. Inform. Theory, vol. IT-23, pp. 337-343,
  1067. May 1977.
  1068. <p>