  1. .TH VENTI 8
  2. .SH NAME
  3. venti \- archival storage server
  4. .SH SYNOPSIS
  5. .in +0.25i
  6. .ti -0.25i
  7. .B venti/venti
  8. [
  9. .B -Ldrs
  10. ]
  11. [
  12. .B -a
  13. .I address
  14. ]
  15. [
  16. .B -B
  17. .I blockcachesize
  18. ]
  19. [
  20. .B -c
  21. .I config
  22. ]
  23. [
  24. .B -C
  25. .I lumpcachesize
  26. ]
  27. [
  28. .B -h
  29. .I httpaddress
  30. ]
  31. [
  32. .B -I
  33. .I indexcachesize
  34. ]
  35. [
  36. .B -W
  37. .I webroot
  38. ]
  39. .SH DESCRIPTION
  40. Venti
  41. is a SHA1-addressed archival storage server.
  42. See
  43. .IR venti (6)
  44. for a full introduction to the system.
  45. This page documents the structure and operation of the server.
  46. .PP
  47. A venti server requires multiple disks or disk partitions,
  48. each of which must be properly formatted before the server
  49. can be run.
  50. .SS Disk
  51. The venti server maintains three disk structures, typically
  52. stored on raw disk partitions:
  53. the append-only
  54. .IR "data log" ,
  55. which holds, in sequential order,
  56. the contents of every block written to the server;
  57. the
  58. .IR index ,
  59. which helps locate a block in the data log given its score;
  60. and optionally the
  61. .IR "bloom filter" ,
  62. a concise summary of which scores are present in the index.
  63. The data log is the primary storage.
  64. To improve robustness, it should be stored on
  65. a device that provides RAID functionality.
  66. The index and the bloom filter are optimizations
  67. employed to access the data log efficiently and can be rebuilt
  68. if lost or damaged.
  69. .PP
  70. The data log is logically split into sections called
  71. .IR arenas ,
  72. typically sized for easy offline backup
  73. (e.g., 500MB).
  74. A data log may comprise many disks, each storing
  75. one or more arenas.
  76. Such disks are called
  77. .IR "arena partitions" .
  78. Arena partitions are filled in the order given in the configuration.
  79. .PP
  80. The index is logically split into block-sized pieces called
  81. .IR buckets ,
  82. each of which is responsible for a particular range of scores.
  83. An index may be split across many disks, each storing many buckets.
  84. Such disks are called
  85. .IR "index sections" .
  86. .PP
  87. The index must be sized so that no bucket is full.
  88. When a bucket fills, the server must be shut down and
  89. the index made larger.
  90. Since scores appear random, each bucket will contain
  91. approximately the same number of entries.
  92. Index entries are 40 bytes long. Assuming that a typical block
  93. being written to the server is 8192 bytes and compresses to 4096
  94. bytes, the active index is expected to be about 1% of
  95. the active data log.
  96. Storing smaller blocks increases the relative index footprint;
  97. storing larger blocks decreases it.
  98. To allow variation in both block size and the random distribution
  99. of scores to buckets, the suggested index size is 5% of
  100. the active data log.
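.PP
The arithmetic above can be sketched in a few lines of Python.
This is only an illustration: the function names are invented for the example and are not part of venti, and the 40-byte entry size and the 5% rule are taken from the text.

```python
INDEX_ENTRY_BYTES = 40  # on-disk index entry size quoted in the text


def index_fraction(stored_block_bytes):
    """Expected index size as a fraction of the data log, given the
    average size of a block as stored (that is, after compression)."""
    return INDEX_ENTRY_BYTES / stored_block_bytes


def suggested_index_bytes(data_log_bytes, percent=5):
    """Suggested index size: 5% of the active data log by default.
    Integer arithmetic avoids floating-point rounding."""
    return data_log_bytes * percent // 100


# Blocks that compress to 4096 bytes give an index of roughly 1%.
print(f"{index_fraction(4096):.2%}")
# Smaller stored blocks raise the relative footprint.
print(f"{index_fraction(1024):.2%}")
# Suggested index size for a 500GB data log (decimal gigabytes assumed).
print(suggested_index_bytes(500 * 10**9))
```

The gap between the 1% estimate and the 5% suggestion is the headroom for small blocks and for uneven distribution of scores among buckets.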
  101. .PP
  102. The (optional) bloom filter is a large bitmap that is stored on disk but
  103. also kept completely in memory while the venti server runs.
  104. It helps the venti server efficiently detect scores that are
  105. .I not
  106. already stored in the index.
  107. The bloom filter starts out zeroed.
  108. Each score recorded in the bloom filter is hashed to choose
  109. .I nhash
  110. bits to set in the bloom filter.
  111. A score is definitely not stored in the index if any of its
  112. .I nhash
  113. bits are not set.
  114. The bloom filter thus has two parameters:
  115. .I nhash
  116. (maximum 32)
  117. and the total bitmap size
  118. (maximum 512MB, 2\s-2\u32\d\s+2 bits).
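.PP
The behavior described above can be illustrated with a toy bloom filter in Python.
This is a minimal sketch, not venti's implementation: the class name is invented, and the way bit positions are derived from a score (SHA1 over the score plus a counter) is an assumption made only for the example.

```python
import hashlib


class ToyBloom:
    """Toy bloom filter: nbits-wide bitmap, nhash bits set per score."""

    def __init__(self, nbits, nhash):
        self.nbits = nbits
        self.nhash = nhash
        self.bits = bytearray((nbits + 7) // 8)  # starts out zeroed

    def _positions(self, score):
        # Assumption for illustration: hash the score with a counter
        # to pick nhash bit positions. Venti's real scheme may differ.
        for i in range(self.nhash):
            digest = hashlib.sha1(score + i.to_bytes(4, "big")).digest()
            yield int.from_bytes(digest[:8], "big") % self.nbits

    def add(self, score):
        for pos in self._positions(score):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def maybe_present(self, score):
        # False means "definitely not in the index";
        # True means "possibly in the index" (false positives can occur).
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(score))


bloom = ToyBloom(nbits=1 << 16, nhash=8)
score = hashlib.sha1(b"some block contents").digest()
print(bloom.maybe_present(score))  # nothing recorded yet
bloom.add(score)
print(bloom.maybe_present(score))  # recorded scores are always reported
```

A negative answer lets the server skip the index lookup entirely; a positive answer still requires consulting the index.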
  119. .PP
  120. The bloom filter should be sized so that
  121. .I nhash
  122. \(mu
  123. .I nblock
  124. \(<=
  125. 0.7 \(mu
  126. .IR b ,
  127. where
  128. .I nblock
  129. is the expected number of blocks stored on the server
  130. and
  131. .I b
  132. is the bitmap size in bits.
  133. The false positive rate of the bloom filter when sized
  134. this way is approximately 2\s-2\u\-\fInhash\fR\d\s+2.
  135. Values of
  136. .I nhash
  137. less than 10 are not very useful;
  138. values greater than 24 are probably a waste of memory.
  139. .I Fmtbloom
  140. (see
  141. .IR venti-fmt (8))
  142. can be given either
  143. .I nhash
  144. or
  145. .IR nblock ;
  146. if given
  147. .IR nblock ,
  148. it will derive an appropriate
  149. .IR nhash .
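.PP
The sizing rule can be worked through numerically.
The Python sketch below applies the inequality nhash * nblock <= 0.7 * b directly; it is not necessarily the calculation that fmtbloom performs, and the function names are hypothetical.

```python
MAX_NHASH = 32        # maximum nhash stated in the text
MAX_BITS = 2 ** 32    # maximum bitmap size (512MB) stated in the text


def choose_nhash(nblock, nbits):
    """Largest nhash satisfying nhash * nblock <= 0.7 * nbits,
    clamped to the range 1..MAX_NHASH."""
    nhash = (7 * nbits) // (10 * nblock)
    return max(1, min(nhash, MAX_NHASH))


def min_bits(nhash, nblock):
    """Smallest bitmap (in bits) satisfying the same inequality."""
    return -(-10 * nhash * nblock // 7)  # ceiling division


def false_positive_rate(nhash):
    """Approximate false positive rate when sized by the rule above."""
    return 2.0 ** -nhash


# A full-size 512MB bitmap with 200 million expected blocks.
n = choose_nhash(200_000_000, MAX_BITS)
print(n, false_positive_rate(n))
# Bits needed for nhash = 16 and 100 million blocks.
print(min_bits(16, 100_000_000))
```

With the maximum bitmap, block counts beyond a few hundred million push the usable nhash below 10, which is the range the text describes as not very useful.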
  150. .SS Memory
  151. Venti can make effective use of large amounts of memory
  152. for various caches.
  153. .PP
  154. The
  155. .I "lump cache
  156. holds recently-accessed venti data blocks, which the server refers to as
  157. .IR lumps .
  158. The lump cache should be at least 1MB but can profitably be much larger.
  159. The lump cache can be thought of as the level-1 cache:
  160. read requests handled by the lump cache can
  161. be served instantly.
  162. .PP
  163. The
  164. .I "block cache
  165. holds recently-accessed
  166. .I disk
  167. blocks from the arena partitions.
  168. The block cache needs to be able to simultaneously hold two blocks
  169. from each arena plus four blocks for the currently-filling arena.
  170. The block cache can be thought of as the level-2 cache:
  171. read requests handled by the block cache are slower than those
  172. handled by the lump cache, since the lump data must be extracted
  173. from the raw disk blocks and possibly decompressed, but no
  174. disk accesses are necessary.
  175. .PP
  176. The
  177. .I "index cache
  178. holds recently-accessed or prefetched
  179. index entries.
  180. The index cache needs to be able to hold index entries
  181. for three or four arenas, at least, in order for prefetching
  182. to work properly. Each index entry is 50 bytes.
  183. Assuming 500MB arenas of
  184. 128,000 blocks that are 4096 bytes each after compression,
  185. the minimum index cache size is about 19MB to 26MB.
  186. The index cache can be thought of as the level-3 cache:
  187. read requests handled by the index cache must still go
  188. to disk to fetch the arena blocks, but the costly random
  189. access to the index is avoided.
  190. .PP
  191. The size of the index cache determines how long venti
  192. can sustain its `burst' write throughput, during which time
  193. the only disk accesses on the critical path
  194. are sequential writes to the arena partitions.
  195. For example, if you want to be able to sustain 10MB/s
  196. for an hour, you need enough index cache to hold entries
  197. for 36GB of blocks. Assuming 8192-byte blocks,
  198. you need room for almost five million index entries.
  199. Since index entries are 50 bytes each, you need 250MB
  200. of index cache.
  201. If the background index update process can make a single
  202. pass through the index in an hour, which is possible,
  203. then you can sustain the 10MB/s indefinitely (at least until
  204. the arenas are all filled).
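.PP
The index cache figures above follow from simple arithmetic, sketched here in Python.
The helper names are invented for the example, megabytes are assumed to be decimal, and the 50-byte entry size is the one quoted in the text.

```python
ICACHE_ENTRY_BYTES = 50  # in-memory index entry size quoted in the text


def prefetch_icache_bytes(blocks_per_arena, arenas):
    """Cache needed to hold the index entries of a number of arenas."""
    return blocks_per_arena * arenas * ICACHE_ENTRY_BYTES


def burst_icache_bytes(rate_bytes_per_sec, seconds, block_bytes):
    """Cache needed to absorb a write burst at the given rate."""
    total_bytes = rate_bytes_per_sec * seconds
    nblocks = -(-total_bytes // block_bytes)  # ceiling division
    return nblocks * ICACHE_ENTRY_BYTES


# Arenas of 128,000 blocks: 6.4MB of entries per arena.
print(prefetch_icache_bytes(128_000, 3))   # three arenas
print(prefetch_icache_bytes(128_000, 4))   # four arenas

# 10MB/s for one hour with 8192-byte blocks.
print(burst_icache_bytes(10 * 10**6, 3600, 8192))
```

The burst calculation gives about 4.4 million entries, or roughly 220MB; the text rounds up to five million entries and 250MB, which leaves some margin.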
  205. .PP
  206. The
  207. .I "bloom filter
  208. requires memory equal to its size on disk,
  209. as discussed above.
  210. .PP
  211. A reasonable starting allocation is to
  212. divide memory equally (in thirds) between
  213. the bloom filter, the index cache, and the lump and block caches;
  214. the third of memory allocated to the lump and block caches
  215. should be split unevenly, with more (say, two thirds)
  216. going to the block cache.
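.PP
The starting allocation can be turned into configuration lines mechanically.
The following Python sketch is one way to do so; the function names are hypothetical, sizes are in megabytes, and the bloom filter share is only reported, since its size is fixed when the filter is formatted rather than set in the configuration file.

```python
def split_memory_mb(total_mb):
    """Divide memory in thirds: bloom filter, index cache, and the
    lump and block caches, with two thirds of the last share going
    to the block cache."""
    third = total_mb // 3
    bcmem = (2 * third) // 3
    mem = third - bcmem
    return {"bloom": third, "icmem": third, "bcmem": bcmem, "mem": mem}


def config_lines(total_mb):
    """Render the cache sizes as venti.conf parameter lines."""
    s = split_memory_mb(total_mb)
    return [
        f"mem {s['mem']}M",
        f"bcmem {s['bcmem']}M",
        f"icmem {s['icmem']}M",
    ]


for line in config_lines(900):
    print(line)
```

For 900MB of memory this yields a 100MB lump cache, a 200MB block cache, and a 300MB index cache, leaving 300MB for the bloom filter.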
  217. .SS Network
  218. The venti server announces two network services, one
  219. (conventionally TCP port
  220. .BR venti ,
  221. 17034) serving
  222. the venti protocol as described in
  223. .IR venti (6),
  224. and one serving HTTP
  225. (conventionally TCP port
  226. .BR http ,
  227. 80).
  228. .PP
  229. The venti web server provides the following
  230. URLs for accessing status information:
  231. .TP
  232. .B /index
  233. A summary of the usage of the arenas and index sections.
  234. .TP
  235. .B /xindex
  236. An XML version of
  237. .BR /index .
  238. .TP
  239. .B /storage
  240. Brief storage totals.
  241. .TP
  242. .BI /set/ variable
  243. The current integer value of
  244. .IR variable .
  245. Variables are:
  246. .BR compress ,
  247. whether or not to compress blocks
  248. (for debugging);
  249. .BR logging ,
  250. whether to write entries to the debugging logs;
  251. .BR stats ,
  252. whether to collect run-time statistics;
  253. .BR icachesleeptime ,
  254. the time in milliseconds between successive updates
  255. of megabytes of the index cache;
  256. .BR arenasumsleeptime ,
  257. the time in milliseconds between reads while
  258. checksumming an arena in the background.
  259. The two sleep times should be (but are not) managed by venti;
  260. they exist to provide more experience with their effects.
  261. The other variables exist only for debugging and
  262. performance measurement.
  263. .TP
  264. .BI /set/ variable / value
  265. Set
  266. .I variable
  267. to
  268. .IR value .
  269. .TP
  270. .BI /graph/ name / param / param / \fR...
  271. A PNG image graphing the named run-time statistic over time.
  272. The details of names and parameters are undocumented;
  273. see
  274. .B httpd.c
  275. in the venti sources.
  276. .TP
  277. .B /log
  278. A list of all debugging logs present in the server's memory.
  279. .TP
  280. .BI /log/ name
  281. The contents of the debugging log with the given
  282. .IR name .
  283. .TP
  284. .B /flushicache
  285. Force venti to begin flushing the index cache to disk.
  286. The request response will not be sent until the flush
  287. has completed.
  288. .TP
  289. .B /flushdcache
  290. Force venti to begin flushing the arena block cache to disk.
  291. The request response will not be sent until the flush
  292. has completed.
  293. .PD
  294. .PP
  295. Requests for other files are served by consulting a
  296. directory named in the configuration file
  297. (see
  298. .B webroot
  299. below).
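.PP
The status URLs can also be queried from a script.
The Python sketch below builds URLs of the forms listed above and fetches them with the standard library; the host name venti.example and the helper names are placeholders, not anything defined by venti.

```python
from urllib.parse import quote
from urllib.request import urlopen


def venti_url(host, *parts, port=80):
    """Build a URL such as http://host/set/logging/1 from path parts."""
    base = f"http://{host}" if port == 80 else f"http://{host}:{port}"
    path = "/".join(quote(str(p), safe="") for p in parts)
    return f"{base}/{path}"


def fetch(url, timeout=10.0):
    """Fetch a status page and return its body as text."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


# URLs for the storage totals and for turning on run-time statistics.
print(venti_url("venti.example", "storage"))
print(venti_url("venti.example", "set", "stats", 1))
# With a running server, fetch(venti_url(host, "storage")) returns the totals.
```

Because the flush URLs do not respond until the flush has completed, a script that fetches them may need a much longer timeout than the default used here.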
  300. .SS Configuration File
  301. A venti configuration file
  302. enumerates the various index sections and
  303. arenas that constitute a venti system.
  304. The components are indicated by the name of the file, typically
  305. a disk partition, in which they reside. The configuration
  306. file is the only place where file names are used. Internally,
  307. venti uses the names assigned when the components were formatted
  308. with
  309. .I fmtarenas
  310. or
  311. .I fmtisect
  312. (see
  313. .IR venti-fmt (8)).
  314. In particular, only the configuration needs to be
  315. changed if a component is moved to a different file.
  316. .PP
  317. The configuration file consists of lines in the form described below.
  318. Lines starting with
  319. .B #
  320. are comments.
  321. .TP
  322. .BI index " name
  323. Names the index for the system.
  324. .TP
  325. .BI arenas " file
  326. .I File
  327. is an arena partition, formatted using
  328. .IR fmtarenas .
  329. .TP
  330. .BI isect " file
  331. .I File
  332. is an index section, formatted using
  333. .IR fmtisect .
  334. .TP
  335. .BI bloom " file
  336. .I File
  337. is a bloom filter, formatted using
  338. .IR fmtbloom .
  339. .PD
  340. .PP
  341. After formatting a venti system using
  342. .IR fmtindex ,
  343. the order of arenas and index sections should not be changed.
  344. Additional arenas can be appended to the configuration;
  345. run
  346. .I fmtindex
  347. with the
  348. .B -a
  349. flag to update the index.
  350. .PP
  351. The configuration file also holds configuration parameters
  352. for the venti server itself.
  353. These are:
  354. .TF httpaddr netaddr
  355. .TP
  356. .BI mem " size
  357. lump cache size
  358. .TP
  359. .BI bcmem " size
  360. block cache size
  361. .TP
  362. .BI icmem " size
  363. index cache size
  364. .TP
  365. .BI addr " netaddr
  366. network address to announce venti service
  367. (default
  368. .BR tcp!*!venti )
  369. .TP
  370. .BI httpaddr " netaddr
  371. network address to announce HTTP service
  372. (default
  373. .BR tcp!*!http )
  374. .TP
  375. .B queuewrites
  376. queue writes in memory
  377. (default is not to queue)
  378. .TP
  379. .BI webroot " dir
  380. directory tree containing files for HTTP server
  381. to consult for unrecognized URLs
  382. .PD
  383. .PP
  384. The units for the various cache sizes above can be specified by appending a
  385. .LR k ,
  386. .LR m ,
  387. or
  388. .LR g
  389. (case-insensitive)
  390. to indicate kilobytes, megabytes, or gigabytes respectively.
  391. .PP
  392. The
  393. .I file
  394. name in the configuration lines above can be of the form
  395. .IB file : lo - hi
  396. to specify a range of the file.
  397. .I Lo
  398. and
  399. .I hi
  400. are specified in bytes but can have the usual
  401. .BR k ,
  402. .BR m ,
  403. or
  404. .B g
  405. suffixes.
  406. Either
  407. .I lo
  408. or
  409. .I hi
  410. may be omitted.
  411. This notation eliminates the need to
  412. partition raw disks on non-Plan 9 systems.
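.PP
The size suffixes and the range notation can be illustrated with a small parser.
This Python sketch is not venti's parser: the function names are invented, and it assumes that the suffixes denote binary multiples (1024, 1024 squared, 1024 cubed), which the text does not state.

```python
# Assumption: k, m, and g are binary multiples.
UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(text):
    """Parse a size such as 4096, 10M, or 2g (suffix case-insensitive)."""
    text = text.strip()
    if text and text[-1].lower() in UNITS:
        return int(text[:-1]) * UNITS[text[-1].lower()]
    return int(text)


def parse_file_range(spec):
    """Split file:lo-hi into (file, lo, hi). An omitted bound, or a
    name with no range at all, yields None for that bound."""
    name, sep, rng = spec.rpartition(":")
    if not sep or "-" not in rng:
        return spec, None, None
    lo, _, hi = rng.partition("-")
    return (name,
            parse_size(lo) if lo else None,
            parse_size(hi) if hi else None)


print(parse_size("10M"))
print(parse_file_range("/dev/sdC0/arenas:1g-2g"))
print(parse_file_range("/dev/sdC0/arenas:-500m"))
print(parse_file_range("/tmp/disks/bloom"))
```

A malformed bound raises ValueError in this sketch; how venti itself reports such errors is not covered here.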
  413. .SS Command Line
  414. Many of the options to Venti duplicate parameters that
  415. can be specified in the configuration file.
  416. The command line options override those found in a
  417. configuration file.
  418. Additional options are:
  419. .TP
  420. .BI -c " config
  421. The server configuration file
  422. (default
  423. .BR venti.conf )
  424. .TP
  425. .B -d
  426. Produce various debugging information on standard error.
  427. Implies
  428. .BR -s .
  429. .TP
  430. .B -L
  431. Enable logging. By default all logging is disabled.
  432. Logging slows server operation considerably.
  433. .TP
  434. .B -r
  435. Allow only read access to the venti data.
  436. .TP
  437. .B -s
  438. Do not run in the background.
  439. Normally,
  440. the foreground process will exit once the Venti server
  441. is initialized and ready for connections.
  442. .PD
  443. .SH EXAMPLE
  444. A simple configuration:
  445. .IP
  446. .EX
  447. % cat venti.conf
  448. index main
  449. isect /tmp/disks/isect0
  450. isect /tmp/disks/isect1
  451. arenas /tmp/disks/arenas
  452. bloom /tmp/disks/bloom
  453. mem 10M
  454. bcmem 20M
  455. icmem 30M
  456. %
  457. .EE
  458. .PP
  459. Format the index sections, the arena partition, and
  460. finally the main index:
  461. .IP
  462. .EX
  463. % venti/fmtisect isect0. /tmp/disks/isect0 &
  464. % venti/fmtisect isect1. /tmp/disks/isect1 &
  465. % venti/fmtarenas arenas0. /tmp/disks/arenas &
  466. % venti/fmtbloom /tmp/disks/bloom &
  467. % wait
  468. % venti/fmtindex venti.conf
  469. %
  470. .EE
  471. .PP
  472. Start the server and check the storage statistics:
  473. .IP
  474. .EX
  475. % venti/venti
  476. % hget http://$sysname/storage
  477. .EE
  478. .SH SOURCE
  479. .B /sys/src/cmd/venti/srv
  480. .SH "SEE ALSO"
  481. .IR venti (1),
  482. .IR venti (2),
  483. .IR venti (6),
  484. .IR venti-backup (8),
  485. .IR venti-fmt (8)
  486. .br
  487. Sean Quinlan and Sean Dorward,
  488. ``Venti: a new approach to archival storage'',
  489. .I "Usenix Conference on File and Storage Technologies" ,
  490. 2002.
  491. .SH BUGS
  492. Setting up a venti server is too complicated.
  493. .PP
  494. Venti should not require the user to decide how to
  495. partition its memory usage.