123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495 |
- .TH VENTI 8
- .SH NAME
- venti \- archival storage server
- .SH SYNOPSIS
- .in +0.25i
- .ti -0.25i
- .B venti/venti
- [
- .B -Ldrs
- ]
- [
- .B -a
- .I address
- ]
- [
- .B -B
- .I blockcachesize
- ]
- [
- .B -c
- .I config
- ]
- [
- .B -C
- .I lumpcachesize
- ]
- [
- .B -h
- .I httpaddress
- ]
- [
- .B -I
- .I indexcachesize
- ]
- [
- .B -W
- .I webroot
- ]
- .SH DESCRIPTION
- Venti
- is a SHA1-addressed archival storage server.
- See
- .IR venti (6)
- for a full introduction to the system.
- This page documents the structure and operation of the server.
- .PP
- A venti server requires multiple disks or disk partitions,
- each of which must be properly formatted before the server
- can be run.
- .SS Disk
- The venti server maintains three disk structures, typically
- stored on raw disk partitions:
- the append-only
- .IR "data log" ,
- which holds, in sequential order,
- the contents of every block written to the server;
- the
- .IR index ,
- which helps locate a block in the data log given its score;
- and optionally the
- .IR "bloom filter" ,
- a concise summary of which scores are present in the index.
- The data log is the primary storage.
- To improve the robustness, it should be stored on
- a device that provides RAID functionality.
- The index and the bloom filter are optimizations
- employed to access the data log efficiently and can be rebuilt
- if lost or damaged.
- .PP
- The data log is logically split into sections called
- .IR arenas ,
- typically sized for easy offline backup
- (e.g., 500MB).
- A data log may comprise many disks, each storing
- one or more arenas.
- Such disks are called
- .IR "arena partitions" .
- Arena partitions are filled in the order given in the configuration.
- .PP
- The index is logically split into block-sized pieces called
- .IR buckets ,
- each of which is responsible for a particular range of scores.
- An index may be split across many disks, each storing many buckets.
- Such disks are called
- .IR "index sections" .
- .PP
- The index must be sized so that no bucket is full.
- When a bucket fills, the server must be shut down and
- the index made larger.
- Since scores appear random, each bucket will contain
- approximately the same number of entries.
- Index entries are 40 bytes long. Assuming that a typical block
- being written to the server is 8192 bytes and compresses to 4096
- bytes, the active index is expected to be about 1% of
- the active data log.
- Storing smaller blocks increases the relative index footprint;
- storing larger blocks decreases it.
- To allow variation in both block size and the random distribution
- of scores to buckets, the suggested index size is 5% of
- the active data log.
- .PP
- The (optional) bloom filter is a large bitmap that is stored on disk but
- also kept completely in memory while the venti server runs.
- It helps the venti server efficiently detect scores that are
- .I not
- already stored in the index.
- The bloom filter starts out zeroed.
- Each score recorded in the bloom filter is hashed to choose
- .I nhash
- bits to set in the bloom filter.
- A score is definitely not stored in the index of any of its
- .I nhash
- bits are not set.
- The bloom filter thus has two parameters:
- .I nhash
- (maximum 32)
- and the total bitmap size
- (maximum 512MB, 2\s-2\u32\d\s+2 bits).
- .PP
- The bloom filter should be sized so that
- .I nhash
- \(mu
- .I nblock
- \(<=
- 0.7 \(mu
- .IR b ,
- where
- .I nblock
- is the expected number of blocks stored on the server
- and
- .I b
- is the bitmap size in bits.
- The false positive rate of the bloom filter when sized
- this way is approximately 2\s-2\u\-\fInblock\fR\d\s+2.
- .I Nhash
- less than 10 are not very useful;
- .I nhash
- greater than 24 are probably a waste of memory.
- .I Fmtbloom
- (see
- .IR venti-fmt (8))
- can be given either
- .I nhash
- or
- .IR nblock ;
- if given
- .IR nblock ,
- it will derive an appropriate
- .IR nhash .
- .SS Memory
- Venti can make effective use of large amounts of memory
- for various caches.
- .PP
- The
- .I "lump cache
- holds recently-accessed venti data blocks, which the server refers to as
- .IR lumps .
- The lump cache should be at least 1MB but can profitably be much larger.
- The lump cache can be thought of as the level-1 cache:
- read requests handled by the lump cache can
- be served instantly.
- .PP
- The
- .I "block cache
- holds recently-accessed
- .I disk
- blocks from the arena partitions.
- The block cache needs to be able to simultaneously hold two blocks
- from each arena plus four blocks for the currently-filling arena.
- The block cache can be thought of as the level-2 cache:
- read requests handled by the block cache are slower than those
- handled by the lump cache, since the lump data must be extracted
- from the raw disk blocks and possibly decompressed, but no
- disk accesses are necessary.
- .PP
- The
- .I "index cache
- holds recently-accessed or prefetched
- index entries.
- The index cache needs to be able to hold index entries
- for three or four arenas, at least, in order for prefetching
- to work properly. Each index entry is 50 bytes.
- Assuming 500MB arenas of
- 128,000 blocks that are 4096 bytes each after compression,
- the minimum index cache size is about 6MB.
- The index cache can be thought of as the level-3 cache:
- read requests handled by the index cache must still go
- to disk to fetch the arena blocks, but the costly random
- access to the index is avoided.
- .PP
- The size of the index cache determines how long venti
- can sustain its `burst' write throughput, during which time
- the only disk accesses on the critical path
- are sequential writes to the arena partitions.
- For example, if you want to be able to sustain 10MB/s
- for an hour, you need enough index cache to hold entries
- for 36GB of blocks. Assuming 8192-byte blocks,
- you need room for almost five million index entries.
- Since index entries are 50 bytes each, you need 250MB
- of index cache.
- If the background index update process can make a single
- pass through the index in an hour, which is possible,
- then you can sustain the 10MB/s indefinitely (at least until
- the arenas are all filled).
- .PP
- The
- .I "bloom filter
- requires memory equal to its size on disk,
- as discussed above.
- .PP
- A reasonable starting allocation is to
- divide memory equally (in thirds) between
- the bloom filter, the index cache, and the lump and block caches;
- the third of memory allocated to the lump and block caches
- should be split unevenly, with more (say, two thirds)
- going to the block cache.
- .SS Network
- The venti server announces two network services, one
- (conventionally TCP port
- .BR venti ,
- 17034) serving
- the venti protocol as described in
- .IR venti (6),
- and one serving HTTP
- (conventionally TCP port
- .BR venti ,
- 80).
- .PP
- The venti web server provides the following
- URLs for accessing status information:
- .TP
- .B /index
- A summary of the usage of the arenas and index sections.
- .TP
- .B /xindex
- An XML version of
- .BR /index .
- .TP
- .B /storage
- Brief storage totals.
- .TP
- .BI /set/ variable
- The current integer value of
- .IR variable .
- Variables are:
- .BR compress ,
- whether or not to compress blocks
- (for debugging);
- .BR logging ,
- whether to write entries to the debugging logs;
- .BR stats ,
- whether to collect run-time statistics;
- .BR icachesleeptime ,
- the time in milliseconds between successive updates
- of megabytes of the index cache;
- .BR arenasumsleeptime ,
- the time in milliseconds between reads while
- checksumming an arena in the background.
- The two sleep times should be (but are not) managed by venti;
- they exist to provide more experience with their effects.
- The other variables exist only for debugging and
- performance measurement.
- .TP
- .BI /set/ variable / value
- Set
- .I variable
- to
- .IR value .
- .TP
- .BI /graph/ name / param / param / \fR...
- A PNG image graphing the named run-time statistic over time.
- The details of names and parameters are undocumented;
- see
- .B httpd.c
- in the venti sources.
- .TP
- .B /log
- A list of all debugging logs present in the server's memory.
- .TP
- .BI /log/ name
- The contents of the debugging log with the given
- .IR name .
- .TP
- .B /flushicache
- Force venti to begin flushing the index cache to disk.
- The request response will not be sent until the flush
- has completed.
- .TP
- .B /flushdcache
- Force venti to begin flushing the arena block cache to disk.
- The request response will not be sent until the flush
- has completed.
- .PD
- .PP
- Requests for other files are served by consulting a
- directory named in the configuration file
- (see
- .B webroot
- below).
- .SS Configuration File
- A venti configuration file
- enumerates the various index sections and
- arenas that constitute a venti system.
- The components are indicated by the name of the file, typically
- a disk partition, in which they reside. The configuration
- file is the only location that file names are used. Internally,
- venti uses the names assigned when the components were formatted
- with
- .I fmtarenas
- or
- .I fmtisect
- (see
- .IR venti-fmt (8)).
- In particular, only the configuration needs to be
- changed if a component is moved to a different file.
- .PP
- The configuration file consists of lines in the form described below.
- Lines starting with
- .B #
- are comments.
- .TP
- .BI index " name
- Names the index for the system.
- .TP
- .BI arenas " file
- .I File
- is an arena partition, formatted using
- .IR fmtarenas .
- .TP
- .BI isect " file
- .I File
- is an index section, formatted using
- .IR fmtisect .
- .TP
- .BI bloom " file
- .I File
- is a bloom filter, formatted using
- .IR fmtbloom .
- .PD
- .PP
- After formatting a venti system using
- .IR fmtindex ,
- the order of arenas and index sections should not be changed.
- Additional arenas can be appended to the configuration;
- run
- .I fmtindex
- with the
- .B -a
- flag to update the index.
- .PP
- The configuration file also holds configuration parameters
- for the venti server itself.
- These are:
- .TF httpaddr netaddr
- .TP
- .BI mem " size
- lump cache size
- .TP
- .BI bcmem " size
- block cache size
- .TP
- .BI icmem " size
- index cache size
- .TP
- .BI addr " netaddr
- network address to announce venti service
- (default
- .BR tcp!*!venti )
- .TP
- .BI httpaddr " netaddr
- network address to announce HTTP service
- (default
- .BR tcp!*!http )
- .TP
- .B queuewrites
- queue writes in memory
- (default is not to queue)
- .TP
- .BI webroot " dir
- directory tree containing files for HTTP server
- to consult for unrecognized URLs
- .PD
- .PP
- The units for the various cache sizes above can be specified by appending a
- .LR k ,
- .LR m ,
- or
- .LR g
- (case-insensitive)
- to indicate kilobytes, megabytes, or gigabytes respectively.
- .PP
- The
- .I file
- name in the configuration lines above can be of the form
- .IB file : lo - hi
- to specify a range of the file.
- .I Lo
- and
- .I hi
- are specified in bytes but can have the usual
- .BI k ,
- .BI m ,
- or
- .B g
- suffixes.
- Either
- .I lo
- or
- .I hi
- may be omitted.
- This notation eliminates the need to
- partition raw disks on non-Plan 9 systems.
- .SS Command Line
- Many of the options to Venti duplicate parameters that
- can be specified in the configuration file.
- The command line options override those found in a
- configuration file.
- Additional options are:
- .TP
- .BI -c " config
- The server configuration file
- (default
- .BR venti.conf )
- .TP
- .B -d
- Produce various debugging information on standard error.
- Implies
- .BR -s .
- .TP
- .B -L
- Enable logging. By default all logging is disabled.
- Logging slows server operation considerably.
- .TP
- .B -r
- Allow only read access to the venti data.
- .TP
- .B -s
- Do not run in the background.
- Normally,
- the foreground process will exit once the Venti server
- is initialized and ready for connections.
- .PD
- .SH EXAMPLE
- A simple configuration:
- .IP
- .EX
- % cat venti.conf
- index main
- isect /tmp/disks/isect0
- isect /tmp/disks/isect1
- arenas /tmp/disks/arenas
- bloom /tmp/disks/bloom
- mem 10M
- bcmem 20M
- icmem 30M
- %
- .EE
- .PP
- Format the index sections, the arena partition, and
- finally the main index:
- .IP
- .EX
- % venti/fmtisect isect0. /tmp/disks/isect0 &
- % venti/fmtisect isect1. /tmp/disks/isect1 &
- % venti/fmtarenas arenas0. /tmp/disks/arenas &
- % venti/fmtbloom /tmp/disks/bloom &
- % wait
- % venti/fmtindex venti.conf
- %
- .EE
- .PP
- Start the server and check the storage statistics:
- .IP
- .EX
- % venti/venti
- % hget http://$sysname/storage
- .EE
- .SH SOURCE
- .B /sys/src/cmd/venti/srv
- .SH "SEE ALSO"
- .IR venti (1),
- .IR venti (2),
- .IR venti (6),
- .IR venti-backup (8)
- .IR venti-fmt (8)
- .br
- Sean Quinlan and Sean Dorward,
- ``Venti: a new approach to archival storage'',
- .I "Usenix Conference on File and Storage Technologies" ,
- 2002.
- .SH BUGS
- Setting up a venti server is too complicated.
- .PP
- Venti should not require the user to decide how to
- partition its memory usage.
|