- .TH SCANMAIL 8
- .SH NAME
- scanmail, testscan \- spam filters
- .SH SYNOPSIS
- .B upas/scanmail
- [
- .I options
- ]
- [
- .I qer-args
- ]
- .I root
- .B mail
- .I sender system rcpt-list
- .PP
- .B upas/testscan
- [
- .B -avd
- ]
- [
- .B -p
- .I patfile
- ]
- [
- .I filename
- ]
- .SH DESCRIPTION
- .B Scanmail
- accepts a mail message supplied on standard input,
- applies a file of patterns to a portion of it,
- and dispatches
- the message based
- on the results.
- It exactly replaces the
- generic queuing command
- .IR qer (8)
- that is executed from the
- .IR rc (1)
- script
- .B /mail/lib/qmail
- in the mail processing pipeline.
- Each pattern has an associated
- .IR action .
- In order of decreasing priority, the actions are:
- .in +5
- .TP 10
- .B dump
- the message is deleted and a log entry is written to
- .B /sys/log/smtpd
- .TP 10
- .B hold
- the message is placed in a queue for human inspection
- .TP 10
- .B log
- a line containing the matching portion of the message is written to a log
- .in -5
- .PP
- If no pattern matches or only patterns with an action of
- .B log
- match, the message is accepted and
- .I scanmail
- queues the message for delivery.
- .I Scanmail
- meshes with the blocking facilities
- of
- .IR smtpd (6)
- to provide several layers of
- filtering on gateway systems. In all cases the sender
- is notified that the message has been successfully
- delivered,
- leaving the sender unaware that the message may have been delayed or deleted.
- .PP
- .I Scanmail
- accepts the arguments of
- .IR qer (8)
- as well as the following:
- .TF filename
- .TP
- .B -c
- Save a copy of each message in a
- randomly-named file in
- directory
- .BR /mail/copy .
- .TP
- .B -d
- Write debugging information to standard error.
- .TP
- .B -h
- Queue
- .I held
- messages by sending domain name.
- The
- .B -q
- option must specify a root directory; messages
- are queued in subdirectories of this directory.
- If the
- .B -h
- option is not specified,
- messages are accumulated in a subdirectory of
- .B /mail/queue.hold
- named for the contents of
- .BR /dev/user ,
- usually
- .BR none .
- .TF filename
- .TP
- .B -n
- Never hold messages for inspection; deliver them instead. This is also known as
- .IR "vacation mode" .
- .TP
- .BI -p " filename"
- Read the patterns from
- .I filename
- rather than
- .BR /mail/lib/patterns .
- .TP
- .BI -q " holdroot"
- Queue deliverable messages in subdirectories of
- .IR holdroot .
- This option is the same as the
- .B -q
- option of
- .IR qer (8)
- and must be present if the
- .B -h
- option is given.
- .TP
- .B -s
- Save deleted
- messages. Messages are stored, one per randomly-named file,
- in subdirectories of
- .B /mail/queue.dump
- named with the date.
- .TP
- .B -t
- Test mode. The pattern matcher is applied but the message is
- discarded and the result is not logged.
- .TP
- .B -v
- Print the highest priority match.
- This is useful
- with the
- .B -t
- option for testing the pattern matcher without actually
- sending a message.
- .PD
- .PP
- .I Testscan
- is the command line version of
- .IR scanmail .
- If
- .I filename
- is missing, it applies the pattern set to
- the message on standard input. Unlike
- .IR scanmail ,
- which finds the highest priority match,
- .I testscan
- prints all matches in the portion of the message under test.
- It is useful for testing a pattern set or
- implementing a personal filter
- using the
- .B pipeto
- file in a user's mail directory.
- .I Testscan
- accepts the following options:
- .TP
- .B -a
- Print matches in the complete input message.
- .TP
- .B -d
- Enable debug mode.
- .TP
- .B -v
- Print the message after conversion to canonical form
- .RI ( q.v. ).
- .TP
- .BI -p " filename"
- Read the patterns from
- .I filename
- rather than
- .BR /mail/lib/patterns .
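- .PP
- For instance, a new pattern set might be tried against a saved
- message before it is installed.
- The following is only a sketch; the file names
- .B patterns.new
- and
- .B msg
- are hypothetical.
- .PP
- .EX
- # print all matches in the scanned portion of a saved message
- upas/testscan -p patterns.new msg
- # match against the complete message, read from standard input
- upas/testscan -a -p patterns.new < msg
- .EE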
- .SS Canonicalization
- Before pattern matching, both programs convert a portion of
- the message header and the beginning of the
- message to a canonical form. The amounts of the header
- and message body processed are set by
- compile-time parameters in the source files.
- The canonicalization process converts letters to lower case and
- replaces consecutive spaces, tabs and newline characters
- with a single space. HTML commands are
- deleted except for the parameters following
- .B A
- .BR HREF ,
- .B IMG
- .BR SRC ,
- and
- .B IMG
- .B BORDER
- directives. Additionally, the following MIME escape sequences
- are replaced by their ASCII
- equivalents:
- .PP
- .EX
- Escape Seq    ASCII
- ----------    -----
- =2e           .
- =2f           /
- =20           <space>
- =3d           =
- .EE
- and the sequence
- .I =<newline>
- is elided.
- .I Scanmail
- assembles the sender, destination domain and recipient fields of
- the command line into a string that is
- subjected to the same canonical processing.
- Following canonicalization, the command line and
- the two long strings containing
- the header and the message body are passed to the
- matching engine for analysis.
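- .PP
- As a rough illustration, assuming the rules above are applied as
- described, an invented two-line fragment of a message body such as
- .PP
- .EX
- Visit  OUR=20Site at
- WWW=2eExample=2eCOM/Offer
- .EE
- .PP
- would be expected to reach the matching engine as
- .PP
- .EX
- visit our site at www.example.com/offer
- .EE
- .PP
- Because letters are folded to lower case before matching,
- patterns would normally be written in lower case.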
- .SS Pattern Syntax
- The matching engine compiles the pattern set
- and matches it to each canonicalized input string.
- Patterns are specified one per line
- as follows:
- .PP
- .EX
- {*}\fIaction\fP: \fIpattern-spec\fP {~~\fIoverride\fP...~~\fIoverride\fP}
- .EE
- .PP
- On all lines, a
- .B #
- introduces a comment; there is no way to escape this character.
- .PP
- Lines beginning with
- .B *
- contain a
- .I pattern-spec
- that is a string; otherwise, the
- .I pattern-spec
- is a regular expression in the style of
- .IR regexp (6).
- Regular expression matching is many
- times less efficient than string matching, so it is
- wiser to enumerate several similar strings
- than to combine them into a regular expression.
- The
- .I action
- is a keyword terminated by a
- .B :
- and separated from the pattern by optional white-space.
- It must be one of the following:
- .TP 10
- .B dump
- if the pattern matches, the message is deleted. If the
- .B -s
- command line option is set, the message is saved.
- .TP 10
- .B hold
- if the pattern matches, the message is queued in a subdirectory
- of
- .B /mail/queue.hold
- for manual inspection. After inspection, the queue can be swept
- manually using
- .B runq
- (see
- .IR qer (8))
- to deliver messages that were inadvertently matched.
- .TP 10
- .B header
- this is the same as the
- .B hold
- action, except the pattern is only applied to the message header.
- This optimization is useful for patterns that match header fields
- that are unlikely to be present in the body of the message.
- .TP 10
- .B line
- the sender and a section of the message around the match are written to
- the file
- .BR /sys/log/lines .
- The message is always delivered.
- .TP 10
- .B loff
- patterns of this type are applied only to the canonicalized command line.
- When a match occurs, all patterns with
- .B line
- actions are disabled. This is useful for limiting
- the size of the log file by excluding repetitive messages, such
- as those from mailing lists.
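- .PP
- As an illustration only, a small pattern file using each action
- might look like the following sketch.
- The strings are invented for the example and are not taken from
- any distributed pattern file.
- .PP
- .EX
- *dump: make money fast from home today   # string; message is deleted
- *hold: act now                           # string; queued for inspection
- *header: x-bulk-sender                   # string; header only, then held
- line: free (offer|trial)                 # regular expression; logged, delivered
- *loff: owner-9fans                       # string; command line only
- .EE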
- .PP
- Patterns are accumulated into pattern sets sharing the same action.
- The matching engine applies the
- .B dump
- pattern set first, then the
- .B header
- and
- .B hold
- pattern sets, and finally the
- .B line
- pattern set. Each pattern set is applied three times:
- to the canonicalized command line, to the message header, and
- finally to the message body. The ordering of patterns
- in the pattern file is insignificant.
- .PP
- The
- .I pattern-spec
- is a string of characters terminated by a
- .BR newline ,
- .B #
- or override indicator,
- .BR ~~ .
- Trailing white-space is deleted but
- patterns containing leading or trailing white-space can
- be enclosed in double-quote
- characters. A pattern containing a double-quote
- must be enclosed in double-quote characters, and each
- embedded double-quote must be preceded by a backslash.
- For example, the pattern
- .PP
- .EX
- "this is not \\"spam\\""
- .EE
- .PP
- matches the string \fLthis is not "spam"\fP.
- The
- .I pattern-spec
- is followed by zero or more
- .I override
- strings. When the specific pattern matches,
- each override is applied and
- if one matches, it cancels the effect of the pattern.
- Overrides must be strings; regular expressions are not supported.
- Each override is introduced by the string
- .BR ~~
- and continues until a subsequent
- .BR ~~ ,
- .B #
- or
- .BR newline ,
- white-space included.
- A
- .B ~~
- immediately followed by a
- .B newline
- indicates a line continuation and further overrides continue
- on the following line.
- Leading white-space
- on the continuation line is ignored. For example,
- .PP
- .EX
- *hold: sex.com~~essex.com~~sussex.com~~sysex.com~~
- lasex.com~~cse.psu.edu!owner-9fans
- .EE
- .PP
- matches all input containing the string
- .B sex.com
- except for messages that also contain the
- strings in the override list. Often it
- is desirable to override a pattern based on
- the name of the sender or
- recipient. For this reason, each override
- pattern is applied to the header and the command line as well
- as the section of the
- canonicalized input containing the matching data.
- Thus a pattern matching the command line or the header
- searches both the command line and the header
- for overrides while a match in the body searches
- the body, header and command line for overrides.
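- .PP
- As a hypothetical illustration of a sender-based override, the line
- .PP
- .EX
- *hold: unsubscribe now~~announce@lists.example.org
- .EE
- .PP
- would be expected to hold a message whose scanned text contains
- .B "unsubscribe now"
- unless the override string also appears in the section containing
- the match, in the header or on the command line, as it would if
- that address were the sender or a recipient.
- The address is invented for the example.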
- .PP
- The structure of the pattern file and the matching
- algorithm define the strategy for detecting
- and filtering unwanted messages. Ideally, a
- .B hold
- pattern selects a message for inspection and if it
- is determined to be undesirable, a specific
- .B dump
- pattern is added to delete further instances
- of the message. Additionally, it is often
- useful to block the sender by updating the
- .B smtpd
- control file.
- .PP
- In this regime, patterns with a
- .B dump
- action generally match phrases
- that are likely to be unique. Patterns that
- hold a message for inspection
- match phrases commonly found in undesirable material and
- occasionally in legitimate messages. Patterns
- that log matches are less specific still. In all
- cases, the ability to override a pattern by
- matching another string allows repetitive messages
- that trigger the pattern, such as mailing lists,
- to pass the filter after the first one is processed
- manually. The
- .B -s
- option allows deleted messages to be salvaged
- by either manual or semi-automatic review, supporting
- the specification of more aggressive patterns.
- Finally, the utility of the pattern matcher is not
- confined to filtering spam; it is a generally useful
- administrative tool for deleting inadvertently harmful
- messages, such as those caused by mail loops, stuck senders or viruses.
- It is also useful for collecting or counting messages
- matching certain criteria.
- .SH FILES
- .TF /mail/queue.dump/*
- .TP
- .B /mail/lib/patterns
- default pattern file
- .TP
- .B /sys/log/smtpd
- log of deleted messages
- .TP
- .B /mail/log/lines
- file where
- .I log
- matches are logged
- .TP
- .B /mail/queue/*
- directories where legitimate messages are queued for delivery
- .TP
- .B /mail/queue.hold
- directory where held messages are queued for inspection
- .TP
- .B /mail/queue.dump/*
- directories where
- .I dumped
- messages are stored when the
- .B -s
- command line option is specified.
- .TP
- .B /mail/copy/*
- directory where copies of all incoming messages
- are stored.
- .SH SOURCE
- .TP
- .B /sys/src/cmd/upas/scanmail
- .SH "SEE ALSO"
- .IR mail (1),
- .IR qer (8),
- .IR smtpd (6)
- .SH BUGS
- .I Testscan
- does not report a match when the body of a message
- contains exactly one line.