.TH SCANMAIL 8
.SH NAME
scanmail, testscan \- spam filters
.SH SYNOPSIS
.B upas/scanmail
[
.I options
]
[
.I qer-args
]
.I root
.B mail
.I sender system rcpt-list
.PP
.B upas/testscan
[
.B -avd
]
[
.B -p
.I patfile
]
[
.I filename
]
.SH DESCRIPTION
.B Scanmail
accepts a mail message supplied on standard input,
applies a file of patterns to a portion of it,
and dispatches
the message based
on the results.
It is a drop-in replacement for the
generic queuing command
.IR qer (8)
that is executed from the
.IR rc (1)
script
.B /mail/lib/qmail
in the mail processing pipeline.
Associated with each pattern is an
.IR action ;
the actions are, in order of decreasing priority:
.in +5
.TP 10
.B dump
the message is deleted and a log entry is written to
.BR /sys/log/smtpd .
.TP 10
.B hold
the message is placed in a queue for human inspection.
.TP
.B log
a line containing the matching portion of the message is written to a log.
.in -5
.PP
If no pattern matches or only patterns with an action of
.B log
match, the message is accepted and
.I scanmail
queues the message for delivery.
.I Scanmail
meshes with the blocking facilities
of
.IR smtpd (6)
to provide several layers of
filtering on gateway systems. In all cases the sender
is notified that the message has been successfully
delivered,
leaving the sender unaware that the message may have been delayed or deleted.
.PP
.I Scanmail
accepts the arguments of
.IR qer (8)
as well as the following:
.TF filename
.TP
.B -c
Save a copy of each message in a
randomly-named file in
directory
.BR /mail/copy .
.TP
.B -d
Write debugging information to standard error.
.TP
.B -h
Queue
.I held
messages by sending domain name.
The
.B -q
option must specify a root directory; messages
are queued in subdirectories of this directory.
If the
.B -h
option is not specified,
messages are accumulated in a subdirectory of
.B /mail/queue.hold
named for the contents of
.BR /dev/user ,
usually
.BR none .
.TF filename
.TP
.B -n
Never hold messages for inspection; deliver them instead. This is also known as
.IR "vacation mode" .
.TP
.BI -p " filename"
Read the patterns from
.I filename
rather than
.BR /mail/lib/patterns .
.TP
.BI -q " holdroot"
Queue deliverable messages in subdirectories of
.IR holdroot .
This option is the same as the
.B -q
option of
.IR qer (8)
and must be present if the
.B -h
option is given.
.TP
.B -s
Save deleted
messages. Messages are stored, one per randomly-named file,
in subdirectories of
.B /mail/queue.dump
named with the date.
.TP
.B -t
Test mode. The pattern matcher is applied but the message is
discarded and the result is not logged.
.TP
.B -v
Print the highest-priority match.
This is useful
with the
.B -t
option for testing the pattern matcher without actually
sending a message.
.PD
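.PP
As an illustration only, a test invocation combining the
.B -t
and
.B -v
options might look like the following sketch.
The pattern file, message file, and address arguments are hypothetical,
and the argument order is assumed from the synopsis above:
.EX
# hypothetical paths and addresses; in test mode nothing is queued or logged
upas/scanmail -t -v -p /tmp/patterns /mail/queue mail sender@example.com example.com rcpt < /tmp/msg
.EE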
.PP
.I Testscan
is the command line version of
.IR scanmail .
If
.I filename
is missing, it applies the pattern set to
the message on standard input. Unlike
.IR scanmail ,
which finds the highest-priority match,
.I testscan
prints all matches in the portion of the message under test.
It is useful for testing a pattern set or
implementing a personal filter
using the
.B pipeto
file in a user's mail directory.
.I Testscan
accepts the following options:
.TP
.B -a
Print matches in the complete input message.
.TP
.B -d
Enable debug mode.
.TP
.B -v
Print the message after conversion to canonical form
.RI ( q.v. ).
.TP
.BI -p " filename"
Read the patterns from
.I filename
rather than
.BR /mail/lib/patterns .
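.PP
For example, assuming a candidate pattern file in
.B /tmp/patterns
and a saved message in
.B /tmp/msg
(both names are hypothetical), a pattern set could be exercised as follows:
.EX
# report matches in the default portion of the message
upas/testscan -p /tmp/patterns /tmp/msg
# report matches in the complete message, read from standard input
upas/testscan -a -p /tmp/patterns < /tmp/msg
.EE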
.SS Canonicalization
Before pattern matching, both programs convert a portion of
the message header and the beginning of the
message to a canonical form. The amounts of the header
and message body processed are set by
compile-time parameters in the source files.
The canonicalization process converts letters to lower-case and
replaces consecutive spaces, tabs and newline characters
with a single space. HTML commands are
deleted except for the parameters following
.B A
.BR HREF ,
.B IMG
.BR SRC ,
and
.B IMG
.B BORDER
directives. Additionally, the following MIME escape sequences
are replaced by their ASCII
equivalents:
.PP
.EX
Escape Seq    ASCII
----------    -----
=2e           .
=2f           /
=20           <space>
=3d           =
.EE
and the sequence
.I =<newline>
is elided.
.I Scanmail
assembles the sender, destination domain and recipient fields of
the command line into a string that is
subjected to the same canonical processing.
Following canonicalization, the command line and
the two long strings containing
the header and the message body are passed to the
matching engine for analysis.
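.PP
As a sketch of the rules above, and assuming they are applied exactly as described,
an input fragment such as
.EX
Click   HERE for a FREE=20offer
at www=2eexample=2ecom
.EE
would be reduced to the canonical string
.EX
click here for a free offer at www.example.com
.EE
The fragment is invented for illustration; the output of
.B "testscan -v"
shows the canonical form actually produced for a given message.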
.SS Pattern Syntax
The matching engine compiles the pattern set
and matches it to each canonicalized input string.
Patterns are specified one per line
as follows:
.PP
.EX
{*}\fIaction\fP: \fIpattern-spec\fP {~~\fIoverride\fP...~~\fIoverride\fP}
.EE
.PP
On all lines, a
.B #
introduces a comment; there is no way to escape this character.
.PP
Lines beginning with
.B *
contain a
.I pattern-spec
that is a string; otherwise, the
.I pattern-spec
is a regular expression in the style of
.IR regexp (6).
Regular expression matching is many
times less efficient than string matching, so it is
wiser to enumerate several similar strings
than to combine them into a regular expression.
The
.I action
is a keyword terminated by a
.B :
and separated from the pattern by optional white-space.
It must be one of the following:
.TP 10
.B dump
if the pattern matches, the message is deleted. If the
.B -s
command line option is set, the message is saved.
.TP 10
.B hold
if the pattern matches, the message is queued in a subdirectory
of
.B /mail/queue.hold
for manual inspection. After inspection, the queue can be swept
manually using
.B runq
(see
.IR qer (8))
to deliver messages that were inadvertently matched.
.TP 10
.B header
this is the same as the
.B hold
action, except the pattern is only applied to the message header.
This optimization is useful for patterns that match header fields
that are unlikely to be present in the body of the message.
.TP 10
.B line
the sender and a section of the message around the match are written to
the file
.BR /sys/log/lines .
The message is always delivered.
.TP 10
.B loff
patterns of this type are applied only to the canonicalized command line.
When a match occurs, all patterns with
.B line
actions are disabled. This is useful for limiting
the size of the log file by excluding repetitive messages, such
as those from mailing lists.
.PP
Patterns are accumulated into pattern sets sharing the same action.
The matching engine applies the
.B dump
pattern set first, then the
.B header
and
.B hold
pattern sets, and finally the
.B line
pattern set. Each pattern set is applied three times:
to the canonicalized command line, to the message header, and
finally to the message body. The ordering of patterns
in the pattern file is insignificant.
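.PP
The following fragment is a minimal sketch of a pattern file using each action.
The patterns themselves are invented for illustration, and they are written
in lower case on the assumption that they are compared against canonicalized text:
.EX
# string patterns begin with *
*dump: make money fast from your own home
*hold: free offer
*header: undisclosed-recipients
# no leading *, so this is a regular expression
line: (cheap|discount) watches
# suppress line logging for a mailing list
*loff: owner-9fans
.EE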
.PP
The
.I pattern-spec
is a string of characters terminated by a
.BR newline ,
.B #
or override indicator,
.BR ~~ .
Trailing white-space is deleted but
patterns containing leading or trailing white-space can
be enclosed in double-quote
characters. A pattern containing a double-quote
must be enclosed in double-quote
characters, and each embedded double-quote must be preceded by a backslash.
For example, the pattern
.PP
.EX
"this is not \\"spam\\""
.EE
.PP
matches the string \fLthis is not "spam"\fP.
The
.I pattern-spec
is followed by zero or more
.I override
strings. When the pattern matches,
each override is applied and
if one matches, it cancels the effect of the pattern.
Overrides must be strings; regular expressions are not supported.
Each override is introduced by the string
.BR ~~
and continues until a subsequent
.BR ~~ ,
.B #
or
.BR newline ,
white-space included.
A
.B ~~
immediately followed by a
.B newline
indicates a line continuation and further overrides continue
on the following line.
Leading white-space
on the continuation line is ignored. For example,
.PP
.EX
*hold: sex.com~~essex.com~~sussex.com~~sysex.com~~
lasex.com~~cse.psu.edu!owner-9fans
.EE
.PP
matches all input containing the string
.B sex.com
except for messages that also contain any of the
strings in the override list. Often it
is desirable to override a pattern based on
the name of the sender or
recipient. For this reason, each override
pattern is applied to the header and the command line as well
as the section of the
canonicalized input containing the matching data.
Thus a pattern matching the command line or the header
searches both the command line and the header
for overrides while a match in the body searches
the body, header and command line for overrides.
.PP
The structure of the pattern file and the matching
algorithm define the strategy for detecting
and filtering unwanted messages. Ideally, a
.B hold
pattern selects a message for inspection and if it
is determined to be undesirable, a specific
.B dump
pattern is added to delete further instances
of the message. Additionally, it is often
useful to block the sender by updating the
.B smtpd
control file.
.PP
In this regime, patterns with a
.B dump
action generally match phrases
that are likely to be unique. Patterns that
hold a message for inspection
match phrases commonly found in undesirable material and
occasionally in legitimate messages. Patterns
that log matches are less specific still. In all
cases the ability to override a pattern by
matching another string allows repetitive messages
that trigger the pattern, such as mailing lists,
to pass the filter after the first one is processed
manually. The
.B -s
option allows deleted messages to be salvaged
by either manual or semi-automatic review, supporting
the specification of more aggressive patterns.
Finally, the utility of the pattern matcher is not
confined to filtering spam; it is a generally useful
administrative tool for deleting inadvertently harmful
messages, for example, mail loops, stuck senders or viruses.
It is also useful for collecting or counting messages
matching certain criteria.
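.PP
As a hypothetical illustration of this strategy, a pattern file might evolve as follows.
A broad
.B hold
pattern first catches a suspect phrase; after inspection, a longer and more specific
.B dump
pattern is added, and an override lets a legitimate mailing list through.
All strings shown are invented:
.EX
# broad phrase, held for inspection; mail from the list is exempt
*hold: lose weight~~owner-diet-list
# exact phrase seen in inspected spam, deleted outright
*dump: lose weight fast with the miracle plan
.EE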
.SH FILES
.TF /mail/queue.dump/*
.TP
.B /mail/lib/patterns
default pattern file
.TP
.B /sys/log/smtpd
log of deleted messages
.TP
.B /mail/log/lines
file where
.I log
matches are logged
.TP
.B /mail/queue/*
directories where legitimate messages are queued for delivery
.TP
.B /mail/queue.hold
directory where held messages are queued for inspection
.TP
.B /mail/queue.dump/*
directory where
.I dumped
messages are stored when the
.B -s
command line option is specified.
.TP
.B /mail/copy/*
directory where copies of all incoming messages
are stored.
.SH SOURCE
.TP
.B /sys/src/cmd/upas/scanmail
.SH "SEE ALSO"
.IR mail (1),
.IR qer (8),
.IR smtpd (6)
.SH BUGS
.I Testscan
does not report a match when the body of a message
contains exactly one line.
  445. .I Testscan
  446. does not report a match when the body of a message
  447. contains exactly one line.