manual page for spamprobe
NAME
spamprobe - a Bayesian probability spam analysis engine
SYNOPSIS
spamprobe [options] [filename...]
DESCRIPTION
Welcome to SpamProbe! Are you tired of the constant
bombardment of your inbox by unwanted email pushing everything
from porn to get-rich-quick schemes? Have you tried other
spam filters but become disenchanted with them when you
realized that their manually generated rule sets weren't
updated fast enough to keep up with spammers' wording
changes? Or that they generated unwanted false positives?
SpamProbe operates on a different basis entirely. Instead
of using pattern matching and a set of human-generated rules,
SpamProbe relies on a Bayesian analysis of the frequency of
words used in spam and non-spam emails received by an indi-
vidual person. The process is completely automatic and
tailors itself to the kinds of emails that each person
receives.
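The idea can be sketched in a few lines of Python. This is
only an illustration based on the formulas in Paul Graham's
article, not SpamProbe's actual code. The function names, the
neutral value of 0.5 for unseen terms, and the omission of
Graham's doubling of good counts are assumptions made here
for brevity. It needs Python 3.8 or later for math.prod.

```python
from math import prod


def term_probability(good_count, spam_count, total_good, total_spam):
    """Estimate how spammy one term is, clamped to [0.01, 0.99]."""
    good_freq = min(1.0, good_count / max(total_good, 1))
    spam_freq = min(1.0, spam_count / max(total_spam, 1))
    if good_freq + spam_freq == 0:
        return 0.5  # assumption: unseen terms are treated as neutral
    p = spam_freq / (good_freq + spam_freq)
    return max(0.01, min(0.99, p))


def combine(probabilities):
    """Combine per-term probabilities into one score between 0 and 1."""
    spammy = prod(probabilities)
    good = prod(1.0 - p for p in probabilities)
    return spammy / (spammy + good)
```

A term seen only in spam gets the maximum value of 0.99, and
a message made up of such terms scores very close to 1.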
My work on SpamProbe was inspired by an excellent article by
Paul Graham. He describes the basic idea and his results.
You can read his article here:
http://www.paulgraham.com/spam.html
I highly recommend reading the article and the other spam
related links on his site for excellent insights into why
spam is a problem and how you can defeat it. Of course run-
ning SpamProbe is an excellent step! :-)
FEATURES
* Spam detection using Bayesian analysis of terms con-
tained in each email. Words used often in spams but
not in good email tend to indicate that a message is
spam.
* Written in C++ for good performance. Database access
using BerkeleyDB for quick startup and fast term count
retrieval.
* Recognition and decoding of MIME attachments in
quoted-printable and base64 encoding. Automatically
skips non-text attachments. MIME decoding enables
SpamProbe to make decisions based on words in the
emails rather than base64 gobbledygook.
* Counts two word phrases as well as single words for
higher precision.
* Ignores HTML tags in emails for scoring purposes unless
the -h command line option is used. Many spams use
HTML and few humans do, so HTML tends to become a powerful
recognizer of spams. However, in the author's opinion
this also substantially increases the likelihood of
false positives if someone does send a non-spam email
containing HTML tags. SpamProbe does, however, pull URLs
from inside HTML tags, since those tend to be
spammer specific.
* Locks mboxes and databases using fcntl file locking to
avoid problems when multiple emails arrive simultane-
ously.
* Scores only the Received, Subject, To, From, and Cc
headers. All other headers are ignored to make it hard
for spammers to hide non-spammy words in X- headers to
fool the filter. The -H command line option can be
used to override this.
* Supports Content-Length: field in mbox headers. This
can be disabled using the -Y option to use only From_
lines to recognize new messages.
* Uses MD5 hash of emails to recognize reclassification
of an already classified spam to avoid distortion of
the word counts if emails are reclassified. This way
emails can be kept in an mbox that is repeatedly
scanned by spamprobe without counting them more than
once.
* Provides a time stamp based database cleanup command to
remove terms from the database if their counts never
rise above a certain threshold value (normally 2).
Tends to limit the otherwise unbounded growth of the
database as each message adds some unique terms. Also
provides a purge command to remove all terms with
counts below a specified minimum no matter what their
age.
* The -X option uses a more aggressive scoring algorithm
which uses all significantly good or spammy terms when
scoring a message and also gives more weight to within
message frequency of terms. This method improves
recall and works best when the database contains a few
thousand messages.
* The edit-term command allows users to directly modify
the counts of individual terms, for example to force a
particular term to be considered spammy or good.
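As an illustration of counting two word phrases alongside
single words, here is a minimal Python sketch. The regular
expression used to split words and the function name are
assumptions; SpamProbe's real tokenizer is not described in
this page.

```python
import re


def terms(text):
    """Return single words followed by adjacent two-word phrases."""
    # Assumption: words are runs of letters, digits, and apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    phrases = [f"{first} {second}" for first, second in zip(words, words[1:])]
    return words + phrases
```

For the text "Buy cheap pills" this yields the three words
plus the phrases "buy cheap" and "cheap pills".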
OPTIONS
-a char
By default SpamProbe converts non-ascii characters
(characters with the most significant bit set to 1)
into the letter 'z'. This is useful for lumping all
Asian characters into a single word for easy recogni-
tion. The -a option allows you to change the character
to something else if you don't like the letter 'z' for
some reason.
-c Tells spamprobe to create the database directory if it
does not already exist. Normally spamprobe exits with
a usage error if the database directory does not
already exist.
-d directory
By default SpamProbe stores its database in a directory
named .spamprobe in your home directory. The -d option
allows you to specify a different directory to use. This
is necessary if your home directory is NFS mounted, for
example.
-D directory
Tells SpamProbe to use the database in the specified
directory (must be different than the one specified
with the -d option) as a shared database from which to
draw terms that are not defined in the user's own data-
base. This can be used to provide a baseline database
shared by all users on a system (in the -D directory)
and a private database unique to each user of the sys-
tem ($HOME/.spamprobe or -d directory).
-g field_name
Tells SpamProbe which header field holds the previous
score and message digest. The default is X-SpamProbe.
The field name is not case sensitive. Used by all
commands except receive.
-h By default SpamProbe removes HTML markup from the text
in emails to help avoid false positives. The -h option
allows you to override this behavior and force Spam-
Probe to include words from within HTML tags in its
word counts. Note that SpamProbe always counts any
URLs in hrefs within tags whether -h is used or not.
Use of this option is discouraged. It can increase the
rate of spam detection slightly, but unless the user
receives a significant number of HTML emails it also
tends to increase the number of false positives.
-H option
By default SpamProbe only scans a meaningful subset of
headers from the email message when searching for words
to score. The -H option allows the user to specify
additional headers to scan. Legal values are "all",
"nox", "none", or "normal". "all" scans all headers,
"nox" scans all headers except those starting with X-,
"none" does not scan headers, and "normal" scans the
normal set of headers.
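The four -H values can be pictured with a short Python
sketch. The normal set is the one listed under FEATURES
(Received, Subject, To, From, and Cc). The function name and
the use of Python's email module are assumptions for
illustration only.

```python
from email import message_from_string

NORMAL_HEADERS = {"received", "subject", "to", "from", "cc"}


def headers_to_scan(message, mode="normal"):
    """Return the (name, value) header pairs selected by a -H style mode."""
    if mode not in ("all", "nox", "none", "normal"):
        raise ValueError(f"unknown mode: {mode}")
    if mode == "none":
        return []
    selected = []
    for name, value in message.items():
        lower = name.lower()
        if mode == "normal" and lower not in NORMAL_HEADERS:
            continue  # only the normal subset is scanned
        if mode == "nox" and lower.startswith("x-"):
            continue  # skip X- headers
        selected.append((name, value))
    return selected
```

With "nox", a Date header would be scanned but an X-Mailer
header would not.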
-m Forces SpamProbe to use mbox format for reading emails
in receive mode. Normally SpamProbe assumes that the
input to receive mode contains a single message so it
doesn't look for message breaks.
-M Forces SpamProbe to treat the entire input as a single
message. This ignores From lines and Content-Length
headers in the input. Convenient when using maildir
format.
-r number
Changes the number of times that a single word/phrase
can occur in the top words array used to calculate the
score for each message. Allowing repeats reduces the
number of words overall (since a single word occupies
more than one slot) but allows words which occur fre-
quently in the message to have a higher weight. Gen-
erally this is changed only for optimization purposes.
-s number
SpamProbe maintains an in memory cache of the words it
has seen in previous messages to reduce disk i/o and
improve performance. By default the cache is flushed
and cleared every 250 messages. This number can be
changed using the -s option. A value of zero causes
SpamProbe to use 100,000 as the limit which effectively
means that the cache will only be flushed at program
exit (unless you have really enormous mailbox files).
The cache doesn't affect receive, dump, or export but
has a significant impact on the others.
-T Causes SpamProbe to write out the top terms associated
with each message in addition to its normal output.
Works with find-good, find-spam, and score.
-v Tells SpamProbe to write debugging information to
stderr. This can be useful for debugging or for seeing
which terms SpamProbe used to score each email.
-V Prints version and copyright information and then
exits.
-w number
Changes the number of most significant words/phrases
used by SpamProbe to calculate the score for each
message. Generally this is changed only for optimiza-
tion purposes.
-x Normally SpamProbe uses only a fixed number of top
terms (as set by the -w command line option) when scor-
ing emails. The -x option can be used to allow the
array to be extended past the max size if more terms
are available with probabilities <= 0.1 or >= 0.9.
-X An interesting variation on the scoring settings.
Equivalent to using "-w5 -r5 -x" so that generally only
words with probabilities <= 0.1 or >= 0.9 are used and
word frequencies in the email count heavily towards the
score. Tests have shown that this setting tends to be
safer (fewer false positives) and have higher recall
(proper classification of spams previously scored as
spam) although its predictive power isn't quite as good
as the default settings. WARNING: This setting might
work best with a fairly large corpus. It has not been
tested with a small corpus so it might be very
inaccurate with fewer than 1000 total messages.
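How -w, -r, and -x interact can be sketched in Python. This
is one possible reading of the descriptions above and not
SpamProbe's implementation. The function and parameter names
are hypothetical, and ranking terms by their distance from
0.5 is an assumption.

```python
def select_terms(probs, counts, max_terms, max_repeats, extend=False):
    """Pick the probabilities used to score one message.

    probs maps term -> spam probability, counts maps term -> number
    of occurrences in the message.
    """
    # Assumption: "most significant" means farthest from a neutral 0.5.
    ranked = sorted(probs, key=lambda t: abs(probs[t] - 0.5), reverse=True)
    selected = []
    for term in ranked:
        p = probs[term]
        # A term may fill several slots, up to max_repeats (like -r).
        for _ in range(min(counts.get(term, 1), max_repeats)):
            if len(selected) < max_terms:          # like -w
                selected.append(p)
            elif extend and (p <= 0.1 or p >= 0.9):  # like -x
                selected.append(p)
    return selected
```

Calling select_terms(probs, counts, 5, 5, extend=True)
corresponds to the "-w5 -r5 -x" combination described for
-X.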
-Y Assume traditional Berkeley mailbox format, ignoring
any Content-Length: fields.
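Recognizing messages by From_ lines alone can be sketched as
follows. This is a simplified illustration in Python, not
SpamProbe's parser. It treats every line that begins with
"From " as the start of a new message and does not check for
the preceding blank line that stricter mbox readers require.

```python
def split_mbox(text):
    """Split mbox text into messages at lines beginning with 'From '."""
    messages, current = [], []
    for line in text.splitlines(keepends=True):
        if line.startswith("From ") and current:
            # A new From_ line closes the message collected so far.
            messages.append("".join(current))
            current = []
        current.append(line)
    if current:
        messages.append("".join(current))
    return messages
```

Treating the whole input as one message, as -M does, would
correspond to skipping the split entirely.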
-7 Tells SpamProbe to ignore any characters with the most
significant bit set to 1 instead of mapping them to the
letter 'z'.
-8 Tells SpamProbe to store all characters even if their
most significant bit is set to 1.
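The three treatments of characters with the most significant
bit set (the default mapping with -a, plus -7 and -8) can be
sketched in Python. The function name and the mode names
"replace", "drop", and "keep" are hypothetical labels chosen
for this example.

```python
def map_high_bit(data, mode="replace", replacement=b"z"):
    """Handle bytes >= 0x80 in one of three ways.

    replace: substitute a fixed character (default behavior, see -a)
    drop:    ignore the byte (like -7)
    keep:    store the byte unchanged (like -8)
    """
    if mode not in ("replace", "drop", "keep"):
        raise ValueError(f"unknown mode: {mode}")
    out = bytearray()
    for byte in data:
        if byte < 0x80 or mode == "keep":
            out.append(byte)
        elif mode == "replace":
            out += replacement
        # mode == "drop": the byte is skipped
    return bytes(out)
```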
SpamProbe recognizes the following commands:
receive [filename...]
Tells SpamProbe to read its standard input (or a file
specified after the receive command) and score it using
the current databases. Once the message has been
scored the message is classified as either spam or
non-spam and its word counts are written to the
appropriate database. A single word (SPAM or GOOD), the
message's score, and its digest are written to stdout.
For example:
SPAM 0.99 595f0150587edd7b395691964069d7af
or
GOOD 0.02 595f0150587edd7b395691964069d7af
The string of numbers and letters after the score is
the message's "digest", a 32-character hexadecimal MD5
hash that identifies the message. The digest is used by
SpamProbe to recognize messages that it has processed
previously so that it can keep its word counts
consistent if the message is reclassified.
Using the -T option additionally lists the terms used
to produce the score along with their counts (number of
times they were found in the message).
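A script that consumes this output might parse the first
line as shown in the Python sketch below. The Verdict type
and the function name are hypothetical. Only the three
fields shown in the examples above are assumed, so any extra
lines printed by -T are not handled here.

```python
from typing import NamedTuple


class Verdict(NamedTuple):
    label: str    # "SPAM" or "GOOD"
    score: float  # between 0 and 1
    digest: str   # 32 hexadecimal characters


def parse_verdict(line):
    """Parse a line such as 'SPAM 0.99 595f0150587edd7b395691964069d7af'."""
    label, score, digest = line.split()[:3]
    if label not in ("SPAM", "GOOD"):
        raise ValueError(f"unexpected label: {label}")
    return Verdict(label, float(score), digest)
```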
score [filename...]
Similar to receive except that the database is not
modified in any way.
summarize [filename...]
Similar to score except that it prints a short summary
and score for each message. This can be useful when
testing. Using the -T option additionally lists the
terms used to produce the score along with their counts
(number of times they were found in the message).
find-spam [filename...]
Similar to score except that it prints a short summary
and score for each message that is determined to be
spam. This can be useful when testing. Using the -T
option additionally lists the terms used to produce the
score along with their counts (number of times they
were found in the message).
find-good [filename...]
Similar to score except that it prints a short summary
and score for each message that is determined to be
good. This can be useful when testing. Using the -T
option additionally lists the terms used to produce the
score along with their counts (number of times they
were found in the message).
good [filename...]
Scans each file (or stdin if no file is specified) and
reclassifies every email in the file as non-spam. The
databases are updated appropriately. Messages previ-
ously classified as good (recognized using their MD5
digest or message ids) are ignored. Messages previ-
ously classified as spam are reclassified as good.
spam [filename...]
Scans each file (or stdin if no file is specified) and
reclassifies every email in the file as spam. The
databases are updated appropriately. Messages
previously classified as spam (recognized using their MD5
digest or message ids) are ignored. Messages
previously classified as good are reclassified as spam.
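The bookkeeping behind the good and spam commands can be
pictured with a small in-memory model in Python. The class
and method names are hypothetical, and hashing the whole raw
message is an assumption, since this page does not say
exactly which bytes SpamProbe feeds to MD5.

```python
import hashlib
from collections import Counter


class TinyStore:
    """Toy word-count store that counts each message at most once."""

    def __init__(self):
        self.counts = {"good": Counter(), "spam": Counter()}
        self.seen = {}  # digest -> label the message was counted under

    def classify(self, raw_message, words, label):
        digest = hashlib.md5(raw_message).hexdigest()
        previous = self.seen.get(digest)
        if previous == label:
            return digest  # already counted this way: ignore
        if previous is not None:
            # Reclassification: take the words back out of the old side.
            self.counts[previous].subtract(words)
        self.counts[label].update(words)
        self.seen[digest] = label
        return digest
```

Classifying the same message as good twice leaves the counts
unchanged the second time, and reclassifying it as spam
moves its words from one side to the other.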
remove [filename...]
Scans each file (or stdin if no file is specified) and
removes its term counts from the database. Messages
which are not in the database (recognized using their
MD5 digest or message ids) are ignored.
cleanup [junk_count [max_age]]
Scans the database and removes all terms with a count of
junk_count or less (default 2) which have not had their
counts modified in at least max_age days (default 7).
This should be run periodically to keep the database
from growing endlessly.
For my own email I use cron to run the cleanup command
every day and delete all terms with count of 4 or less
that have not been modified in two weeks. Here is the
excerpt from my crontab:
3 0 * * * /home/brian/bin/spamprobe cleanup 4 14
Because of the way that BerkeleyDB works the database
file will not actually shrink, but newly added terms
will be able to use the space previously occupied by
any removed terms so that the file's growth should be
significantly slower if cleanup is run regularly. To
actually shrink the database you can build a new one
using the BerkeleyDB utility programs db_dump and
db_load. For example:
cd ~/.spamprobe
db_dump sp_words | db_load sp_words.new
mv sp_words sp_words.old
mv sp_words.new sp_words
This command does nothing for GDBM databases.
purge [junk_count]
Similar to cleanup but forces the immediate deletion of
all terms with total count less than junk_count
(default is 2) no matter how long it has been since
they were modified (i.e. even if they were just added
today). This could be handy immediately after classify-
ing a large mailbox of historical spam or good email to
make room for the next batch. This command does noth-
ing for GDBM databases.
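The difference between the cleanup and purge rules can be
sketched in Python over a plain dictionary. The record
layout (good count, spam count, date last modified) and the
function names are assumptions for illustration. Using the
sum of the good and spam counts for cleanup is also an
assumption, although purge is described above in terms of
the total count.

```python
from datetime import date, timedelta


def cleanup(db, today, junk_count=2, max_age=7):
    """Drop terms with count <= junk_count unmodified for max_age days."""
    cutoff = today - timedelta(days=max_age)
    return {
        term: (good, spam, modified)
        for term, (good, spam, modified) in db.items()
        if not (good + spam <= junk_count and modified <= cutoff)
    }


def purge(db, junk_count=2):
    """Drop terms with total count < junk_count regardless of age."""
    return {
        term: record
        for term, record in db.items()
        if record[0] + record[1] >= junk_count
    }
```

Note that, as described above, cleanup removes counts equal
to junk_count while purge keeps them.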
edit-term term good_count spam_count
Can be used to specifically set the good and spam
counts of a term. Whether this is truly useful is
doubtful but it is provided for completeness' sake. For
example it could be used to force a particular word to
be very spammy or very good:
spamprobe edit-term nigeria 0 1000000
spamprobe edit-term burton 10000000 0
dump Prints the contents of the word counts database one
word per line in human readable format with good count,
spam count, and word in columns separated by
whitespace. Note that when using GDBM for the database
the words are printed in the order they are hashed so
the results will need to be sorted to be most useful.
BerkeleyDB sorts terms alphabetically. The standard unix
sort command can be used to sort the terms as desired. For
example to list all words from "most good" to "least
good" use this command:
spamprobe dump | sort -k 1 -n -r
To list all words from "most spammy" to "least spammy"
use this command:
spamprobe dump | sort -k 2 -n -r
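The same output can also be processed in a script. The
Python sketch below assumes exactly the three whitespace
separated columns described above (good count, spam count,
word) and allows the last column to contain a space, since
two word phrases are stored as terms. The function name is
hypothetical.

```python
def parse_dump(lines):
    """Turn dump lines into (good_count, spam_count, term) tuples."""
    rows = []
    for line in lines:
        fields = line.split(None, 2)  # keep spaces inside the term
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        good, spam, term = fields
        rows.append((int(good), int(spam), term.strip()))
    return rows
```

Sorting the result with
sorted(rows, key=lambda r: r[1], reverse=True) lists the
terms from most spammy to least spammy.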
export
Similar to the dump command but prints the counts and
words in a comma separated format with the words sur-
rounded by double quotes. This can be more useful for
importing into some databases.
import filename
Reads the specified file, which must contain
data written by the export command. The terms and
counts from this file are added to the database. This
can be used to convert a database from a prior version.
FILES
~/.spamprobe
SEE ALSO
procmail, formail
BUGS
GETTING STARTED
SpamProbe is not a stand alone mail filter. It doesn't sort
your mail or split it into different mailboxes. Instead it
relies on some other program such as procmail to actually
file your mail for you. What SpamProbe does do is track the
word counts in good and spam emails and generate a score for
each email that indicates whether or not it is likely to be
spam. Scores range from 0 to 1 with any score of 0.9 or
higher indicating a probable spam.
Personally I use SpamProbe with procmail to filter my
incoming email into mailboxes. I have procmail score each
inbound email using SpamProbe and insert a special header
into each email containing its score. Then I have procmail
move spams into a special mailbox.
No spam filter is perfect and SpamProbe sometimes makes mis-
takes. To correct those mistakes I have a special mailbox
that I put undetected spams into. I run SpamProbe periodi-
cally and have it reclassify any emails in that mailbox as
spam so that it will make a better guess the next time
around.
This is not a procmail primer. You will need to ensure that
you have procmail and formail installed before you can use
this technique. Also I recommend that you read the procmail
documentation so that you can fully understand this example
and adapt it to your own needs. That having been said, my
.procmailrc file looks like this:
MAILDIR=$HOME/IMAP

:0 c
saved

:0
SCORE=| /home/brian/bin/spamprobe receive

:0 wf
| formail -I "X-SpamProbe: $SCORE"

:0 a:
*^X-SpamProbe: SPAM
spamprobe
I use IMAP to fetch my email so my mailboxes all live in a
directory named IMAP on my mail server. The first stanza
copies all incoming emails into a special mbox called saved.
SpamProbe IS BETA SOFTWARE and though it works well for me
it is possible that it could somehow lose emails. Caution
is always a good idea...
The second stanza runs spamprobe in "receive" mode. In that
mode SpamProbe scores the email and then classifies it as
either spam or good based on the score. It automatically
adds the word counts for the email to the appropriate data-
base. This is essentially like running in score mode fol-
lowed immediately by either spam or good mode.
The next stanza runs formail to add a custom header to the
email containing the SpamProbe score. The final stanza uses
the contents of the custom header to file detected spams
into a special mbox named spamprobe.
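For readers who do not use procmail, the header tagging and
filing decision can be sketched with Python's email module.
This is not part of SpamProbe. The function names are
hypothetical, and the verdict line is assumed to have the
form printed by the receive command.

```python
from email import message_from_string


def tag_message(raw, verdict_line, field="X-SpamProbe"):
    """Return the parsed message with the verdict stored in a header."""
    msg = message_from_string(raw)
    del msg[field]  # like formail -I: replace any existing field
    msg[field] = verdict_line.strip()
    return msg


def is_spam(msg, field="X-SpamProbe"):
    """Mirror the procmail test for a header value starting with SPAM."""
    return (msg.get(field) or "").startswith("SPAM")
```

A delivery script could then write the message to the spam
mailbox when is_spam returns True and to the inbox
otherwise.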
MAKING CORRECTIONS
SpamProbe is not perfect. It is able to detect over 90% of
the spams that I receive but some still slip through. To
correct these missed emails I run SpamProbe periodically and
have it scan a special mbox. Since I use IMAP to retrieve
my emails I can simply drop undetected spams into this mbox
from my mail client. If you use POP or some other system
then you will need to find a way to get the undetected spams
into an mbox that spamprobe can see.
Periodically I run a script that scans three special mboxes
to correct errors in judgment:
#!/bin/bash
# Quote the paths so a home directory containing spaces still works.
IMAPDIR="$HOME/IMAP"
spamprobe remove "$IMAPDIR/remove"
spamprobe spam "$IMAPDIR/spam"
spamprobe good "$IMAPDIR/nonspam"
From this example you can see that I use three special
mboxes. I copy emails that I don't want spamprobe to store
into the remove mbox. This is useful if you receive email
from a friend or colleague that looks like spam and you
don't want it to dilute the effectiveness of the terms it
contains.
Undetected spams go into the spam mbox. SpamProbe will
reclassify those emails as spam and correct its database
accordingly. Note that doing this does not guarantee that
the spam will always be scored as spam in the future. Some
spams are too bland to detect perfectly. Fortunately those
are very rare.
The nonspam mbox is for any false positives. These are
always possible and it is important to have a way to reclas-
sify them when they do occur.
Finally you'll need to build a starting database. Since
SpamProbe relies on word counts from past emails it requires
a decent sized database to be accurate. To build the data-
base select some of your mboxes containing past emails.
Ideally you should have one mbox of spams and one or more of
non-spams. If you don't have any spams handy then don't
worry, SpamProbe will gradually become more accurate as you
receive more spams. Expect a fairly high false negative
(i.e. missed spams) rate as you first start using SpamProbe.
To import your starting messages use commands such as these.
The example assumes that you have non-spams stored in a file
named mbox in your home directory and some spams stored in a
file named nasty-spams. Replace these names with real ones.
spamprobe good ~/mbox
spamprobe spam ~/nasty-spams
WARRANTY
SpamProbe works well for me. However please keep in mind
that there is NO WARRANTY at all with this software. Read
the QPL (LICENSE.txt) for details. YOU ASSUME ALL RISK when
using this software.
Be sure to visit the project page on SourceForge. There you
can submit bug reports or feature requests, read and post
messages on the forums, and download the latest version.
http://sourceforge.net/projects/spamprobe/
You can also join the spamprobe mailing list to discuss
issues with other SpamProbe users.
http://lists.sourceforge.net/lists/listinfo/spamprobe-users
Also feel free to contact me at
bburton@users.sourceforge.net with any suggestions for
improvements that you don't want to post to the forums.
LEGALESE
Burton Computer Corporation
http://www.burton-computer.com
http://www.cooldevtools.com
Copyright (C) 2002 Burton Computer Corporation
ALL RIGHTS RESERVED
This program is open source software; you can redistribute
it and/or modify it under the terms of the Q Public License
(QPL) version 1.0. Use of this software in whole or in part,
including linking it (modified or unmodified) into other
programs is subject to the terms of the QPL.
This program is distributed in the hope that it will be use-
ful, but WITHOUT ANY WARRANTY; without even the implied war-
ranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PUR-
POSE. See the Q Public License for more details.
You should have received a copy of the Q Public License
along with this program; see the file LICENSE.txt. If not,
visit the Burton Computer Corporation or CoolDevTools web
site QPL pages at:
http://www.burton-computer.com/qpl.html
http://www.cooldevtools.com/qpl.html