manual page for spamprobe

NAME
     spamprobe - a Bayesian probability spam analysis engine

SYNOPSIS
     spamprobe [options] command [filename...]

DESCRIPTION
     Welcome to SpamProbe!  Are you tired of the constant
     bombardment of your inbox by unwanted email pushing everything
     from porn to get-rich-quick schemes?  Have you tried other
     spam filters but become disenchanted with them when you
     realized that their manually generated rule sets weren't
     updated fast enough to keep up with spammers' wording
     changes?  Or that they generated unwanted false positives?

     SpamProbe operates on a different basis entirely.  Instead
     of using pattern matching and a set of human-generated rules,
     SpamProbe relies on a Bayesian analysis of the frequency of
     words used in the spam and non-spam emails received by an
     individual person.  The process is completely automatic and
     tailors itself to the kinds of emails that each person
     receives.

     My work on SpamProbe was inspired by an excellent article by
     Paul  Graham.   He describes the basic idea and his results.
     You can read his article here:

          http://www.paulgraham.com/spam.html

     I highly recommend reading the article and  the  other  spam
     related  links  on  his site for excellent insights into why
     spam is a problem and how you can defeat it.  Of course run-
     ning SpamProbe is an excellent step! :-)

  FEATURES
     *    Spam detection using Bayesian analysis  of  terms  con-
          tained  in  each  email.  Words used often in spams but
          not in good email tend to indicate that  a  message  is
          spam.

     *    Written in C++ for good performance.   Database  access
          using  BerkeleyDB for quick startup and fast term count
          retrieval.

     *    Recognition and decoding of MIME attachments in
          quoted-printable and base64 encoding.  Automatically
          skips non-text attachments.  MIME decoding enables
          SpamProbe to make decisions based on words in the
          emails rather than base64 gobbledygook.

     *    Counts two-word phrases as well as single words for
          higher precision.

     *    Ignores HTML tags in emails for scoring purposes unless
          the -h command line option is used.  Many spams use
          HTML and few humans do, so HTML tends to become a
          powerful recognizer of spams.  However, in the author's
          opinion this also substantially increases the likelihood
          of false positives if someone does send a non-spam email
          containing HTML tags.  SpamProbe does, however, pull URLs
          from inside HTML tags, since those tend to be
          spammer-specific.

     *    Locks mboxes and databases using fcntl file locking  to
          avoid  problems  when multiple emails arrive simultane-
          ously.

     *    Scores only the Received, Subject,  To,  From,  and  Cc
          headers.  All other headers are ignored to make it hard
          for spammers to hide non-spammy words in X- headers  to
          fool  the  filter.   The  -H command line option can be
          used to override this.

     *    Supports the Content-Length: field in mbox headers.  This
          can be disabled with the -Y option so that only From_
          lines are used to recognize new messages.

     *    Uses MD5 hash of emails to  recognize  reclassification
          of  an  already  classified spam to avoid distortion of
          the word counts if emails are reclassified.   This  way
          emails  can  be  kept  in  an  mbox  that is repeatedly
          scanned by spamprobe without counting  them  more  than
          once.

     *    Provides a time stamp based database cleanup command to
          remove  terms  from  the database if their counts never
          rise above a  certain  threshold  value  (normally  2).
          Tends  to  limit  the otherwise unbounded growth of the
          database as each message adds some unique terms.   Also
          provides  a  purge  command  to  remove  all terms with
          counts below a specified minimum no matter  what  their
          age.

     *    The -X option uses a more aggressive scoring  algorithm
          which  uses all significantly good or spammy terms when
          scoring a message and also gives more weight to  within
          message  frequency  of  terms.   This  method  improves
          recall and works best when the database contains a  few
          thousand messages.

     *    The edit-term command allows users to directly modify
          the counts of individual terms, for example to force a
          particular term to be considered spammy or good.

OPTIONS
     -a char
          By default SpamProbe converts non-ASCII characters
          (characters with the most significant bit set to 1)
          into the letter 'z'.  This is useful for lumping all
          Asian characters into a single word for easy
          recognition.  The -a option allows you to change the
          character to something else if you don't like the
          letter 'z' for some reason.

     -c   Tells spamprobe to create the database directory if  it
          does  not already exist.  Normally spamprobe exits with
          a usage  error  if  the  database  directory  does  not
          already exist.

     -d directory
          By default SpamProbe stores its database in a directory
          named .spamprobe in your home directory.  The -d option
          allows you to specify a different directory to use.
          This is necessary if your home directory is NFS mounted,
          for example.

     -D directory
          Tells SpamProbe to use the database in the specified
          directory (which must be different from the one
          specified with the -d option) as a shared database from
          which to draw terms that are not defined in the user's
          own database.  This can be used to provide a baseline
          database shared by all users on a system (in the -D
          directory) and a private database unique to each user
          of the system ($HOME/.spamprobe or the -d directory).

     -g field_name
          Tells SpamProbe which header to search for the previous
          score and message digest.  The default is X-SpamProbe.
          The field name is not case sensitive.  Used by all
          commands except receive.

     -h   By default SpamProbe removes HTML markup from the  text
          in emails to help avoid false positives.  The -h option
          allows you to override this behavior  and  force  Spam-
          Probe  to  include  words  from within HTML tags in its
          word counts.  Note that  SpamProbe  always  counts  any
          URLs  in  hrefs  within tags whether -h is used or not.
          Use of this option is discouraged.  It can increase the
          rate  of  spam  detection  slightly but unless the user
          receives a significant amount of HTML  emails  it  also
          tends to increase the number of false positives.

     -H option
          By default SpamProbe only scans a meaningful subset  of
          headers from the email message when searching for words
          to score.  The -H option allows  the  user  to  specify
          additional  headers  to  scan.  Legal values are "all",
          "nox", "none", or "normal".  "all" scans  all  headers,
          "nox"  scans all headers except those starting with X-,
          "none" does not scan headers, and  "normal"  scans  the
          normal set of headers.

     -m   Forces SpamProbe to use mbox format for reading  emails
          in  receive  mode.  Normally SpamProbe assumes that the
          input to receive mode contains a single message  so  it
          doesn't look for message breaks.

     -M   Forces SpamProbe to treat the entire input as a  single
          message.   This  ignores  From lines and Content-Length
          headers in the input.  Convenient  when  using  maildir
          format.

     -r number
          Changes the number of times that a  single  word/phrase
          can  occur in the top words array used to calculate the
          score for each message.  Allowing repeats  reduces  the
          number  of  words overall (since a single word occupies
          more than one slot) but allows words which  occur  fre-
          quently  in  the message to have a higher weight.  Gen-
          erally this is changed only for optimization purposes.

     -s number
          SpamProbe maintains an in-memory cache of the words it
          has seen in previous messages to reduce disk I/O and
          improve performance.  By default the cache is flushed
          and cleared every 250 messages.  This number can be
          changed using the -s option.  A value of zero causes
          SpamProbe to use 100,000 as the limit, which effectively
          means that the cache will only be flushed at program
          exit (unless you have really enormous mailbox files).
          The cache doesn't affect receive, dump, or export but
          has a significant impact on the other commands.

     -T   Causes SpamProbe to write out the top terms  associated
          with  each  message  in  addition to its normal output.
          Works with find-good, find-spam, and score.

     -v   Tells  SpamProbe  to  write  debugging  information  to
          stderr.  This can be useful for debugging or for seeing
          which terms SpamProbe used to score each email.

     -V   Prints  version  and  copyright  information  and  then
          exits.

     -w number
          Changes the number of  most  significant  words/phrases
          used  by  SpamProbe  to  calculate  the  score for each
          message.  Generally this is changed only for  optimiza-
          tion purposes.

     -x   Normally SpamProbe uses only  a  fixed  number  of  top
          terms (as set by the -w command line option) when scor-
          ing emails.  The -x option can be  used  to  allow  the
          array  to  be  extended past the max size if more terms
          are available with probabilities <= 0.1 or >= 0.9.
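
          As a rough illustration of how a fixed number of top
          terms might be chosen, the sketch below ranks terms by
          how far their spam probability lies from 0.5 and keeps
          the first N.  Measuring significance as distance from
          0.5 is an assumption borrowed from Paul Graham's
          article, not a statement of SpamProbe's actual
          algorithm.  The function name and the "probability
          term" input format are invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: keep the N most significant terms.
# stdin: lines of "probability term"; $1: number of slots (like -w).
# Assumption: significance = distance of the probability from 0.5.
top_terms() {
    LC_ALL=C awk '{
        d = $1 - 0.5
        if (d < 0) d = -d
        printf "%.4f %s %s\n", d, $1, $2
    }' |
        LC_ALL=C sort -k 1,1 -n -r |
        head -n "$1" |
        cut -d " " -f 2-
}

# Keep the two most significant of four sample terms.
printf '0.99 viagra\n0.50 the\n0.05 meeting\n0.70 offer\n' | top_terms 2
```

          With the sample input the two terms kept are the
          strongly spammy one (0.99) and the strongly good one
          (0.05); the neutral terms are dropped.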

     -X   An interesting variation on the scoring settings.
          Equivalent to using "-w5 -r5 -x" so that generally only
          words with probabilities <= 0.1 or >= 0.9 are used and
          word frequencies in the email count heavily towards the
          score.  Tests have shown that this setting tends to be
          safer (fewer false positives) and to have higher recall
          (proper classification of spams previously scored as
          spam), although its predictive power isn't quite as good
          as that of the default settings.  WARNING: This setting
          probably works best with a fairly large corpus.  It has
          not been tested with a small corpus, so it might be very
          inaccurate with fewer than 1000 total messages.

     -Y   Assume traditional Berkeley  mailbox  format,  ignoring
          any Content-Length: fields.

     -7   Tells SpamProbe to ignore any characters with the  most
          significant bit set to 1 instead of mapping them to the
          letter 'z'.

     -8   Tells SpamProbe to store all characters even  if  their
          most significant bit is set to 1.
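
     The character handling selected by -a and -7 can be pictured
     with the standard tr utility.  This is only a sketch of the
     described behaviour, assuming the mapping is applied byte by
     byte; the function names are invented and SpamProbe does not
     use tr internally as far as this page says.

```shell
#!/bin/sh
# Sketch of the default and -a behaviour: replace every byte whose
# high bit is set (octal 200-377) with a single character.
map_high_bit() {
    # $1: replacement character (SpamProbe's default is 'z')
    LC_ALL=C tr '\200-\377' "[$1*]"
}

# Sketch of the -7 behaviour: drop those bytes instead.
strip_high_bit() {
    LC_ALL=C tr -d '\200-\377'
}

# "caf" followed by a two-byte UTF-8 character, then " ole".
printf 'caf\303\251 ole\n' | map_high_bit z
printf 'caf\303\251 ole\n' | strip_high_bit
```

     With -8 the bytes would be passed through unchanged, so no
     filter is needed for that case.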

     SpamProbe recognizes the following commands:

     receive [filename...]
          Tells SpamProbe to read its standard input (or a file
          specified after the receive command) and score it using
          the current databases.  Once the message has been
          scored it is classified as either spam or non-spam and
          its word counts are written to the appropriate
          database.  The verdict (SPAM or GOOD), the message's
          score, and its digest are written to stdout.  For
          example:

               SPAM 0.99 595f0150587edd7b395691964069d7af
          or
               GOOD 0.02 595f0150587edd7b395691964069d7af

          The string of numbers and letters after the score is
          the message's "digest", a 32-character hexadecimal MD5
          hash which uniquely identifies the message.  The digest
          is used by SpamProbe to recognize messages that it has
          processed previously so that it can keep its word
          counts consistent if the message is reclassified.

          Using the -T option additionally lists the  terms  used
          to produce the score along with their counts (number of
          times they were found in the message).
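
          A script that calls receive usually needs the three
          fields separately.  The following is a minimal sketch,
          assuming the single-line output shown above (without
          -T); the function and variable names are invented for
          this example.

```shell
#!/bin/sh
# Hypothetical helper: split a line of receive output into fields.
# Sets the shell variables verdict, score, and digest.
parse_receive() {
    set -f          # no filename expansion while word-splitting
    set -- $1       # split "SPAM 0.99 <digest>" on whitespace
    set +f
    verdict=$1
    score=$2
    digest=$3
}

parse_receive "SPAM 0.99 595f0150587edd7b395691964069d7af"
echo "verdict=$verdict score=$score digest=$digest"
```

          In real use the argument would be the captured output
          of spamprobe receive rather than a literal string.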

     score [filename...]
          Similar to receive except  that  the  database  is  not
          modified in any way.

     summarize [filename...]
          Similar to score except that it prints a short  summary
          and  score  for  each message.  This can be useful when
          testing.  Using the -T option  additionally  lists  the
          terms used to produce the score along with their counts
          (number of times they were found in the message).

     find-spam [filename...]
          Similar to score except that it prints a short  summary
          and  score  for  each  message that is determined to be
          spam.  This can be useful when testing.  Using  the  -T
          option additionally lists the terms used to produce the
          score along with their counts  (number  of  times  they
          were found in the message).

     find-good [filename...]
          Similar to score except that it prints a short  summary
          and  score  for  each  message that is determined to be
          good.  This can be useful when testing.  Using  the  -T
          option additionally lists the terms used to produce the
          score along with their counts  (number  of  times  they
          were found in the message).

     good [filename...]
          Scans each file (or stdin if no file is specified)  and
          reclassifies  every email in the file as non-spam.  The
          databases are updated appropriately.   Messages  previ-
          ously  classified  as  good (recognized using their MD5
          digest or message ids) are  ignored.   Messages  previ-
          ously classified as spam are reclassified as good.

     spam [filename...]
          Scans each file (or stdin if no file is specified) and
          reclassifies every email in the file as spam.  The
          databases are updated appropriately.  Messages
          previously classified as spam (recognized using their
          MD5 digest or message ids) are ignored.  Messages
          previously classified as good are reclassified as spam.

     remove [filename...]
          Scans each file (or stdin if no file is specified) and
          removes its term counts from the database.  Messages
          which are not in the database (recognized using their
          MD5 digest or message ids) are ignored.

     cleanup [junk_count [max_age]]
          Scans the database and removes all terms with a count
          of junk_count or less (default 2) which have not had
          their counts modified in at least max_age days (default
          7).  This should be run periodically to keep the
          database from growing endlessly.

          For my own email I use cron to run the cleanup  command
          every  day and delete all terms with count of 4 or less
          that have not been modified in two weeks. Here  is  the
          excerpt from my crontab:


               3 0 * * * /home/brian/bin/spamprobe cleanup 4 14

          Because of the way that BerkeleyDB works the database
          file will not actually shrink, but newly added terms
          will be able to use the space previously occupied by
          any removed terms, so the file's growth should be
          significantly slower if cleanup is run regularly.  To
          actually shrink the database you can build a new one
          using the BerkeleyDB utility programs db_dump and
          db_load.  For example:


               cd ~/.spamprobe
               db_dump sp_words | db_load sp_words.new
               mv sp_words sp_words.old
               mv sp_words.new sp_words

          This command does nothing for GDBM databases.
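
          The selection rule used by cleanup can be sketched as a
          simple filter.  The "count age_in_days term" input
          format below is invented purely for illustration (the
          real database is not a text file), and treating the
          count as a single number per term is an assumption.

```shell
#!/bin/sh
# Hypothetical sketch of the cleanup selection rule.
# stdin: lines of "count age_in_days term" (an invented format).
# $1: junk_count (default 2), $2: max_age in days (default 7).
# Prints the terms that would be removed.
cleanup_candidates() {
    awk -v junk="${1:-2}" -v age="${2:-7}" \
        '$1 <= junk && $2 >= age { print $3 }'
}

# With the defaults, only rare terms untouched for a week qualify.
printf '1 10 foo\n3 10 bar\n2 3 baz\n2 7 qux\n' | cleanup_candidates
```

          A term is selected only when both conditions hold, so a
          rare term that was modified recently survives until it
          has aged past max_age.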

     purge [junk_count]
          Similar to cleanup but forces the immediate deletion of
          all   terms  with  total  count  less  than  junk_count
          (default is 2) no matter how long  it  has  been  since
          they  were  modified (i.e. even if they were just added
          today). This could be handy immediately after classify-
          ing a large mailbox of historical spam or good email to
          make room for the next batch.  This command does  noth-
          ing for GDBM databases.

     edit-term term good_count spam_count
          Can be used to set the good and spam counts of a term
          explicitly.  Whether this is truly useful is doubtful,
          but it is provided for completeness' sake.  For example,
          it could be used to force a particular word to be very
          spammy or very good:

               spamprobe edit-term nigeria 0 1000000
               spamprobe edit-term burton  10000000 0

     dump Prints the contents of the word counts database one
          word per line in human readable format with good count,
          spam count, and word in columns separated by
          whitespace.  Note that when using GDBM for the database
          the words are printed in the order they are hashed, so
          the results will need to be sorted to be most useful.
          BerkeleyDB sorts terms alphabetically.  The standard
          Unix sort command can be used to sort the terms as
          desired.  For example, to list all words from "most
          good" to "least good" use this command:

               spamprobe dump | sort -k 1 -n -r

          To list all words from "most spammy" to "least  spammy"
          use this command:

               spamprobe dump | sort -k 2 -n -r
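
          Raw counts favor common words, so it can also be handy
          to rank terms by the share of their occurrences that
          were in spam.  This is a sketch only, assuming the
          three-column layout described above; the function name
          is invented, and the handling of terms that contain
          spaces is a guess.

```shell
#!/bin/sh
# Hypothetical helper: rank dump output by share of spam occurrences.
# stdin: lines of "good_count spam_count term"
rank_by_spam_share() {
    LC_ALL=C awk '{
        good = $1; spam = $2
        if (good + spam == 0) next
        $1 = ""; $2 = ""          # drop the counts, keep the term
        sub(/^[ \t]+/, "")
        printf "%.3f %s\n", spam / (good + spam), $0
    }' | LC_ALL=C sort -k 1,1 -n -r
}

# Sample data standing in for real dump output.
printf '10 0 hello\n0 25 viagra\n5 5 offer\n' | rank_by_spam_share
```

          In practice the input would come from the dump command:

               spamprobe dump | rank_by_spam_share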

     export
          Similar to the dump command but prints the  counts  and
          words  in  a comma separated format with the words sur-
          rounded by double quotes.  This can be more useful  for
          importing into some databases.

     import filename
          Reads the specified file, which must contain data
          written by the export command.  The terms and counts
          from this file are added to the database.  This can be
          used to convert a database from a prior version.

FILES
     ~/.spamprobe

SEE ALSO
     procmail, formail

BUGS
GETTING STARTED
     SpamProbe is not a standalone mail filter.  It doesn't sort
     your mail or split it into different mailboxes.  Instead it
     relies on some other program such as procmail to actually
     file your mail for you.  What SpamProbe does do is track the
     word counts in good and spam emails and generate a score for
     each email that indicates whether or not it is likely to be
     spam.  Scores range from 0 to 1, with any score of 0.9 or
     higher indicating a probable spam.
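
     To give a feel for how per-word evidence can turn into a
     single score between 0 and 1, the sketch below combines
     individual term probabilities using the formula from Paul
     Graham's article and then applies the 0.9 cutoff.  It is an
     illustration only: SpamProbe's own calculation may differ in
     detail, and the function names are invented.

```shell
#!/bin/sh
# Combine per-term spam probabilities (one per line on stdin) with the
# formula from Paul Graham's article:
#   score = prod(p) / (prod(p) + prod(1 - p))
# Assumption: SpamProbe's own calculation may differ in detail.
combine_probs() {
    LC_ALL=C awk '
        BEGIN { spam = 1; good = 1 }
        NF    { spam *= $1; good *= 1 - $1 }
        END   {
            if (spam + good == 0) print "0.5000"   # contradictory evidence
            else printf "%.4f\n", spam / (spam + good)
        }'
}

# Apply the 0.9 cutoff to a score.
classify() {
    LC_ALL=C awk -v s="$1" 'BEGIN { if (s >= 0.9) print "SPAM"; else print "GOOD" }'
}

score=$(printf '0.99\n0.90\n0.20\n' | combine_probs)
echo "$(classify "$score") $score"
```

     Two strongly spammy terms outweigh one mildly good term
     here, so the sample message lands above the cutoff.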

     Personally I use SpamProbe with procmail to filter my
     incoming email into mailboxes.  I have procmail score each
     inbound email using SpamProbe and insert a special header
     into each email containing its score.  Then I have procmail
     move spams into a special mailbox.

     No spam filter is perfect and SpamProbe sometimes makes mis-
     takes.   To  correct those mistakes I have a special mailbox
     that I put undetected spams into.  I run SpamProbe  periodi-
     cally  and  have it reclassify any emails in that mailbox as
     spam so that it will make  a  better  guess  the  next  time
     around.

     This is not a procmail primer.  You will need to ensure that
     you  have  procmail and formail installed before you can use
     this technique.  Also I recommend that you read the procmail
     documentation  so that you can fully understand this example
     and adapt it to your own needs.  That having been  said,  my
     .procmailrc file looks like this:

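          MAILDIR=$HOME/IMAP

          # Keep a safety copy of every incoming message.
          :0 c
          saved

          # Score the message and capture spamprobe's output.
          :0
          SCORE=| /home/brian/bin/spamprobe receive

          # Record the result in an X-SpamProbe header.
          :0 wf
          | formail -I "X-SpamProbe: $SCORE"

          # File detected spams into the spamprobe mbox.
          :0 a:
          *^X-SpamProbe: SPAM
          spamprobe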

     I use IMAP to fetch my email so my mailboxes all live  in  a
     directory  named  IMAP  on my mail server.  The first stanza
     copies all incoming emails into a special mbox called saved.
     SpamProbe  IS  BETA SOFTWARE and though it works well for me
     it is possible that it could somehow lose  emails.   Caution
     is always a good idea...

     The second stanza runs spamprobe in "receive" mode.  In that
     mode  SpamProbe  scores  the email and then classifies it as
     either spam or good based on the  score.   It  automatically
     adds  the word counts for the email to the appropriate data-
     base.  This is essentially like running in score  mode  fol-
     lowed immediately by either spam or good mode.

     The next stanza runs formail to add a custom header  to  the
     email containing the SpamProbe score.  The final stanza uses
     the contents of the custom header  to  file  detected  spams
     into a special mbox named spamprobe.

MAKING CORRECTIONS
     SpamProbe is not perfect.  It is able to detect over 90% of
     the spams that I receive but some still slip through.  To
     correct these missed emails I run SpamProbe periodically and
     have it scan a special mbox.  Since I use IMAP to retrieve
     my emails I can simply drop undetected spams into this mbox
     from my mail client.  If you use POP or some other system
     then you will need to find a way to get the undetected spams
     into an mbox that spamprobe can see.

     Periodically I run a script that scans three special  mboxes
     to correct errors in judgment:

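          #!/bin/bash

          IMAPDIR=$HOME/IMAP

          # Forget messages that should not influence the database.
          spamprobe remove "$IMAPDIR/remove"
          # Reclassify undetected spams.
          spamprobe spam "$IMAPDIR/spam"
          # Reclassify false positives.
          spamprobe good "$IMAPDIR/nonspam"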

     From this example you can  see  that  I  use  three  special
     mboxes.   I copy emails that I don't want spamprobe to store
     into the remove mbox.  This is useful if you  receive  email
     from  a  friend  or  colleague  that looks like spam and you
     don't want it to dilute the effectiveness of  the  terms  it
     contains.

     Undetected spams go into  the  spam  mbox.   SpamProbe  will
     reclassify  those  emails  as  spam and correct its database
     accordingly.  Note that doing this does not  guarantee  that
     the  spam will always be scored as spam in the future.  Some
     spams are too bland to detect perfectly.  Fortunately  those
     are very rare.

     The nonspam mbox is for  any  false  positives.   These  are
     always possible and it is important to have a way to reclas-
     sify them when they do occur.

     Finally you'll need to build a starting database.  Since
     SpamProbe relies on word counts from past emails it requires
     a decent-sized database to be accurate.  To build the
     database select some of your mboxes containing past emails.
     Ideally you should have one mbox of spams and one or more of
     non-spams.  If you don't have any spams handy then don't
     worry: SpamProbe will gradually become more accurate as you
     receive more spams.  Expect a fairly high false negative
     (i.e. missed spams) rate when you first start using SpamProbe.

     To import your starting messages use commands such as these.
     The example assumes that you have non-spams stored in a file
     named mbox in your home directory and some spams stored in a
     file named nasty-spams.  Replace these names with real ones.

          spamprobe good ~/mbox
          spamprobe spam ~/nasty-spams
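
     If you have several mboxes to import, a small wrapper saves
     some typing.  This is a sketch rather than part of
     SpamProbe: the train function and the SPAMPROBE variable are
     invented here, and the only SpamProbe features it relies on
     are the good and spam commands and the -c option described
     above.

```shell
#!/bin/sh
# Hypothetical wrapper for seeding the database from several mboxes.
# SPAMPROBE may be overridden to point at a different binary.
SPAMPROBE=${SPAMPROBE:-spamprobe}

# Usage: train good|spam mbox...
train() {
    kind=$1
    shift
    for mbox in "$@"; do
        if [ ! -f "$mbox" ]; then
            echo "skipping missing mbox: $mbox" >&2
            continue
        fi
        # -c creates the database directory on first use.
        "$SPAMPROBE" -c "$kind" "$mbox" || return 1
    done
}

# Example (file names are placeholders):
#   train good ~/mbox ~/lists
#   train spam ~/nasty-spams
```

     Missing files are skipped with a warning so that one bad
     path does not abort the whole import.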

WARRANTY
     SpamProbe works well for me.  However please  keep  in  mind
     that  there  is NO WARRANTY at all with this software.  Read
     the QPL (LICENSE.txt) for details.  YOU ASSUME ALL RISK when
     using this software.

     Be sure to visit the project page on SourceForge.  There you
     can submit bug reports or feature requests, read and post
     messages on the forums, and download the latest version.

          http://sourceforge.net/projects/spamprobe/

     You can also join the  spamprobe  mailing  list  to  discuss
     issues with other SpamProbe users.

      http://lists.sourceforge.net/lists/listinfo/spamprobe-users

     Also     feel      free      to      contact      me      at
     bburton@users.sourceforge.net   with   any  suggestions  for
     improvements that you don't want to post to the forums.

LEGALESE
     Burton Computer Corporation
     http://www.burton-computer.com
     http://www.cooldevtools.com

     Copyright (C) 2002 Burton Computer Corporation
     ALL RIGHTS RESERVED

     This program is open source software; you  can  redistribute
     it  and/or modify it under the terms of the Q Public License
     (QPL) version 1.0. Use of this software in whole or in part,
     including  linking  it  (modified  or unmodified) into other
     programs is subject to the terms of the QPL.

     This program is distributed in the hope that it will be use-
     ful, but WITHOUT ANY WARRANTY; without even the implied war-
     ranty of MERCHANTABILITY or FITNESS FOR  A  PARTICULAR  PUR-
     POSE.  See the Q Public License for more details.

     You should have received a copy  of  the  Q  Public  License
     along  with this program; see the file LICENSE.txt.  If not,
     visit the Burton Computer Corporation  or  CoolDevTools  web
     site QPL pages at:

          http://www.burton-computer.com/qpl.html
          http://www.cooldevtools.com/qpl.html