[Mimedefang] Considering an additional spam filter

Jim McCullars jim at info.uah.edu
Fri May 23 17:46:02 EDT 2003


Hi Folks, I am considering something and I'd like your opinions.

We are getting a lot of spam that SpamAssassin is not catching because the
spam relies heavily on "obfuscated HTML", and SA catches this only when the
spammer uses actual HTML comment tags.  In other words, if I have SA look
for the phrase, "penis enlargement", and the spammer does this:

 P<!-- hello -->en<!-- hello again -->is enl<!-- hi -->argement

SpamAssassin will catch it.  But if they do this instead:

 P<jkfdhjk>en<hjkdfhkj>is enl<asdfhjkldfas>argement

It will not, because SpamAssassin's body check does not recognize the
bogus HTML tags (even though most mail clients will render the message as
the spammer intends).
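
To illustrate the difference between the two cases, here is a minimal
sketch.  It is not how SpamAssassin works internally; the helper names
strip_comments and strip_all_tags are made up for this example.  It only
shows that removing well-formed comments leaves the bogus tags in place,
while removing anything tag-like recovers the phrase:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Remove only well-formed HTML comments (hypothetical helper).
sub strip_comments {
    my ($text) = @_;
    $text =~ s/<!--.*?-->//gs;
    return $text;
}

# Remove anything that looks like a tag, valid or not (hypothetical helper).
sub strip_all_tags {
    my ($text) = @_;
    $text =~ s/<[^>]*>//g;
    return $text;
}

# The two sample strings from the message above.
my $with_comments = 'P<!-- hello -->en<!-- hello again -->is enl<!-- hi -->argement';
my $with_bogus    = 'P<jkfdhjk>en<hjkdfhkj>is enl<asdfhjkldfas>argement';

for my $sample ($with_comments, $with_bogus) {
    printf "comments only: %s\n", strip_comments($sample);
    printf "all tags:      %s\n", strip_all_tags($sample);
}
```

With comment stripping alone, the second sample comes back unchanged, so a
plain phrase match on it fails.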

So I am considering doing something that I have resisted for some time -
putting in my filter a routine that actually opens the input message and
does my own analysis of it.  It shouldn't be too slow, since I use a RAM
file system for the spool directory.

Here is a test program that I have been playing with and am thinking
of adapting into my mimedefang-filter file.  What it does is this:  Read
each line and grab anything that looks like an HTML tag.  Store it in an
array.  After the file is read, go through the array and compare the tags
against a list of valid HTML tags.  See what the results are.  The valid
count should heavily outweigh the invalid count.  When I ran this program
against the splash page on our main web server, I got 550 valid and one
invalid (and that was apparently a typo).  When I ran it against one of
those "body fat pill" spams (as seen on all the networks, and Oprah), I
got something like 34 valid and 60 invalid.  I'm thinking I can use this
to reject possible spam, possibly with a message like "Invalid HTML
content".  What do you guys think?

Here is the program I was playing with:

#!/usr/local/bin/perl
use strict;
use warnings;

# Tag names accepted as valid HTML
my @tags = qw(a abbr acronym address applet area b base basefont bdo big
              blockquote body br button caption center cite code col colgroup
              dd del dfn dir div dl dt em fieldset font form frame frameset
              h1 h2 h3 h4 h5 h6 head hr html i iframe img input ins isindex kbd
              label legend li link map menu meta noframes noscript object ol
              optgroup option p param pre q s samp script select small span
              strike strong style sub sup table tbody td textarea tfoot th
              thead title tr tt u ul var);
# Construct a reg exp (an alternation of all valid tag names) for later...
my $regexp = join("|", @tags);
# print "the regexp is $regexp\n";
open(my $in, "<", "somefile.html") or die "Cannot open somefile.html: $!\n";
my $body = 0;
my @taglist;
while (<$in>) {
  chomp;                       # strip the newline only (chop eats any last char)
  if (length($_) == 0) { $body = 1; next }  # first blank line ends the headers
  next unless $body;
  my @list = m/<([^>]*)>/g;    # get all tags from this line
  push(@taglist, @list);       # add to tag list
}
close($in);
# Now count the valid and invalid tags
#
my ($validtag, $invalidtag) = (0, 0);   # start at 0 so a zero count prints
foreach my $taglistitem (@taglist) {
  $taglistitem =~ s/^\///;     # remove beginning / (if any)
  next if ($taglistitem =~ /^!--/);   # skip valid comment
  if ($taglistitem =~ /^(?:$regexp)\b/i) {
    ++$validtag;
  } else {
    ++$invalidtag;
#    print "Invalid tag $taglistitem\n";
  }
}
print "The number of valid tags encountered was $validtag\n";
print "The number of invalid tags encountered was $invalidtag\n";
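
For use inside a filter, the same counting could be wrapped in a function
that takes the message body as a string.  This is only a sketch under some
assumptions: the names count_tags and looks_obfuscated, the abbreviated tag
list, and the thresholds are all hypothetical, and the actual rejection
would be done by whatever means the filter already uses.  Scanning a whole
string instead of line by line also picks up tags that span lines:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Abbreviated list for the sketch; the full list from the program goes here.
my @VALID_TAGS = qw(a b body br div font head html i img p table td tr);
my $VALID_RE   = join '|', @VALID_TAGS;

# Count valid and invalid tags in a string of HTML.
# Returns a two-element list: (valid, invalid).
sub count_tags {
    my ($html) = @_;
    my ($valid, $invalid) = (0, 0);
    while ($html =~ /<([^>]*)>/g) {
        my $tag = $1;
        $tag =~ s{^/}{};            # drop the leading / of a closing tag
        next if $tag =~ /^!--/;     # skip comments
        if ($tag =~ /^(?:$VALID_RE)\b/i) { $valid++ } else { $invalid++ }
    }
    return ($valid, $invalid);
}

# Hypothetical decision rule: flag the text when there are at least
# $min_invalid invalid tags and they outnumber the valid ones.
# Both conditions are guesses that would need tuning on real mail.
sub looks_obfuscated {
    my ($html, $min_invalid) = @_;
    $min_invalid = 5 unless defined $min_invalid;
    my ($valid, $invalid) = count_tags($html);
    return ($invalid >= $min_invalid && $invalid > $valid) ? 1 : 0;
}

my $spam = 'P<jkfdhjk>en<hjkdfhkj>is enl<asdfhjkldfas>argement';
my $ham  = '<html><body><p>Hello <b>world</b></p></body></html>';
printf "spam: valid=%d invalid=%d\n", count_tags($spam);
printf "ham:  valid=%d invalid=%d\n", count_tags($ham);
print "spam sample flagged\n" if looks_obfuscated($spam, 3);
```

Requiring a minimum number of invalid tags keeps a single typo on an
otherwise clean page from triggering the check.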

   Comments?

Jim




