[Mimedefang] Considering an additional spam filter

Michael Sims michaels at crye-leike.com
Sat May 24 10:04:01 EDT 2003


Quoting Jim McCullars <jim at info.uah.edu>:

> We are getting a lot of spam that SpamAssassin is not catching because it
> has a high degree of "obfuscated html" and SA catches this only when the
> spammer uses actual HTML comment tags.

Yeah, I'm seeing this too.  I first noticed it when I saw the word "Oprah" in 
some HTML spam.  I knew that the mere presence of the word "Oprah" should have 
greatly increased the spam score, but this message had scored very low.  When I 
viewed the source I saw what you describe with a letter or two interspersed 
with invalid HTML tags.
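
To make the problem concrete, here is a minimal sketch in core Perl. The sample string and the bogus tag names are hypothetical, made up for illustration rather than taken from a real message. It shows a plain word match missing the raw HTML and succeeding once tag-like runs are dropped:

```perl
use strict;
use warnings;

# Hypothetical sample: bogus tags split the word so a plain-text match misses it.
my $raw = 'Watch Op<xq>r<zzz>ah tonight';

# A word match against the raw HTML fails because the letters are not adjacent.
my $hit_raw = ($raw =~ /\bOprah\b/) ? 1 : 0;

# Drop anything that looks like a tag, then test again.
(my $stripped = $raw) =~ s/<[^>]*>//g;
my $hit_stripped = ($stripped =~ /\bOprah\b/) ? 1 : 0;

print "raw: $hit_raw, stripped: $hit_stripped ($stripped)\n";
```

The substitution used here is deliberately simplistic and only copes with well-behaved input like this sample.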

I think it's a good idea to try and strip these tags out, but this is one of 
those things that's extremely complicated.  Tags can be split across multiple 
lines and can be nested inside HTML comments, and you can also have tags like 
this:

<IMG SRC = "foo.gif" ALT = "A > B">

Technically speaking, the greater-than sign in the ALT attribute should be 
written as "&gt;", but browsers will still render the above without complaint.
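
As a rough illustration of why that tag is awkward, the sketch below runs two substitutions over it. The trailing "visible text" and the variable names are assumptions added for the demo. The first pattern treats the first ">" as the end of the tag. The second skips over quoted attribute values, which handles this case but still knows nothing about HTML comments or other edge cases:

```perl
use strict;
use warnings;

my $html = '<IMG SRC = "foo.gif" ALT = "A > B">visible text';

# Naive approach: a tag ends at the first ">" after "<".
(my $naive = $html) =~ s/<[^>]*>//g;

# Quote-aware approach: inside a tag, consume quoted attribute values whole,
# so a ">" inside quotes does not end the tag.
(my $aware = $html) =~ s/<(?:[^>'"]|"[^"]*"|'[^']*')*>//g;

print "naive: [$naive]\n";
print "aware: [$aware]\n";
```

The naive version leaves part of the ALT attribute behind as if it were body text, while the quote-aware version removes the whole tag. Neither is a substitute for a real HTML parser.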

The Perl Cookbook (O'Reilly) recommends using the CPAN modules HTML::Parse and 
HTML::FormatText, like so:

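use HTML::Parse;       # exports parse_html()
use HTML::FormatText;  # renders a parsed HTML tree as plain text
# $html_text is assumed to already hold the HTML part of the message.
my $plain_text = HTML::FormatText->new->format(parse_html($html_text));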

I haven't used this myself but it's something you might want to look into.  
Personally this is a bit too involved for me...I'm hoping that the next version 
of SpamAssassin will be better at catching this sort of thing...

___________________________________________
Michael Sims
Project Analyst - Information Technology
Crye-Leike Realtors
Office: (901)758-5648  Pager: (901)769-3722
___________________________________________


