[Mimedefang] Considering an additional spam filter
Michael Sims
michaels at crye-leike.com
Sat May 24 10:04:01 EDT 2003
Quoting Jim McCullars <jim at info.uah.edu>:
> We are getting a lot of spam that SpamAssassin is not catching because it
> has a high degree of "obfuscated html" and SA catches this only when the
> spammer uses actual HTML comment tags.
Yeah, I'm seeing this too. I first noticed it when I saw the word "Oprah" in
some HTML spam. I knew that the mere presence of the word "Oprah" should have
greatly increased the spam score, but this message had scored very low. When I
viewed the source, I saw what you describe: the text was broken up every letter
or two by invalid HTML tags.
I think it's a good idea to try to strip these tags out, but this is one of
those things that's extremely complicated to get right. Tags can be split
across multiple lines and can be nested inside HTML comments, and you can also
have tags like this:
<IMG SRC = "foo.gif" ALT = "A > B">
Technically speaking, the greater-than sign in the ALT attribute should be
written as "&gt;", but browsers will still render the above without complaint.
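To illustrate why that tag is awkward, here is a minimal sketch in core Perl. The sub names strip_tags_naive and strip_tags_quoted and the sample strings are made up for this example; neither comes from SpamAssassin or MIMEDefang. The naive pattern copes with the simple "letter or two between bogus tags" case but stops at the ">" inside the quoted ALT value. The quote-aware pattern handles that one case, yet it still knows nothing about HTML comments, so it is not a complete solution.

```perl
use strict;
use warnings;

# Naive approach: a tag is "<", any run of characters that are not ">", then ">".
# [^>] also matches newlines, so tags split across lines are still removed.
sub strip_tags_naive {
    my ($html) = @_;
    $html =~ s/<[^>]*>//g;
    return $html;
}

# Quote-aware variant: inside a tag, a quoted attribute value is consumed
# as a unit, so a ">" between quotes does not end the tag early.
# Assumption: quotes inside tags are balanced; comments are NOT handled.
sub strip_tags_quoted {
    my ($html) = @_;
    $html =~ s/<(?:[^>"']|"[^"]*"|'[^']*')*>//g;
    return $html;
}

my $obfuscated = 'O<xq>pr<zz>ah';
my $img        = 'before <IMG SRC = "foo.gif" ALT = "A > B"> after';

print strip_tags_naive($obfuscated), "\n";   # Oprah
print strip_tags_naive($img), "\n";          # before  B"> after
print strip_tags_quoted($img), "\n";         # before  after
```

The leftover ` B">` in the second line of output is exactly the kind of debris that makes a plain regex unreliable here, which is the argument for handing the job to a real HTML parser.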
The Perl Cookbook (O'Reilly) recommends using the CPAN modules HTML::Parse and
HTML::FormatText, like so:
use HTML::Parse;        # provides parse_html()
use HTML::FormatText;
# Parse the HTML into a tree, then render that tree as plain text.
my $plain_text = HTML::FormatText->new->format(parse_html($html_text));
I haven't used this myself, but it's something you might want to look into.
Personally, this is a bit too involved for me... I'm hoping that the next
version of SpamAssassin will be better at catching this sort of thing...
___________________________________________
Michael Sims
Project Analyst - Information Technology
Crye-Leike Realtors
Office: (901)758-5648 Pager: (901)769-3722
___________________________________________