[Mimedefang] scanning message body
Joseph Brennan
brennan at columbia.edu
Fri Mar 19 12:27:15 EST 2004
--On Friday, March 19, 2004 10:24 AM -0600 Matthew Simpson
<matthew at symatec-computer.com> wrote:
> I need some quick help scanning the message body for URLs and certain HTML
> tags.
We do this in filter() to catch tags. It changes <object to
<no-object, etc., which disables the tag.
    # Check for bad code in HTML parts
    if ($type eq "text/html") {
        my ($io, $bla, $badtag);
        # Read the part line by line, building the rewritten body in $bla
        if ($io = $entity->open("r")) {
            while (defined($_ = $io->getline)) {
                # note iframe, script, object
                # (\b rather than a trailing space, so <script> is caught too)
                if (/<(iframe|script|object)\b/i) {
                    $badtag = $1;
                    s/<(iframe|script|object)\b/<no-$1 /ig;
                }
                $bla .= $_;
            }
            $io->close;
        }
        # Only rewrite the part and rebuild the message if something changed
        if ($badtag) {
            if ($io = $entity->open("w")) {
                $io->print($bla);
                $io->close;
            }
            md_graphdefang_log('modify', "$badtag tag deactivated");
            action_change_header("X-Warning",
                "$badtag tag modified by Columbia filter");
            action_rebuild();
        }
    }
Bugged IMG tags are probably the next thing to go into this section.
Personally I use a MUA that does not show images.
Scanning for URLs is much harder. The above does not catch things
broken over more than one line. You can set $/="\n\n" to work by
paragraphs, but I think some of the more obfuscated garbage even
spans paragraphs. I just started looking at this. Basically you
have to catch <a.href and then buffer everything till the next </a>,
with some kind of stream input. I didn't peek at Anomy HTML Cleaner
yet to see how they do it :-) And if you really want to do a lot of
HTML cleaning, well, they do it all, more than we want to do.
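For what it's worth, here is a rough sketch of one way around the line-break problem. It is not what we run: instead of streaming and buffering up to the next </a>, it assumes the whole HTML part is already in one string and lets the regex cross newlines. The function name extract_anchor_urls is made up for illustration, and the pattern only handles the common quoted and unquoted href forms.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of MIMEDefang): return the href value of
# every <a ...> tag in $html, even when the tag or the URL itself is
# broken across lines.
sub extract_anchor_urls {
    my ($html) = @_;
    my @urls;
    # [^>]* is a negated class, so it matches newlines as well; the three
    # alternatives cover "double-quoted", 'single-quoted' and bare values.
    while ($html =~ /<a\b[^>]*?\bhref\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/gi) {
        my $url = defined($1) ? $1 : defined($2) ? $2 : $3;
        $url =~ s/\s+//g;    # drop whitespace/newlines used to split the URL
        push @urls, $url;
    }
    return @urls;
}

# Example: the tag and the URL are both split over lines.
my $html = "<a\nhref=\"http://exam\nple.com/x\">click</a>";
print "$_\n" for extract_anchor_urls($html);   # prints http://example.com/x
```

Inside filter() the string could presumably come from $entity->bodyhandle->as_string, at the cost of holding the whole part in memory. This only pulls out the URLs; deciding what to do with them is a separate problem.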
Joseph Brennan
Academic Technologies Group, Academic Information Systems (AcIS)
Columbia University in the City of New York