[Mimedefang] scanning message body

Fri Mar 19 12:27:15 EST 2004

--On Friday, March 19, 2004 10:24 AM -0600 Matthew Simpson 
<matthew at symatec-computer.com> wrote:

> I need some quick help scanning the message body for URLs and certain HTML
> tags.

We do this in filter() to catch tags.  It changes <object to
<no-object, and so on, which disables the tag.

    # Check for bad code in HTML parts
    if ($type eq "text/html") {
        my($io,$bla,$badtag);
        if ($io = $entity->open("r")) {
            while (defined($_ = $io->getline)) {
                # note iframe, script, object; \b also matches <script>
                if (/<(iframe|script|object)\b/i) {
                    $badtag = $1;
                    $_ =~ s/<(iframe|script|object)\b/<no-$1 /ig;
                }
                $bla .= $_;
            }
            $io->close;
        }
        if ($badtag) {
            if ($io = $entity->open("w")) {
                $io->print($bla);
                $io->close;
            }
            md_graphdefang_log('modify',"$badtag tag deactivated");
            action_change_header("X-Warning",
                                 "$badtag tag modified by Columbia filter");
            action_rebuild();
        }
    }
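
To try the substitution outside of MIMEDefang, something like the
following should do.  It is only a sketch: defang_tags is a made-up
helper name, not anything MIMEDefang provides, and it works on a plain
string rather than on $entity.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of MIMEDefang): rename risky opening tags
# in a string of HTML.  Returns the new text followed by the lowercased
# names of the tags that were renamed.
sub defang_tags {
    my ($html) = @_;
    # Collect the tag names first; \b matches <script> and <script src=...>
    # alike, but not longer names such as <objective.
    my @hit = map { lc($_) } ($html =~ /<(iframe|script|object)\b/ig);
    # Closing tags start with "</" so they are left alone.
    $html =~ s/<(iframe|script|object)\b/<no-$1/ig;
    return ($html, @hit);
}

my ($out, @hit) =
    defang_tags(qq{<p>hi</p><SCRIPT>x()</SCRIPT>\n<object data="a.swf">});
print $out, "\n";
print "deactivated: @hit\n";
```

The closing tags stay as they are, which is fine here since an
unmatched </script> does nothing by itself.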

Bugged IMG tags are probably the next thing to go into this section.
Personally I use a MUA that does not show images.

Scanning for URLs is much harder.  The above does not catch things
broken over more than one line.  You can set $/="\n\n" to work by
paragraphs but I think some of the more obfuscated garbage even
spans paragraphs.  I just started looking at this.  Basically you
have to catch <a.href and then buffer all till the next </a>, with
some kind of stream input.  I haven't peeked at Anomy HTML Cleaner yet
to see how they do it :-)   And if you really want to do a lot of
HTML cleaning, well, they do it all-- more than we want to do.
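
A rough sketch of the buffering idea, assuming the whole HTML part has
already been collected into one string (the way $bla is built up in the
filter above) instead of being handled line by line.  extract_links is
a made-up name, and the regex is a simplification, not a real HTML
parser, so treat it as a starting point only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return [href, link text] pairs from an HTML string.
# Because the whole part is in one string, line and paragraph breaks
# inside a tag or inside the link text no longer matter.
sub extract_links {
    my ($html) = @_;
    my @links;
    while ($html =~ m{
            <a\b [^>]*? \bhref \s* = \s*             # opening tag up to href=
            (?: "([^"]*)" | '([^']*)' | ([^\s>]+) )  # quoted or bare value
            [^>]* >                                  # rest of the opening tag
            (.*?)                                    # link text, may span lines
            </a\s*>
        }gisx) {
        my $href = defined($1) ? $1 : defined($2) ? $2 : $3;
        my $text = $4;
        $href =~ s/[\r\n\t]+//g;   # assumption: join URLs split over lines
        $text =~ s/<[^>]*>//g;     # drop tags nested in the link text
        $text =~ s/\s+/ /g;        # collapse runs of whitespace
        push @links, [$href, $text];
    }
    return @links;
}

my $html = <<'EOT';
<p>Click <a
 href="http://example.com/
buy">here

now</a> or <A HREF='http://example.org/'>there</A>.</p>
EOT

for my $l (extract_links($html)) {
    print "$l->[0] => $l->[1]\n";
}
```

Slurping the part costs memory on large messages, which is the
trade-off against true stream input.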

Joseph Brennan
Academic Technologies Group, Academic Information Systems (AcIS)
Columbia University in the City of New York