brennan at columbia.edu
Sat Mar 18 09:58:47 EST 2006
The recent talk about anchors with "https" text and "http" links
has nudged me to post the code I've been working on that uses
HTML::TokeParser to look for bad things in HTML. See the wiki

So far I have not seen HTML::TokeParser confused by any of the
obfuscation tricks used by spammers. It identifies all the tags
and their attributes, and categorizes them for us as start, end,
text, etc. All I had to code was what to look for and what to do
with it.
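
To make the start/end/text idea concrete, here is a rough sketch of such a token stream. It uses Python's standard html.parser as a stand-in for HTML::TokeParser and is not the HTMLCheck code. The names TokenLister and tokenize are made up for this illustration; the "S", "E", "T", "C" labels imitate the token types HTML::TokeParser reports.

```python
from html.parser import HTMLParser


class TokenLister(HTMLParser):
    """Collect a flat list of tokens: start tags, end tags, text, comments."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased by the parser.
        self.tokens.append(("S", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.tokens.append(("E", tag))

    def handle_data(self, data):
        self.tokens.append(("T", data))

    def handle_comment(self, data):
        self.tokens.append(("C", data))


def tokenize(html):
    """Hypothetical helper: return the token list for one HTML string."""
    lister = TokenLister()
    lister.feed(html)
    lister.close()
    return lister.tokens


if __name__ == "__main__":
    for token in tokenize('<A HREF="http://foo.com">https://foo.com</A>'):
        print(token)
```

With the tags already sorted out this way, the checking code only has to look at the href of each "a" start token and at the text token that follows it.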

HTMLCheck actually changes messages: it comments out some of the
bad things by wrapping them in <!-- ... --> comment markers. For
other bad things it asks for the message to be rejected.

Among the bad things are those mismatched anchors. We compare
the domains of the visible and real urls. If they do not
match, we comment out the anchor tag, leaving the visible url
to be copy-pasted if wanted. If, on top of the domain mismatch,
the visible url is https and the real one is http, we reject
the message. Thus...
<a href="http://foo.com">http://bar.com</a> becomes
<!-- <a href="http://foo.com"> -->http://bar.com<!-- </a> -->
<a href="http://foo.com">https://foo.com</a> becomes
<!-- <a href="http://foo.com"> -->https://foo.com<!-- </a> -->
<a href="http://foo.com">https://bar.com</a> is rejected.
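
The decision rule described in the prose can be sketched as a small function. This is again Python rather than the Perl of HTMLCheck, and only an illustration under stated assumptions: classify and comment_out are hypothetical names, visible text that is not an http or https url is simply passed through, and treating a scheme-only difference on the same domain as "comment out" is an assumption drawn from the second example, not something the prose states.

```python
from urllib.parse import urlsplit


def classify(href, visible):
    """Return 'ok', 'comment' or 'reject' for one anchor.

    href    -- the real url from the anchor's href attribute
    visible -- the anchor text the reader sees
    """
    real = urlsplit(href.strip())
    shown = urlsplit(visible.strip())

    # Assumption: only anchor text that itself looks like a url is compared.
    if shown.scheme not in ("http", "https") or not shown.hostname:
        return "ok"

    same_domain = (real.hostname or "") == shown.hostname
    if not same_domain:
        # Visible https hiding a real http link on another domain: reject.
        if shown.scheme == "https" and real.scheme == "http":
            return "reject"
        return "comment"

    # Assumption: same domain but different scheme is still commented out.
    if shown.scheme != real.scheme:
        return "comment"
    return "ok"


def comment_out(href, visible):
    """Wrap the anchor tags in comments, leaving the visible url readable."""
    return '<!-- <a href="%s"> -->%s<!-- </a> -->' % (href, visible)


if __name__ == "__main__":
    pairs = [
        ("http://foo.com", "http://bar.com"),
        ("http://foo.com", "https://foo.com"),
        ("http://foo.com", "https://bar.com"),
    ]
    for href, visible in pairs:
        print(classify(href, visible), href, visible)
```

The sketch compares full host names; a real filter would probably want to decide whether www.foo.com and foo.com count as the same domain before comparing.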
Columbia University Information Technology