[Mimedefang] Mailing lists, ham, and broken MUA's

Tue May 4 13:56:31 EDT 2010

On 05/04/2010 07:51 AM, Steffen Kaiser wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On Mon, 3 May 2010, Philip Prindeville wrote:
>
>> The problem is this: the message will be intelligible to English
>> language readers, but it will generate a lot of false positives for
>> mailing list recipients who aren't expecting to get non-English
>> messages (or English messages encoded in anything other than US-ASCII,
>> ISO-8859-1, or UTF-8).
>
> :-)
>
>> If the message body is Content-Type: text/plain; charset=xxxx should
>> it be squashed down in the case of mailing list traffic for English
>> language mailing lists?
>
> Nice idea. To make it really work, you should exempt the signature:
> some people write their name in its native spelling.

Found the reference, if anyone cares... RFC 2046, last paragraph of
section 4.1.2:

   In general, composition software should always use the "lowest common
   denominator" character set possible.  For example, if a body contains
   only US-ASCII characters, it SHOULD be marked as being in the US-
   ASCII character set, not ISO-8859-1, which, like all the ISO-8859
   family of character sets, is a superset of US-ASCII.  More generally,
   if a widely-used character set is a subset of another character set,
   and a body contains only characters in the widely-used subset, it
   should be labelled as being in that subset.  This will increase the
   chances that the recipient will be able to view the resulting entity
   correctly.
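
As a rough illustration of that "lowest common denominator" rule, here is
a minimal sketch using only the core Encode module. The function name
lowest_common_charset is made up for this example, and it assumes the
body has already been decoded into a Perl character string:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the MIME label of the narrowest of three candidate charsets
# that can represent every character in the (already decoded) $string.
sub lowest_common_charset {
    my ($string) = @_;
    for my $charset ('US-ASCII', 'ISO-8859-1') {
        # encode() with a CHECK value may modify its argument, so use a copy
        my $copy = $string;
        my $bytes = eval { encode($charset, $copy, Encode::FB_CROAK()) };
        return $charset if defined $bytes;
    }
    return 'UTF-8';    # UTF-8 can represent any Unicode character
}
```

Only three labels are considered here; a real filter would probably want
a longer, site-specific list.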

>
>> use Encode::First qw(encode_first);
>>
>> my $encodings = join('ascii', 'latin1', 'utf-8', $oldcharset);
> my $encodings = join(',', 'ascii', 'latin1', 'utf-8');
>
> "utf8" always matches, IMHO, but first you have to decode() the content,
> which BTW I found problematic on its own; that's why I'm using
> a "decode_first"-like function:
>
> try decode with supplied charset, then check if it is good utf8,
> then decode as latin1, which matches always.
>
> That's the same with your sequence: $oldcharset will never be reached
> because you can always encode to 'utf-8'.

Ok, right.  So try the first two...  And if they don't work, then
transcode to utf-8.
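
For the decoding side, the "decode_first"-like fallback Steffen describes
might look roughly like this. It is only a sketch with the core Encode
module, not his actual function; the name decode_first_sketch and the
(text, charset) return convention are my own assumptions:

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Try the declared charset, then strict UTF-8, then latin1 (which accepts
# any byte sequence). Returns the decoded text and the charset that worked.
sub decode_first_sketch {
    my ($bytes, $declared) = @_;
    # skip a missing label or one that Encode does not know about
    my @candidates = grep { defined($_) && find_encoding($_) } ($declared, 'UTF-8');
    for my $charset (@candidates) {
        my $copy = $bytes;    # decode() with a CHECK value may modify its argument
        my $text = eval { decode($charset, $copy, Encode::FB_CROAK()) };
        return ($text, $charset) if defined $text;
    }
    return (decode('iso-8859-1', $bytes), 'iso-8859-1');
}
```

For example, my ($text, $used) = decode_first_sketch($body, $oldcharset);
gives back the character string plus whichever charset actually matched.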

>
>> my ($newcharset, $newlen) = encode_first($encodings, $string);
>>
>> if ($newlen<= length($string)) {
>>    # use $newstr instead
>> }
>
> This check does not fit, IMHO: If you have a real, 7bit clean ASCII
> message, it should be the same in any other multi-byte or 8bit
> encoding, because they use ASCII as a base, don't they?

Ok, so:

use Encode::First qw(encode_first);

# also need to handle aliases...
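unless ($oldcharset eq 'ascii' || $oldcharset eq 'latin1' || $oldcharset eq 'utf-8') {
    # try the two narrow charsets first; $oldcharset is last, so getting it
    # back means neither ascii nor latin1 could represent the body
    my $encodings = join(',', 'ascii', 'latin1', $oldcharset);

    my ($newcharset, $newlen) = encode_first($encodings, $string);

    # nothing narrower fits, so fall back to utf-8
    if ($newcharset eq $oldcharset) {
        $newcharset = 'utf-8';
    }

    # transcode as $newcharset
}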
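
On the alias problem: one possible approach (a sketch, not tested against
real list traffic) is to let Encode resolve the label and compare its
canonical names instead of the raw strings. canonical_charset and
is_already_fine are hypothetical helper names:

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Map a charset label (any alias, any case) to one canonical lower-case
# name, or return undef if Encode does not recognise the label.
sub canonical_charset {
    my ($label) = @_;
    return unless defined $label;
    my $enc = find_encoding($label) or return;
    my $name = lc $enc->name;
    # Encode has two UTF-8 flavours; treat both as plain utf-8 here
    return 'utf-8' if $name eq 'utf8' || $name eq 'utf-8-strict';
    return $name;
}

my %already_fine = map { $_ => 1 } qw(ascii iso-8859-1 utf-8);

# True if the label resolves to one of the three charsets we leave alone.
sub is_already_fine {
    my $name = canonical_charset($_[0]);
    return (defined($name) && exists $already_fine{$name}) ? 1 : 0;
}
```

With that, the test above could become
unless (is_already_fine($oldcharset)) { ... } and would also catch
US-ASCII, ISO-8859-1, UTF8 and friends.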

>
> Your goal is to hide the Asian charset for English messages,
> therefore I would use:
>
> my %goodCharset = map { $_ => 1 } qw/ascii latin1 iso-8859-1/;
> if(!$goodCharset{lc $oldcharset} && $goodCharset{lc $newcharset}) {
>     # replace body
> }
> UTF-8 does not do any good, but hides the Asian font :-)
>
> Regards,
>
> - -- Steffen Kaiser

Well, not just Asian... KOI8 and the other Cyrillic charsets, etc. as
well.  And all of those windows-xxxx abominations.
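
Putting Steffen's "old charset is not good, new charset is good" test
together with the transcoding, the whole thing might be sketched like
this with just the core Encode module. squash_body is a made-up name, and
the sketch assumes you already have the raw body bytes and the declared
charset in hand:

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

my %good_charset = map { $_ => 1 } qw(ascii iso-8859-1);

# Return (new charset, new bytes) when the body can be relabelled with a
# narrower charset; return an empty list when it should be left alone.
sub squash_body {
    my ($bytes, $oldcharset) = @_;
    my $old = find_encoding($oldcharset) or return;   # unknown label
    return if $good_charset{ $old->name };            # already narrow

    my $copy = $bytes;    # decode() with a CHECK value may modify its argument
    my $text = eval { $old->decode($copy, Encode::FB_CROAK()) };
    return unless defined $text;                      # body does not match its label

    for my $new (qw(ascii iso-8859-1)) {
        my $tmp = $text;
        my $out = eval { encode($new, $tmp, Encode::FB_CROAK()) };
        return ($new, $out) if defined $out;
    }
    return;    # genuinely needs the wider charset; keep the original
}
```

So an English-only body labelled koi8-r or windows-1252 comes back as
ascii or iso-8859-1, while real Cyrillic text is left untouched.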

-Philip