[Mimedefang] MIME::Entity not handling Charset => 'utf-8' correctly?

Thu Feb 21 05:36:41 EST 2013

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wed, 20 Feb 2013, Philip Prindeville wrote:

> Awesome, that worked!
>
> I'm wondering if in MIME::Body we should take:
>
> sub as_string {
> my $self = shift;
> my $str = '';
> my $fh = IO::File->new(\$str, '>:') or croak("Cannot open in-memory file: $!");
> $self->print($fh);
> close($fh);
> return $str;
> }
>
> and have:
>
> return Encode::decode($charset, $str);

I suppose that would violate the internals of the MIME:: and Mail::
namespace functions; they are tied together very closely.

Actually, I looked into a UTF-8 version of MIME-tools a few years back to
overcome character set problems when storing header data in a Postgres
database. My idea was that everything the MIME:: functions return should
be a Perl character string (internal utf8), with any character set
information already decoded, and that anything passed into the functions
should be a Perl character string as well. I think one would need to
rewrite the whole framework from scratch.
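
The decode step proposed above can be sketched in isolation. This is only
a minimal illustration under my own assumptions: decode_body_string() is a
hypothetical helper, not part of MIME::Body, and the charset is simply
passed in by the caller because MIME::Body does not store one.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper, NOT part of MIME::Body: turn the octets that
# as_string() returns into a Perl character string.
sub decode_body_string {
    my ($octets, $charset) = @_;

    # find_encoding() returns undef for names Encode does not know.
    my $enc = defined $charset ? Encode::find_encoding($charset) : undef;
    return $octets unless $enc;    # unknown charset: hand back raw octets

    my $copy = $octets;            # decode a copy, keep the caller's data intact
    return $enc->decode($copy);
}

my $text = decode_body_string("caf\xC3\xA9", 'utf-8');
printf "%d characters\n", length $text;    # 5 octets in, 4 characters out
```

Returning the raw octets for an unknown charset is just one possible
fallback; croaking would be equally defensible.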

> instead, but I'm not sure how we'd retrieve $charset…  It would need to be stored into MIME::Body which isn't currently the case.

Encode is a tricky module in its own right. From perldoc Encode:

"Handling Malformed Data
        The optional CHECK argument tells Encode what to do when it
        encounters malformed data.  Without CHECK, Encode::FB_DEFAULT
        ( == 0 ) is assumed.

        As of version 2.12 Encode supports coderef values for CHECK.
        See below.

        NOTE: Not all encoding support this feature
          Some encodings ignore CHECK argument.  For example,
          Encode::Unicode ignores CHECK and it always croaks on error.
"

Some encodings modify the $str argument in place, leaving in it the data
that was NOT decoded. So you'd call Encode::decode($charset, "".$str) to
force a copy, at the cost of a performance penalty.
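
A small sketch of the in-place modification, as far as I understand
perldoc Encode: it happens when a CHECK value such as FB_QUIET is given,
and the LEAVE_SRC flag is the documented way to keep the source intact
without copying by hand. The sample octets are made up for illustration.

```perl
use strict;
use warnings;
use Encode ();

my $octets = "abc\xFFdef";    # \xFF can never appear in valid UTF-8

# With FB_QUIET, decoding stops at the first malformed byte and the
# source scalar is overwritten with the unprocessed remainder.
my $consumed = $octets;
my $chars    = Encode::decode('UTF-8', $consumed, Encode::FB_QUIET());

# Adding LEAVE_SRC asks Encode not to touch the source scalar.
my $kept   = $octets;
my $chars2 = Encode::decode('UTF-8', $kept,
                            Encode::FB_QUIET() | Encode::LEAVE_SRC());

printf "decoded=%s remainder_len=%d kept_len=%d\n",
    $chars, length $consumed, length $kept;
```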

I also got weird results with decode('latin1', $str). I guess that is
because of this caveat in perldoc Encode:
"CAVEAT: When you run "$string = decode("utf8", $octets)", then $string
may not be equal to $octets.  Though they both contain the same data, the
UTF8 flag for $string is on unless $octets entirely consists of ASCII data
(or EBCDIC on EBCDIC machines)."
When I pass the results of decode('latin1', $str) to LDAP or Postgres, I
sometimes get errors.

I now pass all strings through a function that looks terrible, but since
then Web, Postgres, LDAP and text files have played together nicely.

> On Feb 20, 2013, at 6:21 PM, David F. Skoll <dfs at roaringpenguin.com> wrote:
>> Try putting "use Encode;" near the top of your test file and replacing
>>
>> utf8::upgrade($string);
>>
>> with:
>>
>> $string = Encode::decode('utf-8', $string);

In fact, I found that utf8::upgrade() works for me as a replacement for
decode('latin1'). The latter seems to "do nothing", which causes other
modules, like Net::LDAP or DBD::Pg, to pass invalid UTF-8 to the services.
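
For comparison, a minimal sketch of what the two calls do to the same
octets. The sample strings are made up, and this shows only the core
behaviour, not what Net::LDAP or DBD::Pg do with the result.

```perl
use strict;
use warnings;
use Encode ();

my $utf8_octets = "caf\xC3\xA9";    # "cafe" with e-acute, encoded as UTF-8

# utf8::upgrade() only changes Perl's internal representation; the
# string still holds 5 characters, one per original byte.
my $upgraded = $utf8_octets;
utf8::upgrade($upgraded);

# Encode::decode() interprets the bytes as UTF-8 and yields 4 characters.
my $decoded = Encode::decode('UTF-8', $utf8_octets);

# For Latin-1 octets the two approaches give the same characters.
my $latin1      = "caf\xE9";
my $via_upgrade = $latin1;
utf8::upgrade($via_upgrade);
my $via_decode = Encode::decode('latin1', $latin1);

printf "upgraded=%d decoded=%d same_latin1=%s\n",
    length $upgraded, length $decoded,
    ($via_upgrade eq $via_decode ? 'yes' : 'no');
```

So utf8::upgrade() can stand in for decode('latin1') only when the input
really is Latin-1; for UTF-8 input the octets have to be decoded.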

- -- 
Steffen Kaiser
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)

iQEVAwUBUSX4uZ8mjdm1m0FfAQJLPAf9EPC0E+gm5cJ4PvwxQHT2MzGoTmfLz1/C
nd7kihJnCqmWHQeYLhRlETqX4D1vG/ZGS6WbaP8Fybn400Tfb4JZBs9kZafS7dri
z3r6wk70Vd0By7GM5zIPlTbovU7HqiIFBBoHrdLkaSvzGq95ZfyH5u8aZjj39D85
2nDracTpxp9VF1rsgDi9I3z2lJpRjtJsufVUTvIhynOghQoAhw0S8FEAp7CrLnOX
UHsTTW1+CPhJA3zxY7jgGKV65smNYjtB4MZ1D0cxq2Y6Op7R2NmbRZrlXfFsfMBs
ah7y6nOmlOOpJ1oG760qZY31GjAcvuHgzcliV6rBXueMb1qSM3yHyw==
=A/mV
-----END PGP SIGNATURE-----