From Mark at Misty.com Wed Feb 6 17:00:55 2013
From: Mark at Misty.com (Mark G Thomas)
Date: Wed, 6 Feb 2013 17:00:55 -0500
Subject: [Mimedefang] main::rebuild_entity() called too early to check prototype
Message-ID: <20130206220055.GA27440@allie.home.misty.com>

Hi,

I'm running into some issues with perl-5.16.2 and mimedefang.

Errors like this:

  defined(@array) is deprecated at /opt/mimedefang-2.67/bin/mimedefang.pl line 7335.

and this:

  mimedefang-multiplexor[11932]: Slave 5 stderr: main::rebuild_entity() called too early to check prototype at /opt/mimedefang-2.73/bin/mimedefang.pl line 805.

The first one seems fixed by just changing to "if (!@array)", but
I'm not sure how to fix the other error.

Mark

--
Mark G. Thomas (Mark at Misty.com)

From dfs at roaringpenguin.com Wed Feb 6 17:42:58 2013
From: dfs at roaringpenguin.com (David F. Skoll)
Date: Wed, 6 Feb 2013 17:42:58 -0500
Subject: [Mimedefang] main::rebuild_entity() called too early to check prototype
In-Reply-To: <20130206220055.GA27440@allie.home.misty.com>
References: <20130206220055.GA27440@allie.home.misty.com>
Message-ID: <20130206174258.1c350575@shishi.roaringpenguin.com>

On Wed, 6 Feb 2013 17:00:55 -0500
Mark G Thomas wrote:

> mimedefang-multiplexor[11932]: Slave 5 stderr: main::rebuild_entity()
> called too early to check prototype
> at /opt/mimedefang-2.73/bin/mimedefang.pl line 805.

> The first one seems fixed by just changing to "if (!@array)", but
> I'm not sure how to fix the other error.

You can fix it by putting this line:

    sub rebuild_entity ($$);  # Forward declaration

right before the existing line:

    sub rebuild_entity ($$) {

(You'll end up with both lines in the file; see collect_parts for an example.)

Regards,

David.

From philipp_subx at redfish-solutions.com Wed Feb 20 19:20:25 2013
From: philipp_subx at redfish-solutions.com (Philip Prindeville)
Date: Wed, 20 Feb 2013 17:20:25 -0700
Subject: [Mimedefang] MIME::Entity not handling Charset => 'utf-8' correctly?
Message-ID: <582EA0B3-DD1B-4308-A56F-A82C40751B0E@redfish-solutions.com>

Hi.

I'm trying to generate a message as a footer in mimedefang-filter (in filter_end()) when I see certain message contents, but I'm running into what looks like a bug. I've reproduced it here:

[philipp]$ cat test.pl
#!/usr/bin/perl -w

use strict;
use warnings;

use MIME::Entity;
use MIME::QuotedPrint;
use HTML::Entities;

my $string = decode_qp("Ellipsis=E2=80=A6\n");

utf8::upgrade($string);

print "string: ", $string;

print "hex: ", unpack('H*', $string), "\n";

my $msg = encode_entities($string, '"<>&');

my @strings = (
    "\n",
    $msg,
    "\n"
);

my $html = MIME::Entity->build(
    Top => 0,
    Type => 'text/html',
    Encoding => 'quoted-printable',
    Charset => 'utf-8',
    Data => [ @strings ],
);

print $html->as_string(), "\n";

exit 0;
[philipp]$ ./test.pl
string: Ellipsis?
hex: 456c6c6970736973e280a60a
Content-Type: text/html; charset="utf-8"
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable


Ellipsis=C3=A2=C2=80=C2=A6

[philipp]$

From what I can tell, if I do a Data::Dumper() on $html->bodyhandle()->{'MBS_Data'} then it looks like the 3 UTF-8 bytes (0xe280a6) have been converted into \x{e2}, \x{80}, \x{a6} instead? Which I don't understand, since I've explicitly called the Charset out as being 'utf-8'. It looks like the string is being interpreted as latin1, not utf8.

What am I doing wrong?

Thanks,

-Philip

From philipp_subx at redfish-solutions.com Wed Feb 20 19:55:23 2013
From: philipp_subx at redfish-solutions.com (Philip Prindeville)
Date: Wed, 20 Feb 2013 17:55:23 -0700
Subject: [Mimedefang] MIME::Entity not handling Charset => 'utf-8' correctly?
In-Reply-To: <582EA0B3-DD1B-4308-A56F-A82C40751B0E@redfish-solutions.com>
References: <582EA0B3-DD1B-4308-A56F-A82C40751B0E@redfish-solutions.com>
Message-ID: <2BD5D396-4C4B-4E4B-A5A0-89F1496056F9@redfish-solutions.com>

Should probably mention I'm running:

perl-MIME-Types-1.28-2.el6.noarch
perl-MIME-Lite-3.027-2.el6.noarch
perl-MIME-tools-5.427-4.el6.noarch
perl-MIME-Base32-1.02a-1.el6.noarch
mimedefang-2.73-3.el6.x86_64
perl-5.10.1-127.el6.x86_64

on CentOS 6.3.

On Feb 20, 2013, at 5:20 PM, Philip Prindeville wrote:

> Hi.
>
> I'm trying to generate a message as a footer in mimedefang-filter (in filter_end()) when I see certain message contents, but I'm running into what looks like a bug. I've reproduced it here:
>
> [philipp]$ cat test.pl
> #!/usr/bin/perl -w
>
> use strict;
> use warnings;
>
> use MIME::Entity;
> use MIME::QuotedPrint;
> use HTML::Entities;
>
> my $string = decode_qp("Ellipsis=E2=80=A6\n");
>
> utf8::upgrade($string);
>
> print "string: ", $string;
>
> print "hex: ", unpack('H*', $string), "\n";
>
> my $msg = encode_entities($string, '"<>&');
>
> my @strings = (
>     "\n",
>     $msg,
>     "\n"
> );
>
> my $html = MIME::Entity->build(
>     Top => 0,
>     Type => 'text/html',
>     Encoding => 'quoted-printable',
>     Charset => 'utf-8',
>     Data => [ @strings ],
> );
>
> print $html->as_string(), "\n";
>
> exit 0;
> [philipp]$ ./test.pl
> string: Ellipsis?
> hex: 456c6c6970736973e280a60a
> Content-Type: text/html; charset="utf-8"
> Content-Disposition: inline
> Content-Transfer-Encoding: quoted-printable
>
>
> Ellipsis=C3=A2=C2=80=C2=A6
>
> [philipp]$
>
> From what I can tell, if I do a Data::Dumper() on $html->bodyhandle()->{'MBS_Data'} then it looks like the 3 UTF-8 bytes (0xe280a6) have been converted into \x{e2}, \x{80}, \x{a6} instead? Which I don't understand, since I've explicitly called the Charset out as being 'utf-8'. It looks like the string is being interpreted as latin1, not utf8.
>
> What am I doing wrong?
>
> Thanks,
>
> -Philip
>
> _______________________________________________
> NOTE: If there is a disclaimer or other legal boilerplate in the above
> message, it is NULL AND VOID. You may ignore it.
>
> Visit http://www.mimedefang.org and http://www.roaringpenguin.com
> MIMEDefang mailing list MIMEDefang at lists.roaringpenguin.com
> http://lists.roaringpenguin.com/mailman/listinfo/mimedefang

From dfs at roaringpenguin.com Wed Feb 20 20:21:49 2013
From: dfs at roaringpenguin.com (David F. Skoll)
Date: Wed, 20 Feb 2013 20:21:49 -0500
Subject: [Mimedefang] MIME::Entity not handling Charset => 'utf-8' correctly?
In-Reply-To: <582EA0B3-DD1B-4308-A56F-A82C40751B0E@redfish-solutions.com>
References: <582EA0B3-DD1B-4308-A56F-A82C40751B0E@redfish-solutions.com>
Message-ID: <20130220202149.60a4e9c3@shishi.roaringpenguin.com>

On Wed, 20 Feb 2013 17:20:25 -0700
Philip Prindeville wrote:

> utf8::upgrade($string);

UTF-8 handling in Perl is tricky. If you use utf8::upgrade, it is almost
certainly a bug. I have never encountered a situation in which
utf8::upgrade was appropriate.

Try putting "use Encode;" near the top of your test file and replacing

    utf8::upgrade($string);

with:

    $string = Encode::decode('utf-8', $string);

When I did that, the resulting output was:

Ellipsis=E2=80=A6

Now study the Encode man page very carefully...

Regards,

David.

From philipp_subx at redfish-solutions.com Wed Feb 20 21:20:56 2013
From: philipp_subx at redfish-solutions.com (Philip Prindeville)
Date: Wed, 20 Feb 2013 19:20:56 -0700
Subject: [Mimedefang] MIME::Entity not handling Charset => 'utf-8' correctly?
Message-ID: <9F099E00-1689-42AF-86B9-0C55F321FFCE@redfish-solutions.com>

Awesome, that worked!
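
The difference between the two calls can be reproduced with core modules only. The following is a minimal sketch, not the MIME::Entity code path itself: `wire_form` is a hypothetical stand-in that assumes the output stage encodes characters as UTF-8 before applying quoted-printable, which is consistent with the output reported in this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();
use MIME::QuotedPrint qw(decode_qp encode_qp);

# Hypothetical stand-in for the output stage:
# characters -> UTF-8 octets -> quoted-printable.
sub wire_form {
    my ($text) = @_;
    return encode_qp(Encode::encode('UTF-8', $text));
}

# Raw octets: "Ellipsis" plus the three UTF-8 bytes of U+2026.
my $octets = decode_qp("Ellipsis=E2=80=A6\n");

# utf8::upgrade() only changes Perl's internal storage. The string still
# holds three separate characters: U+00E2, U+0080, U+00A6.
my $upgraded = $octets;
utf8::upgrade($upgraded);

# Encode::decode() interprets the octets as UTF-8: one character, U+2026.
my $decoded = Encode::decode('UTF-8', $octets);

printf "upgraded: %d chars -> %s", length($upgraded), wire_form($upgraded);
printf "decoded:  %d chars -> %s", length($decoded),  wire_form($decoded);
```

The upgraded string is 12 characters long and serialises to Ellipsis=C3=A2=C2=80=C2=A6, while the decoded string is 10 characters long and serialises to Ellipsis=E2=80=A6.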

I'm wondering if in MIME::Body we should take:

sub as_string {
    my $self = shift;
    my $str = '';
    my $fh = IO::File->new(\$str, '>:') or croak("Cannot open in-memory file: $!");
    $self->print($fh);
    close($fh);
    return $str;
}

and have:

    return Encode::decode($charset, $str);

instead, but I'm not sure how we'd retrieve $charset? It would need to be stored into MIME::Body which isn't currently the case.

Thanks,

-Philip

On Feb 20, 2013, at 6:21 PM, David F. Skoll wrote:

>
> Try putting "use Encode;" near the top of your test file and replacing
>
> utf8::upgrade($string);
>
> with:
>
> $string = Encode::decode('utf-8', $string);

From skmimedefang at smail.inf.fh-bonn-rhein-sieg.de Thu Feb 21 05:36:41 2013
From: skmimedefang at smail.inf.fh-bonn-rhein-sieg.de (Steffen Kaiser)
Date: Thu, 21 Feb 2013 11:36:41 +0100 (CET)
Subject: [Mimedefang] MIME::Entity not handling Charset => 'utf-8' correctly?
In-Reply-To: <9F099E00-1689-42AF-86B9-0C55F321FFCE@redfish-solutions.com>
References: <9F099E00-1689-42AF-86B9-0C55F321FFCE@redfish-solutions.com>
Message-ID:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wed, 20 Feb 2013, Philip Prindeville wrote:

> Awesome, that worked!
>
> I'm wondering if in MIME::Body we should take:
>
> sub as_string {
>     my $self = shift;
>     my $str = '';
>     my $fh = IO::File->new(\$str, '>:') or croak("Cannot open in-memory file: $!");
>     $self->print($fh);
>     close($fh);
>     return $str;
> }
>
> and have:
>
> return Encode::decode($charset, $str);

I suppose that violates the internals of the MIME:: and Mail:: namespace functions. They are tied together very closely.

Actually, I looked into a UTF8 MIMEtools a few years back to overcome character set problems when storing header data into a postgres database. I thought that everything the MIME:: functions should return would be in Perl utf8, any character set information already decoded. Anything the functions get passed into is Perl internal utf-8 as well. I think one would need to rewrite the whole framework anew.

> instead, but I'm not sure how we'd retrieve $charset? It would need to be stored into MIME::Body which isn't currently the case.

Encode is a tricky module by its own, perldoc Encode:

"Handling Malformed Data
    The optional CHECK argument tells Encode what to do when it encounters
    malformed data. Without CHECK, Encode::FB_DEFAULT ( == 0 ) is assumed.

    As of version 2.12 Encode supports coderef values for CHECK. See below.

    NOTE: Not all encoding support this feature
      Some encodings ignore CHECK argument. For example, Encode::Unicode
      ignores CHECK and it always croaks on error.
"

Some encodings modify the $str argument to return the characters NOT decoded. So you'd call Encode::decode($charset, "".$str) to enforce a copy
- - but have the performance penalty.

I also got weird results with decode('latin1', $str). I guess because of

"CAVEAT: When you run "$string = decode("utf8", $octets)", then $string
may not be equal to $octets. Though they both contain the same data,
the UTF8 flag for $string is on unless $octets entirely consists of
ASCII data (or EBCDIC on EBCDIC machines)."

When I pass results of decode('latin1', $str) to LDAP or Postgres, I sometimes get errors.

I pass all strings through a function now, that looks terrible, but since then Web, Postgres, LDAP and text files play together.

> On Feb 20, 2013, at 6:21 PM, David F. Skoll wrote:

>> Try putting "use Encode;" near the top of your test file and replacing
>>
>> utf8::upgrade($string);
>>
>> with:
>>
>> $string = Encode::decode('utf-8', $string);

In fact, I found that utf8::upgrade() works for me in order to replace decode('latin1'), which seems to "do nothing", causing other modules, like Net::LDAP or DBD::Pg, to pass invalid UTF8 to the services.

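The concern above about Encode overwriting its source argument applies when a CHECK value is given. One alternative to forcing a copy with "".$str is to OR Encode::LEAVE_SRC into CHECK. The sketch below assumes strict UTF-8 decoding is wanted; `decode_strict` is a hypothetical helper name, not something from MIME-tools or MIMEDefang.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: strict decoding that leaves its source string alone.
# FB_CROAK makes malformed input fatal; LEAVE_SRC asks Encode not to
# overwrite the source argument it was given.
sub decode_strict {
    my ($charset, $octets) = @_;
    my $text = Encode::decode($charset, $octets,
                              Encode::FB_CROAK() | Encode::LEAVE_SRC());
    return ($text, $octets);    # second value: the source as Encode left it
}

my ($text, $src) = decode_strict('UTF-8', "Ellipsis\xE2\x80\xA6");
printf "%d characters decoded, source still %d octets\n",
    length($text), length($src);

# Malformed input (0xFF never occurs in UTF-8) is reported, not passed on.
my $ok = eval { decode_strict('UTF-8', "caf\xFF"); 1 };
print "malformed input rejected\n" unless $ok;
```

Whether strict failure or a silent substitution character is preferable depends on where the strings end up; for data headed to LDAP or a database, failing early is arguably easier to debug.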
- --
Steffen Kaiser
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)

iQEVAwUBUSX4uZ8mjdm1m0FfAQJLPAf9EPC0E+gm5cJ4PvwxQHT2MzGoTmfLz1/C
nd7kihJnCqmWHQeYLhRlETqX4D1vG/ZGS6WbaP8Fybn400Tfb4JZBs9kZafS7dri
z3r6wk70Vd0By7GM5zIPlTbovU7HqiIFBBoHrdLkaSvzGq95ZfyH5u8aZjj39D85
2nDracTpxp9VF1rsgDi9I3z2lJpRjtJsufVUTvIhynOghQoAhw0S8FEAp7CrLnOX
UHsTTW1+CPhJA3zxY7jgGKV65smNYjtB4MZ1D0cxq2Y6Op7R2NmbRZrlXfFsfMBs
ah7y6nOmlOOpJ1oG760qZY31GjAcvuHgzcliV6rBXueMb1qSM3yHyw==
=A/mV
-----END PGP SIGNATURE-----

From philipp_subx at redfish-solutions.com Thu Feb 21 12:35:51 2013
From: philipp_subx at redfish-solutions.com (Philip Prindeville)
Date: Thu, 21 Feb 2013 10:35:51 -0700
Subject: [Mimedefang] MIME::Entity not handling Charset => 'utf-8' correctly?
In-Reply-To:
References: <9F099E00-1689-42AF-86B9-0C55F321FFCE@redfish-solutions.com>
Message-ID: <7F0FA6BA-D3FD-471D-8313-28BB48DED0D5@redfish-solutions.com>

On Feb 21, 2013, at 3:36 AM, Steffen Kaiser wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On Wed, 20 Feb 2013, Philip Prindeville wrote:
>
> I suppose that violates the internals of the MIME:: and Mail:: namespace functions. They are tied together very closely.

Not sure I follow. MIME:: can't have external dependencies?

>
> Actually, I looked into a UTF8 MIMEtools a few years back to overcome character set problems when storing header data into a postgres database. I thought that everything the MIME:: functions should return would be in Perl utf8, any character set information already decoded. Anything the functions get passed into is Perl internal utf-8 as well. I think one would need to rewrite the whole framework anew.

It seems a reasonable goal to have all strings be stored in internal format, except for functions that explicitly generate "on-the-wire" formatted strings.

Could this also be controlled by either a global gating variable that controls a semantic, or by detecting if the caller has "utf8" loaded into this space?

Of course, things get complicated when MIME:: is used by some intermediate module (Foo:: for instance) which uses utf8 but the user's main' doesn't? > > Encode is a tricky module by its own, perldoc Encode: > > [?] > > Some encodings modify the $str argument to return the characters NOT decoded. So you'd call Encode::decode($charset, "".$str) to enforce a copy - - but have the performance penalty. That seems to be an issue with Encode not having consistent semantics. That might be fixed separately. Encode is part of perl core, isn't it? > > [?] > > I pass all strings through a function now, that looks terrible, but since then Web, Postgres, LDAP and text files play together. And would fixing the semantics of Encode to be more uniform fix that, or is this an orthogonal issue? -Philip From Mark at Misty.com Wed Feb 6 17:00:55 2013 From: Mark at Misty.com (Mark G Thomas) Date: Wed, 6 Feb 2013 17:00:55 -0500 Subject: [Mimedefang] main::rebuild_entity() called too early to check prototype Message-ID: <20130206220055.GA27440@allie.home.misty.com> Hi, I'm running into some issues with perl-5.16.2 and mimedefang. Errors like this: defined(@array) is deprecated at /opt/mimedefang-2.67/bin/mimedefang.pl line 7335. and this: mimedefang-multiplexor[11932]: Slave 5 stderr: main::rebuild_entity() called too early to check prototype at /opt/mimedefang-2.73/bin/mimedefang.pl line 805. The first one seems fixed by just changing to "if (!@arraya)", but I'm not sure how to fix the other error. Mark -- Mark G. Thomas (Mark at Misty.com) From dfs at roaringpenguin.com Wed Feb 6 17:42:58 2013 From: dfs at roaringpenguin.com (David F. 
Skoll) Date: Wed, 6 Feb 2013 17:42:58 -0500 Subject: [Mimedefang] main::rebuild_entity() called too early to check prototype In-Reply-To: <20130206220055.GA27440@allie.home.misty.com> References: <20130206220055.GA27440@allie.home.misty.com> Message-ID: <20130206174258.1c350575@shishi.roaringpenguin.com> On Wed, 6 Feb 2013 17:00:55 -0500 Mark G Thomas wrote: > mimedefang-multiplexor[11932]: Slave 5 stderr: main::rebuild_entity() > called too early to check prototype > at /opt/mimedefang-2.73/bin/mimedefang.pl line 805. > The first one seems fixed by just changing to "if (!@arraya)", but > I'm not sure how to fix the other error. You can fix it by putting this line: sub rebuild_entity ($$); # Declare for forward declaration right before the existing line: sub rebuild_entity ($$) { (You'll end up with both lines in the file; see collect_parts for an example.) Regards, David. From philipp_subx at redfish-solutions.com Wed Feb 20 19:20:25 2013 From: philipp_subx at redfish-solutions.com (Philip Prindeville) Date: Wed, 20 Feb 2013 17:20:25 -0700 Subject: [Mimedefang] MIME::Entity not handling Charset => 'utf-8' correctly? Message-ID: <582EA0B3-DD1B-4308-A56F-A82C40751B0E@redfish-solutions.com> Hi. I'm trying to generate a message as a footer in mimedefang-filter (in filter_end()) when I see certain message contents, but I'm running into what looks like a bug. I've reproduced it here: [philipp]$ cat test.pl #!/usr/bin/perl -w use strict; use warnings; use MIME::Entity; use MIME::QuotedPrint; use HTML::Entities; my $string = decode_qp("Ellipsis=E2=80=A6\n"); utf8::upgrade($string); print "string: ", $string; print "hex: ", unpack('H*', $string), "\n"; my $msg = encode_entities($string, '"<>&'); my @strings = ( "\n", $msg, "\n" ); my $html = MIME::Entity->build( Top => 0, Type => 'text/html', Encoding => 'quoted-printable', Charset => 'utf-8', Data => [ @strings ], ); print $html->as_string(), "\n"; exit 0; [philipp]$ ./test.pl string: Ellipsis? 
hex: 456c6c6970736973e280a60a Content-Type: text/html; charset="utf-8" Content-Disposition: inline Content-Transfer-Encoding: quoted-printable Ellipsis=C3=A2=C2=80=C2=A6 [philipp]$ from what I can tell, if I do a Data::Dumper() on $html->bodyhandle()->{'MBS_Data'} then it looks like the 3 UTF characters (0xa280a8) have been converted into \x{a2}, \x{80}, \x{a8} instead? Which I don't understand, since I've explicitly called the Charset out as being 'utf-8'. It looks like the string is being interpreted as latin1, not utf8. What am I doing wrong? Thanks, -Philip From philipp_subx at redfish-solutions.com Wed Feb 20 19:55:23 2013 From: philipp_subx at redfish-solutions.com (Philip Prindeville) Date: Wed, 20 Feb 2013 17:55:23 -0700 Subject: [Mimedefang] MIME::Entity not handling Charset => 'utf-8' correctly? In-Reply-To: <582EA0B3-DD1B-4308-A56F-A82C40751B0E@redfish-solutions.com> References: <582EA0B3-DD1B-4308-A56F-A82C40751B0E@redfish-solutions.com> Message-ID: <2BD5D396-4C4B-4E4B-A5A0-89F1496056F9@redfish-solutions.com> Should probably mention I'm running: perl-MIME-Types-1.28-2.el6.noarch perl-MIME-Lite-3.027-2.el6.noarch perl-MIME-tools-5.427-4.el6.noarch perl-MIME-Base32-1.02a-1.el6.noarch mimedefang-2.73-3.el6.x86_64 perl-5.10.1-127.el6.x86_64 on CentOS 6.3. On Feb 20, 2013, at 5:20 PM, Philip Prindeville wrote: > Hi. > > I'm trying to generate a message as a footer in mimedefang-filter (in filter_end()) when I see certain message contents, but I'm running into what looks like a bug. 
I've reproduced it here: > > [philipp]$ cat test.pl > #!/usr/bin/perl -w > > use strict; > use warnings; > > use MIME::Entity; > use MIME::QuotedPrint; > use HTML::Entities; > > my $string = decode_qp("Ellipsis=E2=80=A6\n"); > > utf8::upgrade($string); > > print "string: ", $string; > > print "hex: ", unpack('H*', $string), "\n"; > > my $msg = encode_entities($string, '"<>&'); > > my @strings = ( > "\n", > $msg, > "\n" > ); > > > my $html = MIME::Entity->build( > Top => 0, > Type => 'text/html', > Encoding => 'quoted-printable', > Charset => 'utf-8', > Data => [ @strings ], > ); > > print $html->as_string(), "\n"; > > exit 0; > [philipp]$ ./test.pl > string: Ellipsis? > hex: 456c6c6970736973e280a60a > Content-Type: text/html; charset="utf-8" > Content-Disposition: inline > Content-Transfer-Encoding: quoted-printable > > > Ellipsis=C3=A2=C2=80=C2=A6 > > > [philipp]$ > > > from what I can tell, if I do a Data::Dumper() on $html->bodyhandle()->{'MBS_Data'} then it looks like the 3 UTF characters (0xa280a8) have been converted into \x{a2}, \x{80}, \x{a8} instead? Which I don't understand, since I've explicitly called the Charset out as being 'utf-8'. It looks like the string is being interpreted as latin1, not utf8. > > What am I doing wrong? > > Thanks, > > -Philip > > _______________________________________________ > NOTE: If there is a disclaimer or other legal boilerplate in the above > message, it is NULL AND VOID. You may ignore it. > > Visit http://www.mimedefang.org and http://www.roaringpenguin.com > MIMEDefang mailing list MIMEDefang at lists.roaringpenguin.com > http://lists.roaringpenguin.com/mailman/listinfo/mimedefang From dfs at roaringpenguin.com Wed Feb 20 20:21:49 2013 From: dfs at roaringpenguin.com (David F. Skoll) Date: Wed, 20 Feb 2013 20:21:49 -0500 Subject: [Mimedefang] MIME::Entity not handling Charset => 'utf-8' correctly? 
In-Reply-To: <582EA0B3-DD1B-4308-A56F-A82C40751B0E@redfish-solutions.com> References: <582EA0B3-DD1B-4308-A56F-A82C40751B0E@redfish-solutions.com> Message-ID: <20130220202149.60a4e9c3@shishi.roaringpenguin.com> On Wed, 20 Feb 2013 17:20:25 -0700 Philip Prindeville wrote: > utf8::upgrade($string); UTF-8 handling in Perl is tricky. If you use utf8::upgrade, it is almost certainly a bug. I have never encountered a situation in which utf8::upgrade was appropriate. Try putting "use Encode;" near the top of your test file and replacing utf8::upgrade($string); with: $string = Encode::decode('utf-8', $string); When I did that, the resulting output was: Ellipsis=E2=80=A6 Now study the Encode man page very carefully... Regards, David. From philipp_subx at redfish-solutions.com Wed Feb 20 21:20:56 2013 From: philipp_subx at redfish-solutions.com (Philip Prindeville) Date: Wed, 20 Feb 2013 19:20:56 -0700 Subject: [Mimedefang] MIME::Entity not handling Charset => 'utf-8' correctly? Message-ID: <9F099E00-1689-42AF-86B9-0C55F321FFCE@redfish-solutions.com> Awesome, that worked! I'm wondering if in MIME::Body we should take: sub as_string { my $self = shift; my $str = ''; my $fh = IO::File->new(\$str, '>:') or croak("Cannot open in-memory file: $!"); $self->print($fh); close($fh); return $str; } and have: return Encode::decode($charset, $str); instead, but I'm not sure how we'd retrieve $charset? It would need to be stored into MIME::Body which isn't currently the case. Thanks, -Philip On Feb 20, 2013, at 6:21 PM, David F. Skoll wrote: > > Try putting "use Encode;" near the top of your test file and replacing > > utf8::upgrade($string); > > with: > > $string = Encode::decode('utf-8', $string); From skmimedefang at smail.inf.fh-bonn-rhein-sieg.de Thu Feb 21 05:36:41 2013 From: skmimedefang at smail.inf.fh-bonn-rhein-sieg.de (Steffen Kaiser) Date: Thu, 21 Feb 2013 11:36:41 +0100 (CET) Subject: [Mimedefang] MIME::Entity not handling Charset => 'utf-8' correctly? 
In-Reply-To: <9F099E00-1689-42AF-86B9-0C55F321FFCE@redfish-solutions.com> References: <9F099E00-1689-42AF-86B9-0C55F321FFCE@redfish-solutions.com> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Wed, 20 Feb 2013, Philip Prindeville wrote: > Awesome, that worked! > > I'm wondering if in MIME::Body we should take: > > sub as_string { > my $self = shift; > my $str = ''; > my $fh = IO::File->new(\$str, '>:') or croak("Cannot open in-memory file: $!"); > $self->print($fh); > close($fh); > return $str; > } > > and have: > > return Encode::decode($charset, $str); I suppose that violates the internals of the MIME:: and Mail:: namespace functions. They are tied together very closly. Actually, I looked into a UTF8 MIMEtools a few years back to overcome character set problems when storing header data into a postgres database. I thought that everything the MIME:: functions should return would be in Perl utf8, any character set information already decoded. Anything the functions get passed into is Perl internal utf-8 as well. I think one would need to rewrite the whole framework anew. > instead, but I'm not sure how we'd retrieve $charset? It would need to be stored into MIME::Body which isn't currently the case. Encode is a tricky module by its own, perldoc Encode: "Handling Malformed Data The optional CHECK argument tells Encode what to do when it encounters malformed data. Without CHECK, Encode::FB_DEFAULT ( == 0 ) is assumed. As of version 2.12 Encode supports coderef values for CHECK. See below. NOTE: Not all encoding support this feature Some encodings ignore CHECK argument. For example, Encode::Unicode ignores CHECK and it always croaks on error. " Some encodings modify the $str argument to return the characters NOT decoded. So you'd call Encode::decode($charset, "".$str) to enforce a copy - - but have the performance penalty. I also got weired results with decode('latin1', $str). 
I guess because of "CAVEAT: When you run "$string = decode("utf8", $octets)", then $string may not be equal to $octets. Though they both contain the same data, the UTF8 flag for $string is on unless $octets entirely consists of ASCII data (or EBCDIC on EBCDIC machines)." When I pass results of decode('latin1', $str) to LDAP or Postgres, I sometimes get errors. I pass all strings through a function now, that looks terrible, but since then Web, Postgres, LDAP and text files play together. > On Feb 20, 2013, at 6:21 PM, David F. Skoll wrote: >> Try putting "use Encode;" near the top of your test file and replacing >> >> utf8::upgrade($string); >> >> with: >> >> $string = Encode::decode('utf-8', $string); In fact, I found that utf8::upgrade() works for me in order to replace decode('latin1'), which seems to "do nothing", causing other modules, like Net::LDAP or DBD::Pg, to pass invalid UTF8 to the services. - -- Steffen Kaiser -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQEVAwUBUSX4uZ8mjdm1m0FfAQJLPAf9EPC0E+gm5cJ4PvwxQHT2MzGoTmfLz1/C nd7kihJnCqmWHQeYLhRlETqX4D1vG/ZGS6WbaP8Fybn400Tfb4JZBs9kZafS7dri z3r6wk70Vd0By7GM5zIPlTbovU7HqiIFBBoHrdLkaSvzGq95ZfyH5u8aZjj39D85 2nDracTpxp9VF1rsgDi9I3z2lJpRjtJsufVUTvIhynOghQoAhw0S8FEAp7CrLnOX UHsTTW1+CPhJA3zxY7jgGKV65smNYjtB4MZ1D0cxq2Y6Op7R2NmbRZrlXfFsfMBs ah7y6nOmlOOpJ1oG760qZY31GjAcvuHgzcliV6rBXueMb1qSM3yHyw== =A/mV -----END PGP SIGNATURE----- From philipp_subx at redfish-solutions.com Thu Feb 21 12:35:51 2013 From: philipp_subx at redfish-solutions.com (Philip Prindeville) Date: Thu, 21 Feb 2013 10:35:51 -0700 Subject: [Mimedefang] MIME::Entity not handling Charset => 'utf-8' correctly? 
In-Reply-To: References: <9F099E00-1689-42AF-86B9-0C55F321FFCE@redfish-solutions.com> Message-ID: <7F0FA6BA-D3FD-471D-8313-28BB48DED0D5@redfish-solutions.com> On Feb 21, 2013, at 3:36 AM, Steffen Kaiser wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > On Wed, 20 Feb 2013, Philip Prindeville wrote: > > I suppose that violates the internals of the MIME:: and Mail:: namespace functions. They are tied together very closely. Not sure I follow. MIME:: can't have external dependencies? > > Actually, I looked into a UTF8 MIMEtools a few years back to overcome character set problems when storing header data into a postgres database. I thought that everything the MIME:: functions should return would be in Perl utf8, any character set information already decoded. Anything the functions get passed into is Perl internal utf-8 as well. I think one would need to rewrite the whole framework anew. It seems a reasonable goal to have all strings be stored in internal format, except for functions that explicitly generate "on-the-wire" formatted strings. Could this also be controlled by either a global gating variable that controls a semantic, or by detecting if the caller has "utf8" loaded into this space? Of course, things get complicated when MIME:: is used by some intermediate module (Foo:: for instance) which uses utf8 but the user's main' doesn't? > > Encode is a tricky module by its own, perldoc Encode: > > [?] > > Some encodings modify the $str argument to return the characters NOT decoded. So you'd call Encode::decode($charset, "".$str) to enforce a copy - - but have the performance penalty. That seems to be an issue with Encode not having consistent semantics. That might be fixed separately. Encode is part of perl core, isn't it? > > [?] > > I pass all strings through a function now, that looks terrible, but since then Web, Postgres, LDAP and text files play together. 
And would fixing the semantics of Encode to be more uniform fix that, or is this an orthogonal issue? -Philip From Mark at Misty.com Wed Feb 6 17:00:55 2013 From: Mark at Misty.com (Mark G Thomas) Date: Wed, 6 Feb 2013 17:00:55 -0500 Subject: [Mimedefang] main::rebuild_entity() called too early to check prototype Message-ID: <20130206220055.GA27440@allie.home.misty.com> Hi, I'm running into some issues with perl-5.16.2 and mimedefang. Errors like this: defined(@array) is deprecated at /opt/mimedefang-2.67/bin/mimedefang.pl line 7335. and this: mimedefang-multiplexor[11932]: Slave 5 stderr: main::rebuild_entity() called too early to check prototype at /opt/mimedefang-2.73/bin/mimedefang.pl line 805. The first one seems fixed by just changing to "if (!@arraya)", but I'm not sure how to fix the other error. Mark -- Mark G. Thomas (Mark at Misty.com) From dfs at roaringpenguin.com Wed Feb 6 17:42:58 2013 From: dfs at roaringpenguin.com (David F. Skoll) Date: Wed, 6 Feb 2013 17:42:58 -0500 Subject: [Mimedefang] main::rebuild_entity() called too early to check prototype In-Reply-To: <20130206220055.GA27440@allie.home.misty.com> References: <20130206220055.GA27440@allie.home.misty.com> Message-ID: <20130206174258.1c350575@shishi.roaringpenguin.com> On Wed, 6 Feb 2013 17:00:55 -0500 Mark G Thomas wrote: > mimedefang-multiplexor[11932]: Slave 5 stderr: main::rebuild_entity() > called too early to check prototype > at /opt/mimedefang-2.73/bin/mimedefang.pl line 805. > The first one seems fixed by just changing to "if (!@arraya)", but > I'm not sure how to fix the other error. You can fix it by putting this line: sub rebuild_entity ($$); # Declare for forward declaration right before the existing line: sub rebuild_entity ($$) { (You'll end up with both lines in the file; see collect_parts for an example.) Regards, David. 
From philipp_subx at redfish-solutions.com Wed Feb 20 19:20:25 2013 From: philipp_subx at redfish-solutions.com (Philip Prindeville) Date: Wed, 20 Feb 2013 17:20:25 -0700 Subject: [Mimedefang] MIME::Entity not handling Charset => 'utf-8' correctly? Message-ID: <582EA0B3-DD1B-4308-A56F-A82C40751B0E@redfish-solutions.com> Hi. I'm trying to generate a message as a footer in mimedefang-filter (in filter_end()) when I see certain message contents, but I'm running into what looks like a bug. I've reproduced it here: [philipp]$ cat test.pl #!/usr/bin/perl -w use strict; use warnings; use MIME::Entity; use MIME::QuotedPrint; use HTML::Entities; my $string = decode_qp("Ellipsis=E2=80=A6\n"); utf8::upgrade($string); print "string: ", $string; print "hex: ", unpack('H*', $string), "\n"; my $msg = encode_entities($string, '"<>&'); my @strings = ( "\n", $msg, "\n" ); my $html = MIME::Entity->build( Top => 0, Type => 'text/html', Encoding => 'quoted-printable', Charset => 'utf-8', Data => [ @strings ], ); print $html->as_string(), "\n"; exit 0; [philipp]$ ./test.pl string: Ellipsis? hex: 456c6c6970736973e280a60a Content-Type: text/html; charset="utf-8" Content-Disposition: inline Content-Transfer-Encoding: quoted-printable Ellipsis=C3=A2=C2=80=C2=A6 [philipp]$ from what I can tell, if I do a Data::Dumper() on $html->bodyhandle()->{'MBS_Data'} then it looks like the 3 UTF characters (0xa280a8) have been converted into \x{a2}, \x{80}, \x{a8} instead? Which I don't understand, since I've explicitly called the Charset out as being 'utf-8'. It looks like the string is being interpreted as latin1, not utf8. What am I doing wrong? Thanks, -Philip From philipp_subx at redfish-solutions.com Wed Feb 20 19:55:23 2013 From: philipp_subx at redfish-solutions.com (Philip Prindeville) Date: Wed, 20 Feb 2013 17:55:23 -0700 Subject: [Mimedefang] MIME::Entity not handling Charset => 'utf-8' correctly? 
In-Reply-To: <582EA0B3-DD1B-4308-A56F-A82C40751B0E@redfish-solutions.com> References: <582EA0B3-DD1B-4308-A56F-A82C40751B0E@redfish-solutions.com> Message-ID: <2BD5D396-4C4B-4E4B-A5A0-89F1496056F9@redfish-solutions.com> Should probably mention I'm running: perl-MIME-Types-1.28-2.el6.noarch perl-MIME-Lite-3.027-2.el6.noarch perl-MIME-tools-5.427-4.el6.noarch perl-MIME-Base32-1.02a-1.el6.noarch mimedefang-2.73-3.el6.x86_64 perl-5.10.1-127.el6.x86_64 on CentOS 6.3. On Feb 20, 2013, at 5:20 PM, Philip Prindeville wrote: > Hi. > > I'm trying to generate a message as a footer in mimedefang-filter (in filter_end()) when I see certain message contents, but I'm running into what looks like a bug. I've reproduced it here: > > [philipp]$ cat test.pl > #!/usr/bin/perl -w > > use strict; > use warnings; > > use MIME::Entity; > use MIME::QuotedPrint; > use HTML::Entities; > > my $string = decode_qp("Ellipsis=E2=80=A6\n"); > > utf8::upgrade($string); > > print "string: ", $string; > > print "hex: ", unpack('H*', $string), "\n"; > > my $msg = encode_entities($string, '"<>&'); > > my @strings = ( > "\n", > $msg, > "\n" > ); > > > my $html = MIME::Entity->build( > Top => 0, > Type => 'text/html', > Encoding => 'quoted-printable', > Charset => 'utf-8', > Data => [ @strings ], > ); > > print $html->as_string(), "\n"; > > exit 0; > [philipp]$ ./test.pl > string: Ellipsis? > hex: 456c6c6970736973e280a60a > Content-Type: text/html; charset="utf-8" > Content-Disposition: inline > Content-Transfer-Encoding: quoted-printable > > > Ellipsis=C3=A2=C2=80=C2=A6 > > > [philipp]$ > > > from what I can tell, if I do a Data::Dumper() on $html->bodyhandle()->{'MBS_Data'} then it looks like the 3 UTF characters (0xa280a8) have been converted into \x{a2}, \x{80}, \x{a8} instead? Which I don't understand, since I've explicitly called the Charset out as being 'utf-8'. It looks like the string is being interpreted as latin1, not utf8. > > What am I doing wrong? 
> > Thanks, > > -Philip > > _______________________________________________ > NOTE: If there is a disclaimer or other legal boilerplate in the above > message, it is NULL AND VOID. You may ignore it. > > Visit http://www.mimedefang.org and http://www.roaringpenguin.com > MIMEDefang mailing list MIMEDefang at lists.roaringpenguin.com > http://lists.roaringpenguin.com/mailman/listinfo/mimedefang From dfs at roaringpenguin.com Wed Feb 20 20:21:49 2013 From: dfs at roaringpenguin.com (David F. Skoll) Date: Wed, 20 Feb 2013 20:21:49 -0500 Subject: [Mimedefang] MIME::Entity not handling Charset => 'utf-8' correctly? In-Reply-To: <582EA0B3-DD1B-4308-A56F-A82C40751B0E@redfish-solutions.com> References: <582EA0B3-DD1B-4308-A56F-A82C40751B0E@redfish-solutions.com> Message-ID: <20130220202149.60a4e9c3@shishi.roaringpenguin.com> On Wed, 20 Feb 2013 17:20:25 -0700 Philip Prindeville wrote: > utf8::upgrade($string); UTF-8 handling in Perl is tricky. If you use utf8::upgrade, it is almost certainly a bug. I have never encountered a situation in which utf8::upgrade was appropriate. Try putting "use Encode;" near the top of your test file and replacing utf8::upgrade($string); with: $string = Encode::decode('utf-8', $string); When I did that, the resulting output was: Ellipsis=E2=80=A6 Now study the Encode man page very carefully... Regards, David. From philipp_subx at redfish-solutions.com Wed Feb 20 21:20:56 2013 From: philipp_subx at redfish-solutions.com (Philip Prindeville) Date: Wed, 20 Feb 2013 19:20:56 -0700 Subject: [Mimedefang] MIME::Entity not handling Charset => 'utf-8' correctly? Message-ID: <9F099E00-1689-42AF-86B9-0C55F321FFCE@redfish-solutions.com> Awesome, that worked! 
I'm wondering if in MIME::Body we should take: sub as_string { my $self = shift; my $str = ''; my $fh = IO::File->new(\$str, '>:') or croak("Cannot open in-memory file: $!"); $self->print($fh); close($fh); return $str; } and have: return Encode::decode($charset, $str); instead, but I'm not sure how we'd retrieve $charset? It would need to be stored into MIME::Body which isn't currently the case. Thanks, -Philip On Feb 20, 2013, at 6:21 PM, David F. Skoll wrote: > > Try putting "use Encode;" near the top of your test file and replacing > > utf8::upgrade($string); > > with: > > $string = Encode::decode('utf-8', $string); From skmimedefang at smail.inf.fh-bonn-rhein-sieg.de Thu Feb 21 05:36:41 2013 From: skmimedefang at smail.inf.fh-bonn-rhein-sieg.de (Steffen Kaiser) Date: Thu, 21 Feb 2013 11:36:41 +0100 (CET) Subject: [Mimedefang] MIME::Entity not handling Charset => 'utf-8' correctly? In-Reply-To: <9F099E00-1689-42AF-86B9-0C55F321FFCE@redfish-solutions.com> References: <9F099E00-1689-42AF-86B9-0C55F321FFCE@redfish-solutions.com> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Wed, 20 Feb 2013, Philip Prindeville wrote: > Awesome, that worked! > > I'm wondering if in MIME::Body we should take: > > sub as_string { > my $self = shift; > my $str = ''; > my $fh = IO::File->new(\$str, '>:') or croak("Cannot open in-memory file: $!"); > $self->print($fh); > close($fh); > return $str; > } > > and have: > > return Encode::decode($charset, $str); I suppose that violates the internals of the MIME:: and Mail:: namespace functions. They are tied together very closely. Actually, I looked into a UTF8 MIMEtools a few years back to overcome character set problems when storing header data into a postgres database. I thought that everything the MIME:: functions should return would be in Perl utf8, any character set information already decoded. Anything the functions get passed into is Perl internal utf-8 as well. I think one would need to rewrite the whole framework anew.
> instead, but I'm not sure how we'd retrieve $charset? It would need to be stored into MIME::Body which isn't currently the case. Encode is a tricky module by its own, perldoc Encode: "Handling Malformed Data The optional CHECK argument tells Encode what to do when it encounters malformed data. Without CHECK, Encode::FB_DEFAULT ( == 0 ) is assumed. As of version 2.12 Encode supports coderef values for CHECK. See below. NOTE: Not all encoding support this feature Some encodings ignore CHECK argument. For example, Encode::Unicode ignores CHECK and it always croaks on error. " Some encodings modify the $str argument to return the characters NOT decoded. So you'd call Encode::decode($charset, "".$str) to enforce a copy - - but have the performance penalty. I also got weird results with decode('latin1', $str). I guess because of "CAVEAT: When you run "$string = decode("utf8", $octets)", then $string may not be equal to $octets. Though they both contain the same data, the UTF8 flag for $string is on unless $octets entirely consists of ASCII data (or EBCDIC on EBCDIC machines)." When I pass results of decode('latin1', $str) to LDAP or Postgres, I sometimes get errors. I pass all strings through a function now, that looks terrible, but since then Web, Postgres, LDAP and text files play together. > On Feb 20, 2013, at 6:21 PM, David F. Skoll wrote: >> Try putting "use Encode;" near the top of your test file and replacing >> >> utf8::upgrade($string); >> >> with: >> >> $string = Encode::decode('utf-8', $string); In fact, I found that utf8::upgrade() works for me in order to replace decode('latin1'), which seems to "do nothing", causing other modules, like Net::LDAP or DBD::Pg, to pass invalid UTF8 to the services.
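A minimal sketch of the in-place modification of the source argument mentioned above. It assumes only the core Encode module and its documented LEAVE_SRC flag; the variable names are illustrative and not taken from MIME-tools. When a CHECK value is passed, decode() may overwrite its source argument, and LEAVE_SRC looks like a way to keep the source intact without the "".$str copy:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw octets: "abc" followed by the three UTF-8 bytes of U+2026.
my $octets = "abc\xE2\x80\xA6";

# With a CHECK value and no LEAVE_SRC, decode() may overwrite its
# source argument, so hand it a scratch copy here.
my $scratch = $octets;
my $text_a  = decode('UTF-8', $scratch, Encode::FB_CROAK());

# LEAVE_SRC asks Encode not to touch the source string.
my $text_b = decode('UTF-8', $octets,
                    Encode::FB_CROAK() | Encode::LEAVE_SRC());

printf "scratch is now %d octets, original is still %d octets\n",
    length $scratch, length $octets;
printf "decoded length: %d characters\n", length $text_b;
```

Whether this helps with the LDAP and Postgres errors described above is a separate question; the sketch only covers the copy-versus-overwrite behaviour.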
- -- Steffen Kaiser -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQEVAwUBUSX4uZ8mjdm1m0FfAQJLPAf9EPC0E+gm5cJ4PvwxQHT2MzGoTmfLz1/C nd7kihJnCqmWHQeYLhRlETqX4D1vG/ZGS6WbaP8Fybn400Tfb4JZBs9kZafS7dri z3r6wk70Vd0By7GM5zIPlTbovU7HqiIFBBoHrdLkaSvzGq95ZfyH5u8aZjj39D85 2nDracTpxp9VF1rsgDi9I3z2lJpRjtJsufVUTvIhynOghQoAhw0S8FEAp7CrLnOX UHsTTW1+CPhJA3zxY7jgGKV65smNYjtB4MZ1D0cxq2Y6Op7R2NmbRZrlXfFsfMBs ah7y6nOmlOOpJ1oG760qZY31GjAcvuHgzcliV6rBXueMb1qSM3yHyw== =A/mV -----END PGP SIGNATURE----- From philipp_subx at redfish-solutions.com Thu Feb 21 12:35:51 2013 From: philipp_subx at redfish-solutions.com (Philip Prindeville) Date: Thu, 21 Feb 2013 10:35:51 -0700 Subject: [Mimedefang] MIME::Entity not handling Charset => 'utf-8' correctly? In-Reply-To: References: <9F099E00-1689-42AF-86B9-0C55F321FFCE@redfish-solutions.com> Message-ID: <7F0FA6BA-D3FD-471D-8313-28BB48DED0D5@redfish-solutions.com> On Feb 21, 2013, at 3:36 AM, Steffen Kaiser wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > On Wed, 20 Feb 2013, Philip Prindeville wrote: > > I suppose that violates the internals of the MIME:: and Mail:: namespace functions. They are tied together very closely. Not sure I follow. MIME:: can't have external dependencies? > > Actually, I looked into a UTF8 MIMEtools a few years back to overcome character set problems when storing header data into a postgres database. I thought that everything the MIME:: functions should return would be in Perl utf8, any character set information already decoded. Anything the functions get passed into is Perl internal utf-8 as well. I think one would need to rewrite the whole framework anew. It seems a reasonable goal to have all strings be stored in internal format, except for functions that explicitly generate "on-the-wire" formatted strings. Could this also be controlled by either a global gating variable that controls a semantic, or by detecting if the caller has "utf8" loaded into this space? 
Of course, things get complicated when MIME:: is used by some intermediate module (Foo:: for instance) which uses utf8 but the user's main' doesn't? > > Encode is a tricky module by its own, perldoc Encode: > > [?] > > Some encodings modify the $str argument to return the characters NOT decoded. So you'd call Encode::decode($charset, "".$str) to enforce a copy - - but have the performance penalty. That seems to be an issue with Encode not having consistent semantics. That might be fixed separately. Encode is part of perl core, isn't it? > > [?] > > I pass all strings through a function now, that looks terrible, but since then Web, Postgres, LDAP and text files play together. And would fixing the semantics of Encode to be more uniform fix that, or is this an orthogonal issue? -Philip
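For reference, a minimal sketch of the difference between utf8::upgrade() and Encode::decode() that this thread turns on. It uses only core modules (Encode and MIME::QuotedPrint) and leaves MIME::Entity out, so the final encode step is an assumption standing in for whatever serialisation MIME-tools performs internally; the variable names are illustrative:

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use MIME::QuotedPrint qw(decode_qp encode_qp);

# decode_qp() returns raw octets: here the three UTF-8 bytes E2 80 A6.
my $octets = decode_qp("Ellipsis=E2=80=A6\n");

# utf8::upgrade() only changes Perl's internal representation; each
# octet still counts as one Latin-1 character (U+00E2, U+0080, U+00A6).
my $upgraded = $octets;
utf8::upgrade($upgraded);

# Encode::decode() interprets the octets as UTF-8 and yields a single
# character, U+2026 HORIZONTAL ELLIPSIS.
my $decoded = decode('UTF-8', $octets);

printf "upgraded: %d characters\n", length $upgraded;   # 12
printf "decoded:  %d characters\n", length $decoded;    # 10

# Serialising each string back to UTF-8 shows the double encoding.
my $wire_upgraded = encode_qp(encode('UTF-8', $upgraded));
my $wire_decoded  = encode_qp(encode('UTF-8', $decoded));
print "upgraded on the wire: $wire_upgraded";  # Ellipsis=C3=A2=C2=80=C2=A6
print "decoded on the wire:  $wire_decoded";   # Ellipsis=E2=80=A6
```

The first wire form matches the output of the original test.pl, and the second matches the output reported after switching to Encode::decode().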