[Mimedefang] unicode problem in mimedefang

Tue Feb 5 08:25:21 EST 2008

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Fri, 1 Feb 2008, Jan-Pieter Cornet wrote:

>> I write the value of $fname to a file.
>> It works fine,
>> but if the filename is in Japanese, garbage characters are written to
>> the file. Does mimedefang not support unicode?

Well, what kind of "garbage" is it? Maybe the garbage is no garbage at 
all for a human displaying the stuff in another character set, or for a 
computer applying the right decoding to it.

> Nope, mimedefang does nothing with unicode... because perl handles all
> that.

Well, that's not that easy, because:

a) Perl must know which character set the string is encoded in.
b) When writing to a file, perl does not default to unicode, but to a 
"local" character set, most probably Latin1.
c) When reading a file, the same happens.
d) When a string is internally in utf8, this does not mean it is real 
unicode, because the bytes of another charset may have been interpreted 
as Latin1 and then converted into perl-internal utf8.
(Well, this happens transparently and works only for clean systems.)
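
A minimal sketch of point d), using only the core Encode module. The 
byte string is made up for illustration (the UTF-8 encoding of two 
Japanese characters); it is not taken from the original report.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes for U+65E5 U+672C, as they might arrive in a mail.
my $bytes = "\xE6\x97\xA5\xE6\x9C\xAC";

# Correct: tell perl which character set the bytes are in.
my $right = decode('UTF-8', $bytes);

# Wrong: the same bytes taken as Latin1, one character per byte.
my $wrong = decode('iso-8859-1', $bytes);

printf "characters: right=%d wrong=%d\n", length $right, length $wrong;

# Writing $wrong through a UTF-8 layer encodes every byte a second
# time, which is the typical "garbage" seen in the output file.
printf "bytes after encoding: right=%d wrong=%d\n",
    length(encode('UTF-8', $right)), length(encode('UTF-8', $wrong));
```

Both strings are perfectly valid perl strings; only the first one holds 
the text that was meant.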

The first step is to get the whole system perl runs under to use utf8 
correctly, e.g. LC_ALL=en_US.UTF-8.
Actually, I'm not sure whether this is all it takes to make PerlIO 
default to utf8, so I always enforce it explicitly:

open($fh, '<:utf8', ..)
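
A small round-trip sketch of the same idea for both writing and reading. 
It is an illustration only: the temporary file and the sample name are 
assumptions, and it uses the stricter ':encoding(UTF-8)' layer instead of 
':utf8', because that layer also validates the data.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A character string (not bytes): two Japanese characters plus ".txt".
my $fname = "\x{65E5}\x{672C}.txt";

# Hypothetical scratch file, removed again at exit.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Writing: name the encoding instead of relying on the default.
open(my $out, '>:encoding(UTF-8)', $path) or die "cannot write $path: $!";
print {$out} $fname, "\n";
close $out or die "cannot close $path: $!";

# Reading: the same layer turns the bytes back into characters.
open(my $in, '<:encoding(UTF-8)', $path) or die "cannot read $path: $!";
chomp(my $back = <$in>);
close $in;

print "round trip ok\n" if $back eq $fname;
```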

For a) I'm not sure how mimetools handles filenames in alien character 
sets at all. The spec says that headers are in plain ASCII. So either the 
mail is malformed and uses 8bit characters, or it encodes the filename. 
The latter can be handled with use MIME::Words qw(decode_mimewords); 
and then decoding all parts, e.g.:

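use MIME::Words qw(decode_mimewords);

sub _unmime ($) {
	my $l = shift;
	return undef unless defined $l;
	# Each element of @h is [DATA, CHARSET]; CHARSET is undef
	# for parts that were not MIME-encoded
	my @h = decode_mimewords($l);
	return $l if $@;	# On error, return original string
	# Now encode to UTF8
	my $str = '';
	for (@h) {
		$str .= str2utf8($_->[1], $_->[0]);
	}
	return $str;
}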
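
To show what such an encoded filename looks like, here is a sketch that 
does not use MIME::Words at all, but the 'MIME-Header' encoding that 
ships with the core Encode module. This is a swapped-in alternative, not 
the method above, and the sample header value is made up.

```perl
use strict;
use warnings;
use Encode qw(decode);

# An RFC 2047 encoded word: charset UTF-8, base64 payload.
my $raw = '=?UTF-8?B?5pel5pysLnR4dA==?=';

# 'MIME-Header' decodes the encoded word and returns a character string.
my $name = decode('MIME-Header', $raw);

printf "decoded name has %d characters\n", length $name;
```

The result is a character string, so it still has to be written through 
an encoding layer to end up as proper UTF-8 in a file.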

The tricky part is str2utf8(), which takes some careful work with 
Encode::decode():

a) sometimes the encoding $_->[1] does not work at all, because the Encode 
module has another idea about the character set, e.g. I often receive 
stuff MIME-encoded in charset 'iso-2022-jp', which Encode cannot process 
under that name, but only as 'shiftjis'.

b) Encode::decode() does not transform Latin1 into internal utf8, but 
keeps its hands off. This might not cause trouble when writing the info 
into a file, because PerlIO handles this translation on-the-fly, but I put 
some mail data into a postgres DB, which bails out on the non-Unicode 
characters in it.

c) Encode::decode() does not reliably report whether all the characters 
could be decoded with the given charset.
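
One way str2utf8() could deal with a) to c) is sketched below. This is 
not the real function, only an illustration under stated assumptions: the 
(charset, data) argument order is taken from the call in _unmime(), the 
%fallback table is hypothetical, and unlabelled parts are assumed to be 
plain ASCII.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical table for a): charset label => other names to try.
my %fallback = ('iso-2022-jp' => ['shiftjis']);

sub str2utf8 {
    my ($charset, $data) = @_;
    return undef unless defined $data;
    # Assumption: parts without a charset label are plain ASCII.
    $charset = 'us-ascii' unless defined $charset;

    for my $name ($charset, @{ $fallback{lc $charset} || [] }) {
        my $enc = find_encoding($name) or next;   # unknown to Encode
        # For c): FB_CROAK makes decode() die on input that does not
        # fit the charset. Work on a copy, because decode() may
        # modify its argument when a CHECK value is given.
        my $copy = $data;
        my $str = eval { $enc->decode($copy, Encode::FB_CROAK) };
        next unless defined $str;
        # For b): force the internal utf8 representation.
        utf8::upgrade($str);
        return $str;
    }
    return $data;   # nothing worked: hand back the original string
}

printf "%d\n", length str2utf8('UTF-8', "\xE6\x97\xA5\xE6\x9C\xAC");
```

Returning the original string when nothing fits mirrors the error 
handling in _unmime(); a caller feeding a database may prefer to reject 
such input instead.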

Bye,

- -- 
Steffen Kaiser
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFHqGPD5ThHZhj8SBwRAtrmAJ9491VDrvTinkaEZ3br0PbFv4UJOQCeO1V/
IInWCWDIPv1ue0yvI0LzADM=
=0kSU
-----END PGP SIGNATURE-----