Hi,

I've just spent quite a while tracking down a problem with a web page generated by a mod_perl program in which 8-bit ISO-8859-1 characters were not being shown properly. The software runs via Apache::Registry, and works fine under mod_cgi.

It turns out that the problem is due to a difference in behaviour between Perl's built-in print() function in Perl 5.8.0+ and the Apache->print() method that mod_perl overrides it with. I've consulted the documentation on the mod_perl website, and could find no mention of the difference. If my conclusions below are correct then this information may well be worth adding.

Under Perl 5.8.0, if a string stored in Perl's internal UTF-8 format is passed to print() then by default it will be converted to the machine's native 8-bit character set on output to STDOUT. In my case, this is exactly as if I had called binmode(STDOUT, ':encoding(iso-8859-1)') before the print(). (If any characters in the UTF-8 string are not representable in ISO-8859-1 then a "Wide character in print" warning will be emitted, and the string will be output as its raw UTF-8 bytes instead.)

However, mod_perl's Apache->print() method does not perform this automatic conversion. It simply prints the bytes that make up each UTF-8 character (i.e. it outputs the UTF-8 string as UTF-8), exactly as if you had called binmode(STDOUT, ':utf8') before Apache->print(). (No "Wide character in print" warnings are produced for characters with code points > 0xFF either.)
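
The byte-level difference described above can be reproduced without a web server. The sketch below is not mod_perl code: it prints the same character string to two in-memory filehandles, one with no layer (Perl's default behaviour) and one with a ':utf8' layer (standing in for what Apache->print() is described as emitting). The helper name bytes_printed() is made up for this illustration.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: print $text to an in-memory filehandle opened with
# the given mode, and return the bytes that were written as a hex string.
sub bytes_printed {
    my ($mode, $text) = @_;
    my $buf = '';
    open(my $fh, $mode, \$buf) or die "Cannot open in-memory handle: $!";
    print {$fh} $text;
    close($fh) or die "Cannot close in-memory handle: $!";
    return join ' ', map { sprintf '%02X', ord($_) } split //, $buf;
}

# A character string held in Perl's internal UTF-8 format.
my $str = Encode::decode('iso-8859-1', "Zur\xFCck");

my $default = bytes_printed('>',      $str);   # no layer on the handle
my $utf8    = bytes_printed('>:utf8', $str);   # ':utf8' layer on the handle

print "no layer: $default\n";   # 5A 75 72 FC 63 6B
print ":utf8:    $utf8\n";      # 5A 75 72 C3 BC 63 6B
```

With no layer the ü leaves Perl as the single byte FC; with the ':utf8' layer it leaves as the two bytes C3 BC.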

The test program below illustrates this difference:

   use 5.008;
   use strict;
   use warnings;
   use Encode;

   my $cset = 'ISO-8859-1';
   #my $cset = 'UTF-8';

   print "Content-type: text/html; charset=$cset\n\n";
   print <<EOT;
   <html>
   <head>
   <meta http-equiv="Content-type" content="text/html; charset=$cset">
   </head>
   <body>
   EOT

   # $str is stored in Perl's internal UTF-8 format.
   # (This assumes the script file itself is saved as ISO-8859-1.)
   my $str = Encode::decode('iso-8859-1', 'Zurück');
   print "<p>$str</p>\n";

   print <<EOT;
   </body>
   </html>
   EOT

Running under mod_cgi (using Perl's built-in print() function), the UTF-8 encoded data in $str is converted to ISO-8859-1 on the fly by print(), and the end-user will see the intended output when $cset is ISO-8859-1. Changing $cset to UTF-8 causes the ü to be replaced with ? in my web browser, because the output is then labelled as UTF-8 but the single byte that is output for ü is not a valid UTF-8 sequence.

Running under mod_perl (with Perl's built-in print() function now overridden by the Apache->print() method) the UTF-8 encoded data in $str is NOT converted to ISO-8859-1 on-the-fly as it is printed, and the end-user will see the two bytes that make up the UTF-8 representation of ü when $cset is ISO-8859-1. Changing $cset to UTF-8 in this case "fixes" it, because the output stream in this case happens to be valid UTF-8 all the way through.
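
The four combinations described above (two byte streams, two declared charsets) can be checked by decoding the bytes the way a client honouring the declared charset would. This is only a sketch: Encode::decode stands in for the browser, the byte strings are written out by hand, and the helper name code_points() is made up for this illustration.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: show a character string as a list of code points.
sub code_points {
    my ($chars) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $chars;
}

my $latin1_bytes = "Zur\xFCck";       # ü sent as one byte (the mod_cgi case)
my $utf8_bytes   = "Zur\xC3\xBCck";   # ü sent as two bytes (the mod_perl case)

# Decode each byte stream under each declared charset.
my $latin1_as_latin1 = Encode::decode('iso-8859-1', $latin1_bytes);
my $latin1_as_utf8   = Encode::decode('UTF-8',      $latin1_bytes);
my $utf8_as_latin1   = Encode::decode('iso-8859-1', $utf8_bytes);
my $utf8_as_utf8     = Encode::decode('UTF-8',      $utf8_bytes);

print "Latin-1 bytes read as ISO-8859-1: ", code_points($latin1_as_latin1), "\n";
print "Latin-1 bytes read as UTF-8:      ", code_points($latin1_as_utf8),   "\n";
print "UTF-8 bytes read as ISO-8859-1:   ", code_points($utf8_as_latin1),   "\n";
print "UTF-8 bytes read as UTF-8:        ", code_points($utf8_as_utf8),     "\n";
```

Only the matching combinations yield U+00FC. UTF-8 bytes read as ISO-8859-1 give the two characters U+00C3 U+00BC, and the lone byte FC read as UTF-8 is malformed, so Encode substitutes U+FFFD, which is presumably what the browser shows as ?.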

There are two solutions to this:

1. To use $cset = 'ISO-8859-1': Explicitly convert the UTF-8 data in $str to ISO-8859-1 yourself before sending it to print(), rather than relying on print() to do that for you. This is not possible in general (not all characters in the UTF-8 string may be representable in ISO-8859-1), but for HTML output we can arrange for Encode::encode to convert any non-representable characters to their HTML character references:

   $str = Encode::encode('iso-8859-1', $str, Encode::FB_HTMLCREF);

2. To use $cset = 'UTF-8': Output UTF-8 directly, ensuring that *all* outgoing data is UTF-8 by adding an appropriate layer on STDOUT:

   binmode STDOUT, ':utf8';
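
To make the two options concrete, here is a small sketch that applies both to one string containing ü (representable in ISO-8859-1) and a euro sign (not representable). For the second option it encodes by hand with Encode::encode('UTF-8', ...) instead of adding a layer, which is a substitute used here only so that the resulting bytes can be inspected. The helper name visible() and the sample string are made up for this illustration.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: make non-ASCII bytes visible as \xNN escapes.
sub visible {
    my ($bytes) = @_;
    $bytes =~ s/([^\x20-\x7E])/sprintf('\\x%02X', ord($1))/ge;
    return $bytes;
}

# Character string: ü is U+00FC, the euro sign is U+20AC.
my $str = "Zur\x{FC}ck \x{20AC}";

# Option 1: encode to ISO-8859-1; characters that do not fit become
# HTML numeric character references.
my $latin1 = Encode::encode('iso-8859-1', $str, Encode::FB_HTMLCREF);

# Option 2, done by hand: encode the whole string to UTF-8 bytes.
my $utf8 = Encode::encode('UTF-8', $str);

print "ISO-8859-1: ", visible($latin1), "\n";   # Zur\xFCck &#8364;
print "UTF-8:      ", visible($utf8),   "\n";   # Zur\xC3\xBCck \xE2\x82\xAC
```

Either byte string is safe to hand to a print() that does no conversion of its own, as long as the charset declared in the headers matches the encoding that was used.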

The second method here is generally to be preferred, but in the old software that I was experiencing problems with, I was not able to add the utf8 layer to STDOUT reliably (the data was being output from a multitude of print() statements scattered in various places), so I stuck with the first method. I believed that it should work without the explicit encoding to ISO-8859-1 because I was unaware that mod_perl's print() override removed Perl's implicit encoding behaviour. Actually, the explicit encoding above is better anyway because it also handles characters that can't be encoded to ISO-8859-1, but nevertheless I think the difference in mod_perl's print() is still worth mentioning in the documentation somewhere.

Cheers,

Steve


