- If you want to handle Unicode and avoid The Unicode Bug, in which your strings sometimes act like they aren't actually Unicode: in perl 5.12+,
use feature 'unicode_strings';
For older perl, see Unicode::Semantics, or use utf8::upgrade by hand. The latter two achieve their task by forcing "the UTF-8 flag" on for the string; the feature instead applies Unicode rules no matter how the string is stored.
- If you want strings in your source text with non-ASCII characters: save it as a UTF-8 encoded file and
use utf8;
Or you can encode Unicode code points with hex escapes: \xae → ®, or \x{30ab} → カ. There are technically other options, which have additional drawbacks (UTF-16 breaks the #! line; latin-1 is restricted to latin-1 unless you decode it yourself).
- If you want to print to a UTF-8 aware environment like your terminal emulator or CGI STDOUT after issuing a
Content-Type: text/html; charset=utf-8
header: setting UTF-8 on the filehandle with binmode(STDOUT, ':utf8') is the minimum, but :encoding(utf-8) instead of :utf8 makes stricter guarantees that real code points are coming out.
- If you want to read a UTF-16 encoded document into a Unicode string with minimal fuss:
open(FH, '< :encoding(utf-16)', $name)
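Note that the document has to be correctly encoded. You can use the Encode module's decode function if you need finer control over error behavior, but that's naturally more fuss. Slurp the file and decode it in one go, since raw UTF-16 can't be split into lines reliably and only the start of the file carries the BOM:
use Encode; open(my $fh, '<:raw', $name) or die "$name: $!"; my $text = decode('UTF-16', do { local $/; <$fh> }, $POLICY); ...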
- If you want to convert a Unicode string to a specific set of bytes for some encoding-unaware module to throw on the wire, use the encode function from the Encode module:
use Encode; $message->attr('content-type.charset', 'utf-16'); $message->data(encode("UTF-16", $body));
(This example would be for MIME::Lite, if you're curious.)
- If you want to read a file encoded with charset X into a string encoded with charset Y, I've found no instant way to do this. It's probably best to pass the input encoding along as the output encoding if at all possible. But if you can't, you might find the Encode module's from_to(), or string-IO as in IO::File->new(\$out, '>:'), or maybe a whole PerlIO filter as in PerlIO::code helpful.
- If you see "Wide character in ..." warnings, then you passed a string with code points >= 0x100 to something that expected a byte string of some sort: either really latin-1, or an encoded string.
- If you see longer strings of gibberish where you expected sensible non-ASCII characters, then you have probably double-encoded, either literally, or by printing an encoded string to a filehandle which does encoding.
- If you see the Unicode replacement character in a stream that should be UTF-8, you haven't encoded at all, such as printing a byte string on a raw filehandle in an environment expecting UTF-8. Most likely, the filehandle should have an encoding set on it, per point #3 above (the binmode one), though that may cause the double-encoding of point #8 on other strings you've printed, if they were already encoded.
- If you are using modules, they each may or may not deal with Unicode. DBD::mysql has the mysql_enable_utf8 option; Email::MIME accepts encoded strings via body, and decoded ones through body_str, but for the latter, you must also set the charset and encoding attributes (which correspond to the charset of Content-Type, and the Content-Transfer-Encoding, respectively). MIME::Lite does not handle decoded strings at all and hopes for the best.
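
The double-encoding symptom from the list above is easy to reproduce on a small string. This is only a sketch using the Encode module's encode and decode functions; the variable names are made up for illustration and don't come from any real program.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A decoded (character) string: "café", with é written as a hex escape.
my $text = "caf\x{e9}";

# Encode once: é becomes the two bytes C3 A9, so 4 characters become 5 bytes.
my $once = encode('UTF-8', $text);

# Encode again by mistake: each of those bytes is treated as a character
# and encoded separately, which gives the "longer gibberish" symptom.
my $twice = encode('UTF-8', $once);

# Decoding reverses exactly one layer of encoding.
my $round_trip = decode('UTF-8', $once);   # back to the original 4 characters
my $gibberish  = decode('UTF-8', $twice);  # 5 characters: "caf" plus Ã and ©

printf "%d chars, %d bytes, %d bytes when double-encoded\n",
    length($text), length($once), length($twice);
```

Run as-is, this should report 4 characters, 5 bytes, and 7 bytes. If a string in your program grows like that every time it passes through a layer, something is encoding it twice.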
Wednesday, January 18, 2012
Perl and Unicode in Brief
Perl requires a knob for every I/O, and expects you to set them all correctly yourself. By default, they're all off (Unicode-unaware) for backwards compatibility.
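
To make the "knob for every I/O" idea concrete, here is a minimal sketch that sets one encoding on the input handle and a different one on the output handle, which is also one way to get from charset X to charset Y. It assumes in-memory filehandles as stand-ins for real files so that it runs on its own; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Stand-in for a UTF-16 file on disk: the bytes of "カ\n", BOM included.
my $utf16_bytes = encode('UTF-16', "\x{30ab}\n");

# Input knob: decode UTF-16 while reading.
open(my $in, '<:encoding(UTF-16)', \$utf16_bytes)
    or die "open input: $!";

# Output knob: encode as UTF-8 while writing.
my $utf8_bytes = '';
open(my $out, '>:encoding(UTF-8)', \$utf8_bytes)
    or die "open output: $!";

while (my $line = <$in>) {
    # Between the two handles, $line is a plain character string.
    print {$out} $line;
}
close $in;
close $out or die "close output: $!";

printf "%d bytes in, %d bytes out\n",
    length($utf16_bytes), length($utf8_bytes);
```

With real files you would pass filenames instead of scalar references; the layers in the open modes stay the same.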