The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.

NAME

ODF::lpOD_Helper::Unicode - Unicode and ODF::lpOD (and XML::Twig)

SYNOPSIS

  use ODF::lpOD;
  use ODF::lpOD_Helper;
  use feature 'unicode_strings';

INTRODUCTION

We once thought Unicode forced us to fiddle with bytes to handle "international" characters. That thinking came from low-level languages like C.

Perl saved us but it took years before everyone believed. And more years before Perl's Unicode paradigm clearly emerged.

Meanwhile lots of pod and code was written which in hindsight was confused or misleading.

THE PERL UNICODE PARADIGM

  1. "Decode" input from binary into Perl characters as soon as possible
     after getting it from the outside world (e.g. from a disk file).

  2. As much of the application as possible works with Perl characters,
     paying absolutely no attention to encoding.

  3. "Encode" Perl characters into binary data as late as possible,
     just before sending the data out.

See "Not always so tidy" below for more discussion.

ODF::lpOD

For historical reasons ODF::lpOD by default is incompatible with the above paradigm because every method encodes result strings into UTF-8 before returning them to you, and attempts to decode strings you pass in before using them. Therefore you must work with binary octets rather than abstract characters; Regex match, substr(), length(), and comparison with string literals do not work with non-ASCII/latin1 characters. Also, you can't print such already-encoded binary to STDOUT if that file handle auto-encodes because the data will be encoded twice.

use ODF::lpOD_Helper disables ODF::lpOD's internal encoding and decoding. Methods then speak and listen in characters, not octets.

You should also use feature 'unicode_strings'; to avoid problematic aspects of legacy Perl behavior.

IF LEGACY BEHAVIOR IS NEEDED

The import tag :bytes will tell ODF::lpOD_Helper to not disable internal encoding & decoding, i.e. retain the original ODF::lpOD behavior.

It is also possible to toggle between the old and new behavior at run-time: lpod->Huse_octet_strings() will will re-enable implicit decoding/encoding of method arguments (using UTF-8 encoding by default) and lpod->Huse_character_strings() will disable the old behavior and restore transparent Unicode support.

Prior to version 6.000 character mode was not the default, but required a :chars import tag. That tag is now deprecated and produces a warning.

XML::Twig & XML::Parser

XML::Twig's docs wrongly say that it stores strings in memory encoded as UTF-8, implying that a decode operation would be needed to obtain characters; in fact XML::Twig stores strings as characters (the result of the decode done by parse()), and most methods return those Perl characters, not encoded binary.

An exception is XML::Twig's print() and sprint which encode their results. The encoding used is the same as was found when parsing the input file unless you change it, and the generated XML header specifies the encoding used. This makes sense when printing directly to a disk file (via a non-encoding file handle), but will corrupt non-ASCII characters printed to STDOUT if STDOUT is set to auto-encode, as is often the case.

Going the other way, XML::Twig's parse() method expects binary octets as input and uses the "encoding=..." specification in the XML header to know how to decode the remainder of the file.

Not always so tidy

Perl's data model now has two kinds of strings: binary octets, which might or might not be some representation of Unicode characters, and abstract characters which are opaque atoms. In most cases these two kinds should never be combined, see perlunifaq.

Decoding means converting possibly-multiple binary octets into abstract characters, and encoding does the reverse.

Obviously Perl somehow represents abstract characters internally, and it often (not always) uses a variation of UTF-8; as a result, "decoding" or "encoding" between abstract characters and UTF-8 octet streams can be very fast.

Non-Unicode-aware scripts usually "just work" as long as they never see characters outside the ASCII range. This is because, without a decode operation or literal strings with "wide" characters, Perl treats all strings as binary octets; the behaviour of binary strings is usually correct when they hold characters with single-byte representations.

The "wide character" warning

If a non-ASCII/latin1 character gets into a Perl string which is written to a non-encoding file handle, Perl issues a warning because that should never happen. It means the program is both Unicode-unaware and uses characters which require Unicode awareness to work correctly. See perlunifaq.

Unicode-aware programs always encode character strings before writing to disk either explicitly or by using an encoding file handle.

Non-unicode-aware programs (which neither decode or encode), will "just work" with ASCII data as described earlier.

SEE ALSO

Encode, Encode::Locale, PerlIO::locale, perlunifaq

(END)