UTF8::R2 - makes UTF-8 scripting easy for enterprise use or LTS
  use UTF8::R2;
  use UTF8::R2 ver.sion;            # match or die
  use UTF8::R2 qw( RFC3629 );       # m/./ matches RFC3629 codepoint (default)
  use UTF8::R2 qw( RFC2279 );       # m/./ matches RFC2279 codepoint
  use UTF8::R2 qw( RFC3629.ja_JP ); # optimized RFC3629 for ja_JP
  use UTF8::R2 qw( %mb );           # multibyte regex by %mb

    $result = UTF8::R2::chop(@_)
    $result = UTF8::R2::chr($_)
    $result = UTF8::R2::getc(FILEHANDLE)
    $result = UTF8::R2::index($_, 'ABC', 5)
    $result = UTF8::R2::lc($_)
    $result = UTF8::R2::lcfirst($_)
    $result = UTF8::R2::length($_)
    $result = UTF8::R2::ord($_)
    $result = UTF8::R2::qr(qr/$utf8regex/imsxogc)
    @result = UTF8::R2::reverse(@_)
    $result = UTF8::R2::reverse(@_)
    $result = UTF8::R2::reverse()
    $result = UTF8::R2::rindex($_, 'ABC', 5)
    @result = UTF8::R2::split(qr/$utf8regex/, $_, 3)
    $result = UTF8::R2::substr($_, 0, 5)
    $result = UTF8::R2::tr($_, 'ABC', 'XYZ', 'cdsr')
    $result = UTF8::R2::uc($_)
    $result = UTF8::R2::ucfirst($_)

  use UTF8::R2 qw(%mb);
    $result = $_ =~ $mb{qr/$utf8regex/imsxogc}
    $result = $_ =~ s<$mb{qr/before/imsxo}><after>egr
Because this module overrides nothing, the built-in functions continue to provide octet semantics. UTF-8 codepoint semantics are provided by subroutines with new names.
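To make the difference concrete, here is a minimal sketch in core Perl only. The helper codepoint_length() is a hypothetical name invented for this illustration; it is not part of UTF8::R2 and is not the module's implementation. It only shows what "octet semantics by the traditional name, codepoint semantics by a new name" looks like on the same octet string.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of UTF8::R2): count UTF-8 codepoints in an
# octet string by counting every octet that is not a continuation octet
# (continuation octets are 0x80..0xBF). No validity checking is done.
sub codepoint_length {
    my ($octets) = @_;
    my $count = () = $octets =~ /[\x00-\x7F\xC0-\xFF]/g;
    return $count;
}

# Three Japanese hiragana written as raw UTF-8 octets (3 octets each),
# so this script needs no utf8 pragma and sets no UTF8 flag.
my $text = "\xE3\x81\x82\xE3\x81\x84\xE3\x81\x86";

printf "octets:     %d\n", length($text);            # built-in name, octet semantics: 9
printf "codepoints: %d\n", codepoint_length($text);  # new name, codepoint semantics: 3
```

The built-in length() is untouched and keeps answering in octets; the codepoint answer comes only from the subroutine with the new name.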
  ----------------------------------------------------------------------------------------------------------------------------------
  Octet Semantics          UTF-8 Codepoint Semantics
  by traditional name      by new name                                      Note and Limitations
  ----------------------------------------------------------------------------------------------------------------------------------
  chop                     UTF8::R2::chop(@_)                               usually chomp() is useful
  ----------------------------------------------------------------------------------------------------------------------------------
  chr                      UTF8::R2::chr($_)                                returns UTF-8 codepoint octets from a UTF-8 number
                                                                            (not from a Unicode number)
  ----------------------------------------------------------------------------------------------------------------------------------
  getc                     UTF8::R2::getc(FILEHANDLE)                       gets UTF-8 codepoint octets
  ----------------------------------------------------------------------------------------------------------------------------------
  index                    UTF8::R2::index($_, 'ABC', 5)                    index() is compatible and usually useful
  ----------------------------------------------------------------------------------------------------------------------------------
  lc                       UTF8::R2::lc($_)                                 works as tr/A-Z/a-z/, universally
  ----------------------------------------------------------------------------------------------------------------------------------
  lcfirst                  UTF8::R2::lcfirst($_)                            see UTF8::R2::lc()
  ----------------------------------------------------------------------------------------------------------------------------------
  length                   UTF8::R2::length($_)                             length() is compatible and usually useful
  ----------------------------------------------------------------------------------------------------------------------------------
  // or m// or qr//        UTF8::R2::qr(qr/$utf8regex/imsxogc)              does not support metasymbol \X that matches a grapheme
                           or                                               does not support POSIX character classes (like [:alpha:])
                           use UTF8::R2 qw(%mb);                            does not support named characters (such as
                           $mb{qr/$utf8regex/imsxogc}                         \N{GREEK SMALL LETTER EPSILON}, \N{greek:epsilon},
                                                                              or \N{epsilon})
                                                                            does not support character properties
                                                                              (like \p{PROP} and \P{PROP})

                                                                            Special Escapes in Regex    Support Perl Version
                                                                            --------------------------------------------------
                                                                            $mb{qr/ \x{Unicode} /}      since perl 5.006
                                                                            $mb{qr/ [^ ... ] /}         ** CAUTION ** perl 5.006
                                                                                                        cannot do this
                                                                            $mb{qr/ \h /}               since perl 5.010
                                                                            $mb{qr/ \v /}               since perl 5.010
                                                                            $mb{qr/ \H /}               since perl 5.010
                                                                            $mb{qr/ \V /}               since perl 5.010
                                                                            $mb{qr/ \R /}               since perl 5.010
                                                                            $mb{qr/ \N /}               since perl 5.012
  ----------------------------------------------------------------------------------------------------------------------------------
  ?? or m??                (nothing)
  ----------------------------------------------------------------------------------------------------------------------------------
  ord                      UTF8::R2::ord($_)                                returns the UTF-8 number (not the Unicode number)
                                                                            of UTF-8 codepoint octets
  ----------------------------------------------------------------------------------------------------------------------------------
  pos                      (nothing)
  ----------------------------------------------------------------------------------------------------------------------------------
  reverse                  UTF8::R2::reverse(@_)
  ----------------------------------------------------------------------------------------------------------------------------------
  rindex                   UTF8::R2::rindex($_, 'ABC', 5)                   rindex() is compatible and usually useful
  ----------------------------------------------------------------------------------------------------------------------------------
  s/before/after/imsxoegr  s<@{[UTF8::R2::qr(qr/before/imsxo)]}><after>egr
                           or
                           use UTF8::R2 qw(%mb);
                           s<$mb{qr/before/imsxo}><after>egr
  ----------------------------------------------------------------------------------------------------------------------------------
  split//                  UTF8::R2::split(qr/$utf8regex/imsxo, $_, 3)      *CAUTION* UTF8::R2::split(/re/,$_,3) means
                                                                            UTF8::R2::split($_ =~ /re/,$_,3)
  ----------------------------------------------------------------------------------------------------------------------------------
  sprintf                  (nothing)
  ----------------------------------------------------------------------------------------------------------------------------------
  substr                   UTF8::R2::substr($_, 0, 5)                       substr() is compatible and usually useful
                                                                            :lvalue feature needs perl 5.014 or later
  ----------------------------------------------------------------------------------------------------------------------------------
  tr/// or y///            UTF8::R2::tr($_, 'ABC', 'XYZ', 'cdsr')           does not support codepoint ranges (like "tr/A-Z/a-z/")
  ----------------------------------------------------------------------------------------------------------------------------------
  uc                       UTF8::R2::uc($_)                                 works as tr/a-z/A-Z/, universally
  ----------------------------------------------------------------------------------------------------------------------------------
  ucfirst                  UTF8::R2::ucfirst($_)                            see UTF8::R2::uc()
  ----------------------------------------------------------------------------------------------------------------------------------
  write                    (nothing)
  ----------------------------------------------------------------------------------------------------------------------------------
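The notes on chr and ord speak of a "UTF-8 number" as distinct from a Unicode number. The sketch below shows one reading of that term, stated here as an assumption: the octets of a codepoint read as a single big-endian integer. The names utf8_number() and utf8_octets() are hypothetical helpers written in core Perl for this illustration; they are not the module's subroutines and may differ from what UTF8::R2::ord() and UTF8::R2::chr() actually do.

```perl
use strict;
use warnings;

# Hypothetical sketch: read the octets of the first UTF-8 codepoint as one
# big-endian integer (assumed meaning of "UTF-8 number"). No validity checks.
sub utf8_number {
    my ($octets) = @_;
    my ($first) = $octets =~ /\A([\x00-\x7F\xC0-\xFF][\x80-\xBF]*)/
        or return 0;
    my $number = 0;
    $number = $number * 0x100 + $_ for unpack 'C*', $first;
    return $number;
}

# Hypothetical inverse: turn such an integer back into its octets.
sub utf8_octets {
    my ($number) = @_;
    my @octets;
    do {
        unshift @octets, $number % 0x100;
        $number = int($number / 0x100);
    } while $number > 0;
    return pack 'C*', @octets;
}

my $hiragana_a = "\xE3\x81\x82";   # U+3042 as raw UTF-8 octets

printf "UTF-8 number: 0x%X\n", utf8_number($hiragana_a);   # 0xE38182
printf "round trip:   %s\n",
    utf8_octets(0xE38182) eq $hiragana_a ? 'same octets' : 'different';
```

Under this reading the Unicode number of the same character is 0x3042, while its UTF-8 number is 0xE38182; for ASCII the two coincide.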
See p. 401, Chapter 15: Unicode, of Programming Perl, Third Edition (ISBN 0-596-00027-8).
Before the introduction of Unicode support in perl, the eq operator just compared the byte-strings represented by two scalars. Beginning with perl 5.8, eq compares two byte-strings with simultaneous consideration of the UTF8 flag.
-- we have been taught so for a long time.
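A small core-Perl sketch of what that teaching means in practice, using only utf8::decode(), utf8::upgrade() and utf8::is_utf8(), which are always available. The variable names are illustrative only.

```perl
use strict;
use warnings;

# The same two octets, once left as an octet string and once decoded.
my $octets = "\xC3\xA9";       # e-acute as two UTF-8 octets, UTF8 flag off
my $chars  = $octets;
utf8::decode($chars);          # now one character, UTF8 flag on

print "flag on octets: ", (utf8::is_utf8($octets) ? 1 : 0), "\n";   # 0
print "flag on chars:  ", (utf8::is_utf8($chars)  ? 1 : 0), "\n";   # 1
print "octets eq chars: ", ($octets eq $chars ? 1 : 0), "\n";       # 0

# A Latin-1 string and its flagged copy: different internal bytes, still eq.
my $latin1   = "\xE9";
my $upgraded = $latin1;
utf8::upgrade($upgraded);      # same character, UTF8 flag on
print "latin1 eq upgraded: ", ($latin1 eq $upgraded ? 1 : 0), "\n"; # 1
```

So the outcome of eq depends on the flag as well as on the stored bytes, which is the model a program has to keep in mind from perl 5.8 onward.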
Perl is a powerful language for everyone, but the UTF8 flag is a barrier for ordinary beginners, because a person can only do one task at a time. So calling Encode::encode() and Encode::decode() in an application program is not the better way. Writing two separate scripts, one for information processing and one for encoding conversion, may be better. Please trust me.
  /*
   * You are not expected to understand this.
   */

  Information processing model beginning with perl 5.8

    +----------------------+---------------------+
    |     Text strings     |                     |
    +----------+-----------|    Binary strings   |
    |  UTF-8   |  Latin-1  |                     |
    +----------+-----------+---------------------+
    | UTF8     |            Not UTF8             |
    | Flagged  |            Flagged              |
    +--------------------------------------------+
    http://perl-users.jp/articles/advent-calendar/2010/casual/4

  The confusion in the Perl string model comes from the double meaning of
  "Binary string."
  The meanings of "Binary string" are
  1. Non-Text string
  2. Digital octet string

  Let's draw it again using those terms.

    +----------------------+---------------------+
    |     Text strings     |                     |
    +----------+-----------|   Non-Text strings  |
    |  UTF-8   |  Latin-1  |                     |
    +----------+-----------+---------------------+
    | UTF8     |            Not UTF8             |
    | Flagged  |            Flagged              |
    +--------------------------------------------+
    |            Digital octet string            |
    +--------------------------------------------+
There are people who don't agree with the change in the character string processing model of Perl 5.8. It is impossible to get the majority of Perl users, who hardly ever use Perl, to agree to it. To see how to solve this by returning to the original method, let's drag out page 402 of Programming Perl, 3rd ed. again.
  Information processing model beginning with perl3 or this software
  of UNIX/C-ism.

    +--------------------------------------------+
    |    Text string as Digital octet string     |
    |    Digital octet string as Text string     |
    +--------------------------------------------+
    |       Not UTF8 Flagged, No Mojibake        |
    +--------------------------------------------+

  In UNIX Everything is a File
  - In UNIX everything is a stream of bytes
  - In UNIX the filesystem is used as a universal name space

  Native Encoding Scripting
  - native encoding of file contents
  - native encoding of file name on filesystem
  - native encoding of command line
  - native encoding of environment variable
  - native encoding of API
  - native encoding of network packet
  - native encoding of database
Ideally, we'd like to achieve these five Goals:
Goal #1:
Old byte-oriented programs should not spontaneously break on the old byte-oriented data they used to work on.
This goal is achieved because this software is additional code for perl, like the utf8 pragma. If nothing is added, Perl should work the same as it always has.
Goal #2:
Old byte-oriented programs should magically start working on the new character-oriented data when appropriate.
Not "magically." You must decide case by case whether you want octet semantics or UTF-8 codepoint semantics, and write it yourself. Perhaps almost all regular expressions should have UTF-8 codepoint semantics, and everything else should have octet semantics.
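As an illustration of choosing the semantics yourself in a regular expression, here is a core-Perl sketch. The pattern $codepoint is an assumption made for this example: a simplified description of one UTF-8 codepoint that does not reject overlong forms or surrogates as RFC 3629 requires, and it is not the pattern UTF8::R2 uses internally.

```perl
use strict;
use warnings;

# Simplified pattern for one UTF-8 codepoint (illustrative assumption only;
# stricter checks required by RFC 3629 are left out).
my $codepoint = qr/[\x00-\x7F]
                  |[\xC2-\xDF][\x80-\xBF]
                  |[\xE0-\xEF][\x80-\xBF]{2}
                  |[\xF0-\xF4][\x80-\xBF]{3}/x;

my $text = "\xE3\x81\x82bc";   # one hiragana (3 octets) followed by "bc"

my ($first_octet)     = $text =~ /\A(.)/s;           # octet semantics
my ($first_codepoint) = $text =~ /\A($codepoint)/;   # codepoint semantics

printf "octet semantics:     %d octet(s)\n", length $first_octet;       # 1
printf "codepoint semantics: %d octet(s)\n", length $first_codepoint;   # 3
```

The plain dot takes one octet and cuts the character apart; the explicit codepoint pattern takes the whole character. Which one is right depends on the job, and the script says so in its own text.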
Goal #3:
Programs should run just as fast in the new character-oriented mode as in the old byte-oriented mode.
This is almost possible, because UTF-8 encoding doesn't need multibyte anchoring in regular expressions.
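A short core-Perl sketch of why anchoring is not needed. It compares the katakana letter "so" (U+30BD) in Shift_JIS and in UTF-8, both written as raw octets; Shift_JIS is used here only as a contrast and is not something the module handles.

```perl
use strict;
use warnings;

my $sjis = "\x83\x5C";       # katakana "so" in Shift_JIS; trail octet 0x5C is ASCII "\"
my $utf8 = "\xE3\x82\xBD";   # the same letter in UTF-8; every octet is 0x80 or above

# Search each octet string for an ASCII backslash.
my $sjis_hit = index($sjis, "\\");
my $utf8_hit = index($utf8, "\\");

print "Shift_JIS: $sjis_hit\n";   # 1, a false hit inside the character
print "UTF-8:     $utf8_hit\n";   # -1, no false hit
```

In UTF-8, lead octets and continuation octets come from disjoint ranges and neither overlaps ASCII, so a match for a whole codepoint cannot begin in the middle of another one. An encoding like Shift_JIS needs extra anchoring in every pattern to avoid hits such as the one above.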
Goal #4:
Perl should remain one language, rather than forking into a byte-oriented Perl and a character-oriented Perl.
With the UTF8::R2 module, Perl remains one language and one interpreter, because codepoint semantics are provided as subroutines.
Goal #5:
UTF8::R2 module users will be able to maintain it in Perl.
May the UTF8::R2 be with you, always.
Back when Programming Perl, 3rd ed. was written, the UTF8 flag had not been born, and Perl was designed to make the easy jobs easy. This software provides a programming environment like the one at that time.
Some computer scientists (the reductionists, in particular) would like to deny it, but people have funny-shaped minds. Mental geography is not linear, and cannot be mapped onto a flat surface without severe distortion. But for the last score years or so, computer reductionists have been first bowing down at the Temple of Orthogonality, then rising up to preach their ideas of ascetic rectitude to any who would listen. Their fervent but misguided desire was simply to squash your mind to fit their mindset, to smush your patterns of thought into some sort of Hyperdimensional Flatland. It's a joyless existence, being smushed.
--- Learning Perl on Win32 Systems

If you think this is a big headache, you're right. No one likes this situation, but Perl does the best it can with the input and encodings it has to deal with. If only we could reset history and not make so many mistakes next time.
--- Learning Perl 6th Edition

The most important thing for most people to know about handling Unicode data in Perl, however, is that if you don't ever use any Unicode data -- if none of your files are marked as UTF-8 and you don't use UTF-8 locales -- then you can happily pretend that you're back in Perl 5.005_03 land; the Unicode features will in no way interfere with your code unless you're explicitly using them. Sometimes the twin goals of embracing Unicode but not disturbing old-style byte-oriented scripts has led to compromise and confusion, but it's the Perl way to silently do the right thing, which is what Perl ends up doing.
--- Advanced Perl Programming, 2nd Edition
INABA Hitoshi <ina@cpan.org>
This project was originated by INABA Hitoshi.
This software is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See perlartistic.
This software is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
To install UTF8::R2, copy and paste the appropriate command into your terminal.
cpanm
cpanm UTF8::R2
CPAN shell
perl -MCPAN -e shell
install UTF8::R2