
NAME

UTF8::R2 - makes UTF-8 scripting easy for enterprise use or LTS

SYNOPSIS

  use UTF8::R2;
  use UTF8::R2 ver.sion;            # match or die
  use UTF8::R2 qw( RFC3629 );       # m/./ matches RFC3629 codepoint (default)
  use UTF8::R2 qw( RFC2279 );       # m/./ matches RFC2279 codepoint
  use UTF8::R2 qw( WTF8 );          # m/./ matches WTF-8 codepoint
  use UTF8::R2 qw( RFC3629.ja_JP ); # optimized RFC3629 for ja_JP
  use UTF8::R2 qw( WTF8.ja_JP );    # optimized WTF-8 for ja_JP
  use UTF8::R2 qw( %mb );           # multibyte regex by %mb

    $result = UTF8::R2::chop(@_)
    $result = UTF8::R2::chr($_)
    $result = UTF8::R2::getc(FILEHANDLE)
    $result = UTF8::R2::index($_, 'ABC', 5)
    $result = UTF8::R2::lc($_)
    $result = UTF8::R2::lcfirst($_)
    $result = UTF8::R2::length($_)
    $result = UTF8::R2::ord($_)
    $result = UTF8::R2::qr(qr/$utf8regex/imsxogc)
    @result = UTF8::R2::reverse(@_)
    $result = UTF8::R2::reverse(@_)
    $result = UTF8::R2::reverse()
    $result = UTF8::R2::rindex($_, 'ABC', 5)
    @result = UTF8::R2::split(qr/$utf8regex/, $_, 3)
    $result = UTF8::R2::substr($_, 0, 5)
    $result = UTF8::R2::tr($_, 'A-C', 'X-Z', 'cdsr')
    $result = UTF8::R2::uc($_)
    $result = UTF8::R2::ucfirst($_)

    use UTF8::R2 qw(%mb);
    $result = $_ =~ $mb{qr/$utf8regex/imsxogc}
    $result = $_ =~ s<$mb{qr/before/imsxo}><after>egr

OCTET SEMANTICS FUNCTIONS VS. CODEPOINT SEMANTICS SUBROUTINES

Because this module overrides nothing, Perl's built-in functions continue to provide octet semantics. UTF-8 codepoint semantics are provided by subroutines with new names.
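A minimal sketch of the difference, using only core Perl so that it runs without the module installed. The sample byte string and the lead-octet counting trick are illustrative assumptions, not code taken from UTF8::R2; with the module loaded, UTF8::R2::length($text) is the codepoint-semantics name listed in the table below.

```perl
use strict;
use warnings;

# "あいう" written as raw UTF-8 octets, so the script does not depend on
# the encoding of its own source file.
my $text = "\xE3\x81\x82\xE3\x81\x84\xE3\x81\x86";

# The built-in length() keeps octet semantics: it counts bytes.
my $octets = length $text;

# Illustrative stand-in for codepoint semantics: count every octet that
# is not a UTF-8 continuation octet (0x80-0xBF), i.e. one per character.
my $codepoints = () = $text =~ /[^\x80-\xBF]/g;

print "octets=$octets codepoints=$codepoints\n";   # octets=9 codepoints=3
```

The built-in keeps its traditional meaning, and the codepoint count has to be asked for explicitly under a different name.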

  ------------------------------------------------------------------------------------------------------------------------------------------
  Octet Semantics         UTF-8 Codepoint Semantics
  by traditional name     by new name                                Note and Limitations
  ------------------------------------------------------------------------------------------------------------------------------------------
  chop                    UTF8::R2::chop(@_)                         usually chomp() is useful
  ------------------------------------------------------------------------------------------------------------------------------------------
  chr                     UTF8::R2::chr($_)                          returns the octets of a UTF-8 codepoint for a given UTF-8 number (not a Unicode number)
  ------------------------------------------------------------------------------------------------------------------------------------------
  getc                    UTF8::R2::getc(FILEHANDLE)                 get UTF-8 codepoint octets
  ------------------------------------------------------------------------------------------------------------------------------------------
  index                   UTF8::R2::index($_, 'ABC', 5)              index() is compatible and usually useful
  ------------------------------------------------------------------------------------------------------------------------------------------
  lc                      UTF8::R2::lc($_)                           works as tr/A-Z/a-z/, universally
  ------------------------------------------------------------------------------------------------------------------------------------------
  lcfirst                 UTF8::R2::lcfirst($_)                      see UTF8::R2::lc()
  ------------------------------------------------------------------------------------------------------------------------------------------
  length                  UTF8::R2::length($_)                       length() is compatible and usually useful
  ------------------------------------------------------------------------------------------------------------------------------------------
  // or m// or qr//       UTF8::R2::qr(qr/$utf8regex/imsxogc)        does not support the metasymbol \X that matches a grapheme
                            or                                       does not support POSIX character classes (such as [:alpha:])
                          use UTF8::R2 qw(%mb);                      does not support named characters (such as \N{GREEK SMALL LETTER EPSILON}, \N{greek:epsilon}, or \N{epsilon})
                          $mb{qr/$utf8regex/imsxogc}                 does not support character properties (such as \p{PROP} and \P{PROP})

                          Special Escapes in Regex                   Support Perl Version
                          --------------------------------------------------------------------------------------------------
                          $mb{qr/ \x{Unicode} /}                     since perl 5.006
                          $mb{qr/ [^ ... ] /}                        ** CAUTION ** perl 5.006 cannot do this
                          $mb{qr/ \h /}                              since perl 5.010
                          $mb{qr/ \v /}                              since perl 5.010
                          $mb{qr/ \H /}                              since perl 5.010
                          $mb{qr/ \V /}                              since perl 5.010
                          $mb{qr/ \R /}                              since perl 5.010
                          $mb{qr/ \N /}                              since perl 5.012

  ------------------------------------------------------------------------------------------------------------------------------------------
  ?? or m??                 (nothing)
  ------------------------------------------------------------------------------------------------------------------------------------------
  ord                     UTF8::R2::ord($_)                          returns the UTF-8 number (not the Unicode number) of the given UTF-8 codepoint octets
  ------------------------------------------------------------------------------------------------------------------------------------------
  pos                       (nothing)
  ------------------------------------------------------------------------------------------------------------------------------------------
  reverse                 UTF8::R2::reverse(@_)
  ------------------------------------------------------------------------------------------------------------------------------------------
  rindex                  UTF8::R2::rindex($_, 'ABC', 5)             rindex() is compatible and usually useful
  ------------------------------------------------------------------------------------------------------------------------------------------
  s/before/after/imsxoegr s<@{[UTF8::R2::qr(qr/before/imsxo)]}><after>egr
                            or
                          use UTF8::R2 qw(%mb);
                          s<$mb{qr/before/imsxo}><after>egr
  ------------------------------------------------------------------------------------------------------------------------------------------
  split//                 UTF8::R2::split(qr/$utf8regex/imsxo, $_, 3)  *CAUTION* UTF8::R2::split(/re/,$_,3) means UTF8::R2::split($_ =~ /re/,$_,3)
  ------------------------------------------------------------------------------------------------------------------------------------------
  sprintf                   (nothing)
  ------------------------------------------------------------------------------------------------------------------------------------------
  substr                  UTF8::R2::substr($_, 0, 5)                 substr() is compatible and usually useful
                                                                     :lvalue feature needs perl 5.014 or later
  ------------------------------------------------------------------------------------------------------------------------------------------
  tr/// or y///           UTF8::R2::tr($_, 'A-C', 'X-Z', 'cdsr')     hyphenated codepoint ranges support ASCII only
  ------------------------------------------------------------------------------------------------------------------------------------------
  uc                      UTF8::R2::uc($_)                           works as tr/a-z/A-Z/, universally
  ------------------------------------------------------------------------------------------------------------------------------------------
  ucfirst                 UTF8::R2::ucfirst($_)                      see UTF8::R2::uc()
  ------------------------------------------------------------------------------------------------------------------------------------------
  write                     (nothing)
  ------------------------------------------------------------------------------------------------------------------------------------------
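The chr and ord rows speak of a "UTF-8 number" rather than a Unicode number. One plausible reading, sketched here with core Perl only, is that the octets of the encoded character are read as a single big-endian integer. The helper names utf8_number_of and octets_from_utf8_number are hypothetical and are not part of UTF8::R2; the sketch only illustrates the idea and makes no claim about the module's internals.

```perl
use strict;
use warnings;

# Hypothetical helper: read the octets of one UTF-8 encoded character
# as a single big-endian base-256 integer.
sub utf8_number_of {
    my ($octets) = @_;
    my $number = 0;
    $number = $number * 0x100 + $_ for unpack 'C*', $octets;
    return $number;
}

# Hypothetical inverse: split the integer back into its octets.
sub octets_from_utf8_number {
    my ($number) = @_;
    my @octets;
    do {
        unshift @octets, $number % 0x100;
        $number = int($number / 0x100);
    } while ($number > 0);
    return pack 'C*', @octets;
}

# "あ" is U+3042 in Unicode; its UTF-8 encoding is the octets E3 81 82.
my $a_hiragana = "\xE3\x81\x82";

printf "UTF-8 number:   0x%X\n", utf8_number_of($a_hiragana);   # 0xE38182
printf "Unicode number: 0x%X\n", 0x3042;                        # 0x3042
```

Under this reading, the number can be written down directly from a hex dump of the file, with no decoding step in between.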

UTF8 Flag Considered Harmful, and Our Goals

See page 401, Chapter 15: Unicode, of Programming Perl, Third Edition (ISBN 0-596-00027-8).

Before the introduction of Unicode support in perl, the eq operator just compared the byte-strings represented by two scalars. Beginning with perl 5.8, eq compares two byte-strings with simultaneous consideration of the UTF8 flag.

-- we have been taught so for a long time.

Perl is a powerful language for everyone, but the UTF8 flag is a barrier for ordinary beginners, because people can only do one task at a time. Calling Encode::encode() and Encode::decode() inside an application program is therefore not the best approach. Writing two separate scripts, one for information processing and one for encoding conversion, may be better. Please trust me.
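As a rough sketch of the "two scripts" idea, the encoding conversion can live in its own small program built on the core Encode module, so that the processing script only ever sees UTF-8 octets. The subroutine name sjis_to_utf8 and the choice of Shift_JIS as the input encoding are assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Hypothetical converter: Shift_JIS octets in, UTF-8 octets out.
# The result is still a plain octet string, so the processing script
# never has to think about the UTF8 flag.
sub sjis_to_utf8 {
    my ($octets) = @_;
    from_to($octets, 'Shift_JIS', 'UTF-8');   # converts $octets in place
    return $octets;
}

# "あ" is 82 A0 in Shift_JIS and E3 81 82 in UTF-8.
my $converted = sjis_to_utf8("\x82\xA0");
print join(' ', map { sprintf '%02X', $_ } unpack 'C*', $converted), "\n";   # E3 81 82
```

Used as a filter, the same subroutine would be applied to each line read from STDIN and the result written to STDOUT, with the information processing script running as a separate step on the converted output.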

 /*
  * You are not expected to understand this.
  */
 
  Information processing model beginning with perl 5.8
 
    +----------------------+---------------------+
    |     Text strings     |                     |
    +----------+-----------|    Binary strings   |
    |  UTF-8   |  Latin-1  |                     |
    +----------+-----------+---------------------+
    | UTF8     |            Not UTF8             |
    | Flagged  |            Flagged              |
    +--------------------------------------------+
    http://perl-users.jp/articles/advent-calendar/2010/casual/4

  The confusion in Perl's string model comes from the double meaning
  of "binary string."
  "Binary string" can mean either
  1. Non-Text string
  2. Digital octet string

  Let's draw the diagram again using those terms.
 
    +----------------------+---------------------+
    |     Text strings     |                     |
    +----------+-----------|   Non-Text strings  |
    |  UTF-8   |  Latin-1  |                     |
    +----------+-----------+---------------------+
    | UTF8     |            Not UTF8             |
    | Flagged  |            Flagged              |
    +--------------------------------------------+
    |            Digital octet string            |
    +--------------------------------------------+

Some people do not agree with the change to the character string processing model made in Perl 5.8. It is impossible to win the agreement of the majority of Perl users, who hardly ever use Perl. To solve this by returning to the original method, let's drag out page 402 of Programming Perl, 3rd ed. again.

  Information processing model beginning with perl3, or with this
  software, in the spirit of UNIX/C-ism.

    +--------------------------------------------+
    |    Text string as Digital octet string     |
    |    Digital octet string as Text string     |
    +--------------------------------------------+
    |       Not UTF8 Flagged, No Mojibake        |
    +--------------------------------------------+

  In UNIX Everything is a File
  - In UNIX everything is a stream of bytes
  - In UNIX the filesystem is used as a universal name space

  Native Encoding Scripting
  - native encoding of file contents
  - native encoding of file name on filesystem
  - native encoding of command line
  - native encoding of environment variable
  - native encoding of API
  - native encoding of network packet
  - native encoding of database

Ideally, we'd like to achieve these five goals:

  • Goal #1:

    Old byte-oriented programs should not spontaneously break on the old byte-oriented data they used to work on.

    This goal has been achieved because this software is additional code for perl, like the utf8 pragma. If nothing is added, Perl works the same as it always has.

  • Goal #2:

    Old byte-oriented programs should magically start working on the new character-oriented data when appropriate.

    Not "magically." You must decide, case by case, whether to write octet semantics or UTF-8 codepoint semantics yourself. Perhaps almost all regular expressions should have UTF-8 codepoint semantics, and everything else should have octet semantics.

  • Goal #3:

    Programs should run just as fast in the new character-oriented mode as in the old byte-oriented mode.

    It is almost possible, because the UTF-8 encoding doesn't need multibyte anchoring in regular expressions.

  • Goal #4:

    Perl should remain one language, rather than forking into a byte-oriented Perl and a character-oriented Perl.

    The UTF8::R2 module keeps Perl one language and one interpreter by providing codepoint semantics subroutines.

  • Goal #5:

    UTF8::R2 module users will be able to maintain it in Perl.

    May the UTF8::R2 be with you, always.
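The remark under Goal #3 about multibyte anchoring can be illustrated with core Perl alone. In UTF-8, ASCII octets never appear inside a multibyte character, and lead octets and continuation octets come from disjoint ranges, so a plain octet search cannot begin matching in the middle of a character. Shift_JIS lacks this property, which is why it is used as the contrast here; the variable names and the choice of the katakana "ソ" are illustrative assumptions.

```perl
use strict;
use warnings;

my $backslash = "\x5C";

my $sjis_so = "\x83\x5C";       # "ソ" in Shift_JIS: the trail octet is 0x5C
my $utf8_so = "\xE3\x82\xBD";   # "ソ" in UTF-8: no ASCII octet inside

# An octet-level search for a backslash, with no anchoring at all.
my $sjis_hit = index($sjis_so, $backslash);
my $utf8_hit = index($utf8_so, $backslash);

print "Shift_JIS: $sjis_hit\n";   # 1  (false hit inside the character)
print "UTF-8:     $utf8_hit\n";   # -1 (no hit, as it should be)
```

With Shift_JIS data the search has to be anchored on character boundaries to avoid the false hit; with UTF-8 data the unanchored octet search is already correct, which is what keeps the codepoint-oriented path close in cost to the byte-oriented one.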

Back when Programming Perl, 3rd ed. was written, the UTF8 flag had not yet been born, and Perl was designed to make the easy jobs easy. This software provides a programming environment like the one of that time.

Perl's motto

   Some computer scientists (the reductionists, in particular) would
  like to deny it, but people have funny-shaped minds. Mental geography
  is not linear, and cannot be mapped onto a flat surface without
  severe distortion. But for the last score years or so, computer
  reductionists have been first bowing down at the Temple of Orthogonality,
  then rising up to preach their ideas of ascetic rectitude to any who
  would listen.
 
   Their fervent but misguided desire was simply to squash your mind to
  fit their mindset, to smush your patterns of thought into some sort of
  Hyperdimensional Flatland. It's a joyless existence, being smushed.
  --- Learning Perl on Win32 Systems

  If you think this is a big headache, you're right. No one likes
  this situation, but Perl does the best it can with the input and
  encodings it has to deal with. If only we could reset history and
  not make so many mistakes next time.
  --- Learning Perl 6th Edition

   The most important thing for most people to know about handling
  Unicode data in Perl, however, is that if you don't ever use any
  Unicode data -- if none of your files are marked as UTF-8 and you don't
  use UTF-8 locales -- then you can happily pretend that you're back in
  Perl 5.005_03 land; the Unicode features will in no way interfere with
  your code unless you're explicitly using them. Sometimes the twin
  goals of embracing Unicode but not disturbing old-style byte-oriented
  scripts has led to compromise and confusion, but it's the Perl way to
  silently do the right thing, which is what Perl ends up doing.
  --- Advanced Perl Programming, 2nd Edition

AUTHOR

INABA Hitoshi <ina@cpan.org>

This project was originated by INABA Hitoshi.

LICENSE AND COPYRIGHT

This software is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See perlartistic.

This software is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.