NAME
UTF8::R2 - makes UTF-8 scripting easy for enterprise use
SYNOPSIS
Any one of them:
UTF-8 subroutines:

    mb::length($_)
    mb::substr($_, 0, 5)
    mb::tr($_, 'ABC', 'XYZ', 'cdsr')

UTF-8 regular expressions:

    $mb_regex =~ $mb{qr/ utf8_regex_here . \D \H \N \R \S \V \W \b \d \h \s \v \w \x{UTF8hex} [ \D \H \S \V \W \b \d \h \s \v \w \x{UTF8hex} \x{UTF8hex}-\x{UTF8hex} [:POSIX:] [:^POSIX:] ] ? + * {n} {n,} {n,m} /imsxo} # no /gc
    $_ =~ m<\G$mb{qr/$utf8regex/imsxo}>gc
    $_ =~ s<$mb{qr/before/imsxo}><after>egr
    mb::split(qr/$utf8regex/imsxo, $_, 3)

supported encodings:

    UTF-8(RFC3629), UTF-8(RFC2279), WTF8, RFC3629.ja_JP, and WTF8.ja_JP

supported perl versions:

    perl version 5.005_03 to newest perl
INSTALLATION BY MAKE-COMMAND
To install this software by make, type the following:

    perl Makefile.PL
    make
    make test
    make install
INSTALLATION WITHOUT MAKE-COMMAND (for DOS-like system)
To install this software without make, type the following:

    pmake.bat test
    pmake.bat install
DESCRIPTION
It may sound a little ambitious, but the UTF8::R2 module aims to replace the utf8 pragma.
Perl is said to have been able to handle Unicode since version 5.8. However, unlike JPerl, the principle "Easy jobs must be easy" has been lost.
This software has the following features:
supports UTF-8 literals of Perl scripts
supports UTF-8(RFC3629), UTF-8(RFC2279), WTF8, RFC3629.ja_JP, and WTF8.ja_JP
does not use the UTF8 flag to avoid MOJIBAKE
handles raw encoding to support GAIJI
supports codepoint classes in regular expressions to work as UTF-8 codepoint
does not change features of octet-oriented built-in functions
You have to use mb::* subroutines if you want codepoint semantics
lc(), lcfirst(), uc(), and ucfirst() convert US-ASCII only
codepoint range by hyphen of mb::tr() supports US-ASCII only
UTF-8 like Encodings supported by this software
The encodings supported by this software and their range of octets are as follows.
UTF-8 (RFC2279)
-------------------------------------
1st 2nd 3rd 4th
-------------------------------------
C2..DF 80..BF
E0..EF 80..BF 80..BF
F0..F4 80..BF 80..BF 80..BF
00..7F
-------------------------------------
https://www.ietf.org/rfc/rfc2279.txt
needs no multibyte anchoring
needs no escaping meta char of 2nd-4th octets
safe US-ASCII casefolding of 2nd-4th octet
allows encoding surrogate codepoints even when they are not paired
UTF-8 (RFC3629)
-------------------------------------
1st 2nd 3rd 4th
-------------------------------------
C2..DF 80..BF
E0..E0 A0..BF 80..BF
E1..EC 80..BF 80..BF
ED..ED 80..9F 80..BF
EE..EF 80..BF 80..BF
F0..F0 90..BF 80..BF 80..BF
F1..F3 80..BF 80..BF 80..BF
F4..F4 80..8F 80..BF 80..BF
00..7F
-------------------------------------
https://en.wikipedia.org/wiki/UTF-8
needs no multibyte anchoring
needs no escaping meta char of 2nd-4th octets
safe US-ASCII casefolding of 2nd-4th octet
rejects encoded surrogate codepoints (ED A0..BF xx), whether paired or not
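The RFC3629 octet ranges above can be transcribed row by row into an ordinary octet-oriented regular expression. The following is only a sketch in core Perl, not the code this module uses internally; the names `$rfc3629`, `is_rfc3629`, and `count_codepoints` are hypothetical.

```perl
use strict;
use warnings;

# One codepoint, transcribed row by row from the RFC3629 octet table.
my $rfc3629 = qr{
      [\x00-\x7F]
    | [\xC2-\xDF] [\x80-\xBF]
    | \xE0        [\xA0-\xBF] [\x80-\xBF]
    | [\xE1-\xEC] [\x80-\xBF] [\x80-\xBF]
    | \xED        [\x80-\x9F] [\x80-\xBF]
    | [\xEE-\xEF] [\x80-\xBF] [\x80-\xBF]
    | \xF0        [\x90-\xBF] [\x80-\xBF] [\x80-\xBF]
    | [\xF1-\xF3] [\x80-\xBF] [\x80-\xBF] [\x80-\xBF]
    | \xF4        [\x80-\x8F] [\x80-\xBF] [\x80-\xBF]
}x;

# Returns 1 when the whole octet string is a sequence of well-formed codepoints.
sub is_rfc3629 {
    my ($octets) = @_;
    return $octets =~ /\A(?:$rfc3629)*\z/ ? 1 : 0;
}

# Counts codepoints; assumes the input already passed is_rfc3629().
sub count_codepoints {
    my ($octets) = @_;
    my $count = () = $octets =~ /$rfc3629/g;
    return $count;
}

# "A", one 3-octet codepoint, and one 4-octet codepoint, written as raw octets.
print count_codepoints("A\xE3\x81\x82\xF0\x9F\x98\x80"), "\n";
```

Because the ED row stops at 9F, an encoded surrogate such as ED A0 80 fails `is_rfc3629()`, which is the difference between this table and the WTF-8 table.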
WTF-8
-------------------------------------
1st 2nd 3rd 4th
-------------------------------------
C2..DF 80..BF
E0..E0 A0..BF 80..BF
E1..EF 80..BF 80..BF
F0..F0 90..BF 80..BF 80..BF
F1..F3 80..BF 80..BF 80..BF
F4..F4 80..8F 80..BF 80..BF
00..7F
-------------------------------------
http://simonsapin.github.io/wtf-8/
superset of UTF-8 that encodes surrogate codepoints if they are not in a pair
needs no multibyte anchoring
needs no escaping meta char of 2nd-4th octets
safe US-ASCII casefolding of 2nd-4th octet
UTF-8 (RFC3629.ja_JP)
-------------------------------------
1st 2nd 3rd 4th
-------------------------------------
E1..EC 80..BF 80..BF
C2..DF 80..BF
EE..EF 80..BF 80..BF
F0..F0 90..BF 80..BF 80..BF
E0..E0 A0..BF 80..BF
ED..ED 80..9F 80..BF
F1..F3 80..BF 80..BF 80..BF
F4..F4 80..8F 80..BF 80..BF
00..7F
-------------------------------------
https://en.wikipedia.org/wiki/UTF-8
needs no multibyte anchoring
needs no escaping meta char of 2nd-4th octets
safe US-ASCII casefolding of 2nd-4th octet
rejects encoded surrogate codepoints (ED A0..BF xx), whether paired or not
optimized for ja_JP
WTF-8.ja_JP
-------------------------------------
1st 2nd 3rd 4th
-------------------------------------
E1..EF 80..BF 80..BF
C2..DF 80..BF
E0..E0 A0..BF 80..BF
F0..F0 90..BF 80..BF 80..BF
F1..F3 80..BF 80..BF 80..BF
F4..F4 80..8F 80..BF 80..BF
00..7F
-------------------------------------
http://simonsapin.github.io/wtf-8/
superset of UTF-8 that encodes surrogate codepoints if they are not in a pair
needs no multibyte anchoring
needs no escaping meta char of 2nd-4th octets
safe US-ASCII casefolding of 2nd-4th octet
optimized for ja_JP
UTF-8 subroutines provided by this software
This software provides the traditional features "as was." The new UTF-8 features are provided by subroutines with new names. If you like the utf8 pragma, the mb::* subroutines will help you. On the other hand, if you love JPerl, those subroutines will not help you very much. The traditional functions of Perl remain useful today with octet-oriented semantics.
elder <<<---                              age                              --->>> younger
---------------------------------------------------------------------------------------------------
bare Perl5      JPerl5          utf8 pragma     mb.pm modulino          UTF8::R2 module
---------------------------------------------------------------------------------------------------
chop            ---             ---             chop                    chop
chr             chr             bytes::chr      chr                     chr
getc            getc            ---             getc                    getc
index           ---             bytes::index    index                   index
lc              ---             ---             CORE::lc                CORE::lc (= tr/A-Z/a-z/)
lcfirst         ---             ---             CORE::lcfirst           CORE::lcfirst (= tr/A-Z/a-z/)
length          length          bytes::length   length                  length
ord             ord             bytes::ord      ord                     ord
reverse         reverse         ---             reverse                 reverse
rindex          ---             bytes::rindex   rindex                  rindex
substr          substr          bytes::substr   substr                  substr
uc              ---             ---             CORE::uc                CORE::uc (= tr/a-z/A-Z/)
ucfirst         ---             ---             CORE::ucfirst           CORE::ucfirst (= tr/a-z/A-Z/)
---             chop            chop            mb::chop                mb::chop
---             ---             chr             mb::chr                 mb::chr
---             ---             getc            mb::getc                mb::getc
---             index           ---             mb::index_byte          mb::index_byte
---             ---             index           mb::index               mb::index
---             lc              ---             lc                      lc (= mb::lc)
---             lcfirst         ---             lcfirst                 lcfirst (= mb::lcfirst)
---             ---             length          mb::length              mb::length
---             ---             ord             mb::ord                 mb::ord
---             ---             reverse         mb::reverse             mb::reverse
---             rindex          ---             mb::rindex_byte         mb::rindex_byte
---             ---             rindex          mb::rindex              mb::rindex
---             ---             substr          mb::substr              mb::substr
---             uc              ---             uc                      uc (= mb::uc)
---             ucfirst         ---             ucfirst                 ucfirst (= mb::ucfirst)
---             ---             lc              (mb::Casing::lc)        (mb::Casing::lc)
---             ---             lcfirst         (mb::Casing::lcfirst)   (mb::Casing::lcfirst)
---             ---             uc              (mb::Casing::uc)        (mb::Casing::uc)
---             ---             ucfirst         (mb::Casing::ucfirst)   (mb::Casing::ucfirst)
---------------------------------------------------------------------------------------------------
do 'file'       ---             do 'file'       do 'file'               do 'file'
eval 'string'   ---             eval 'string'   eval 'string'           eval 'string'
require 'file'  ---             require 'file'  require 'file'          require 'file'
no Module       ---             no Module       no Module               no Module
---             do 'file'       do 'file'       mb::do 'file'           mb::do 'file'
---             eval 'string'   eval 'string'   mb::eval 'string'       mb::eval 'string'
---             require 'file'  require 'file'  mb::require 'file'      mb::require 'file'
---             no Module       no Module       mb::no Module           no Module
$^X             ---             $^X             $^X                     $^X
---             $^X             $^X             $mb::PERL               $mb::PERL
$0              $0              $0              $mb::ORIG_PROGRAM_NAME  $mb::ORIG_PROGRAM_NAME
---             ---             ---             $0                      $0
---------------------------------------------------------------------------------------------------
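The octet-oriented rows of the table above can be checked with core Perl alone. The sketch below is not part of this module: it shows what the unprefixed built-ins do to a UTF-8 octet string when no utf8 pragma is in effect, next to a hypothetical `codepoint_reverse()` helper that assumes well-formed UTF-8 input.

```perl
use strict;
use warnings;

# Two codepoints stored as six octets; no utf8 pragma, so no UTF8 flag.
my $nihon = "\xE6\x97\xA5\xE6\x9C\xAC";

my $octet_length   = length($nihon);          # counts octets
my $first_octet    = substr($nihon, 0, 1);    # cuts inside the first codepoint
my $octet_reversed = scalar reverse($nihon);  # reverses octet by octet

# Hypothetical helper: reverse codepoint by codepoint.
# A codepoint is one lead octet followed by its continuation octets (80..BF).
sub codepoint_reverse {
    my ($octets) = @_;
    return join '', reverse($octets =~ /[\x00-\x7F\xC2-\xF4][\x80-\xBF]*/g);
}

my $codepoint_reversed = codepoint_reverse($nihon);
```

The octet-wise reverse produces an ill-formed sequence, while the codepoint-wise version keeps each 3-octet unit intact. That gap is what the mb::* rows of the table are for.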
index brothers
------------------------------------------------------------------------------------------
functions or subs    works as     returns as    considered
------------------------------------------------------------------------------------------
index                octet        octet         useful, bare Perl like
rindex               octet        octet         useful, bare Perl like
mb::index            codepoint    codepoint     not so useful, utf8 pragma like
mb::rindex           codepoint    codepoint     not so useful, utf8 pragma like
mb::index_byte       codepoint    octet         useful, JPerl like
mb::rindex_byte      codepoint    octet         useful, JPerl like
------------------------------------------------------------------------------------------
The most useful of the above are mb::index_byte() and mb::rindex_byte(), but it's more convenient to use regular expressions than those. So you can forget about these subroutines.
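The "returns as" column is the difference between an octet offset and a codepoint offset. As a rough sketch in core Perl (the helper name `codepoint_index` is hypothetical and this is not how the module implements mb::index), an octet offset from the built-in index() can be turned into a codepoint offset by counting the lead octets that precede it:

```perl
use strict;
use warnings;

# Hypothetical helper: position of $substr in $octets, counted in codepoints.
# Assumes both arguments are well-formed UTF-8 octet strings.
sub codepoint_index {
    my ($octets, $substr) = @_;
    my $octet_pos = index($octets, $substr);   # built-in: octet offset
    return -1 if $octet_pos < 0;
    my $prefix = substr($octets, 0, $octet_pos);
    # Every octet outside 80..BF starts a codepoint.
    my $count = () = $prefix =~ /[^\x80-\xBF]/g;
    return $count;
}

# Two 3-octet codepoints followed by "A".
my $text = "\xE3\x81\x82\xE3\x81\x84A";
my $as_octets     = index($text, "A");
my $as_codepoints = codepoint_index($text, "A");
```

An octet offset can be passed straight to substr(), which is one reason the octet-returning variants are rated as the useful ones in the table.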
Codepoint-Semantics Regular Expression
This software adds the ability to handle UTF-8 codepoints to bare Perl; it does not provide the ability to handle characters and graphemes. Because this module overrides nothing, the functions of bare Perl continue to provide octet semantics. UTF-8 codepoint semantics for regular expressions are provided by a new syntax. "tr///" has nothing to do with regular expressions, but it is listed here for convenience.
------------------------------------------------------------------------------------------------------------------------------------------
Octet-semantics          UTF-8 Codepoint-semantics
------------------------------------------------------------------------------------------------------------------------------------------
// or m// or qr//        $mb{qr/ utf8_regex_here . \D \H \N \R \S \V \W \b \d \h \s \v \w \x{UTF8hex} [ \D \H \S \V \W \b \d \h \s \v \w \x{UTF8hex} \x{UTF8hex}-\x{UTF8hex} [:POSIX:] [:^POSIX:] ] ? + * {n} {n,} {n,m} /imsxo}

                         $mb{qr/$utf8regex/imsxo}               modifiers i, m, s, x, o work at compile time
                         m<\G$mb{qr/$utf8regex/imsxo}>gc        modifiers g, c work at run time

                         Special Escapes in Regex               Support Perl Version
                         --------------------------------------------------------------------------------------------------
                         $mb{qr/ \x{UTF8hex} /}                 since perl 5.005
                         $mb{qr/ [\x{UTF8hex}] /}               since perl 5.005
                         $mb{qr/ [[:POSIX:]] /}                 since perl 5.005
                         $mb{qr/ [[:^POSIX:]] /}                since perl 5.005
                         $mb{qr/ [^ ... ] /}                    ** CAUTION ** perl 5.006 cannot do this
                         $mb{qr/ [\x{UTF8hex}-\x{UTF8hex}] /}   since perl 5.008
                         $mb{qr/ \h /}                          since perl 5.010
                         $mb{qr/ \v /}                          since perl 5.010
                         $mb{qr/ \H /}                          since perl 5.010
                         $mb{qr/ \V /}                          since perl 5.010
                         $mb{qr/ \R /}                          since perl 5.010
                         $mb{qr/ \N /}                          since perl 5.012
                         --------------------------------------------------------------------------------------------------
                         (max \x{UTF8hex} is \x{7FFFFFFF}, so 4 octet codepoints cannot be written this way, pardon me please!)
------------------------------------------------------------------------------------------------------------------------------------------
s/before/after/imsxoegr  s<$mb{qr/before/imsxo}><after>egr
------------------------------------------------------------------------------------------------------------------------------------------
split//                  mb::split(qr/$utf8regex/imsxo, $_, 3)
                         *CAUTION* mb::split(/re/,$_,3) means mb::split($_ =~ /re/,$_,3)
------------------------------------------------------------------------------------------------------------------------------------------
tr/// or y///            mb::tr($_, 'A-C', 'X-Z', 'cdsr')
                         range of codepoint by hyphen supports ASCII only
------------------------------------------------------------------------------------------------------------------------------------------
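The CAUTION on the mb::split row follows from how Perl evaluates arguments of an ordinary subroutine: a bare /re/ is executed as a match against $_ before the call, so the subroutine never sees the pattern. The sketch below demonstrates this with a hypothetical stand-in subroutine, `first_argument_kind`, instead of mb::split itself.

```perl
use strict;
use warnings;

# Hypothetical stand-in for any subroutine that expects a pattern as its first argument.
sub first_argument_kind {
    return ref($_[0]) eq 'Regexp' ? 'pattern object' : 'match result';
}

my ($with_qr, $with_bare);
for ("a,b,c") {                                    # aliases $_ to the string
    $with_qr   = first_argument_kind(qr/,/, $_);   # receives the compiled pattern
    $with_bare = first_argument_kind(/,/,  $_);    # receives the result of $_ =~ /,/
}

print "qr/,/ arrives as a $with_qr\n";
print "/,/   arrives as a $with_bare\n";
```

Only the built-in split receives special treatment for a bare /re/, so any user-defined replacement has to be given qr/.../.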
Porting from script in bare Perl4, and bare Perl5
If you want to write US-ASCII scripts from now on, or port existing US-ASCII scripts to the UTF8::R2 environment
Write scripts the usual way. Running a US-ASCII script with UTF8::R2 lets you treat UTF-8 codepoints as I/O data.
Porting from script in JPerl4, and JPerl5
If you want to port existing JPerl scripts to the UTF8::R2 environment
There are only a few places that need to be rewritten. If you write the functionality of "index()" and "rindex()" in regular expressions, the only difference left is "chop()". If you want a "chop()" like JPerl's, write "mb::chop()" in the UTF8::R2 environment.
-----------------------------------------------------------------
original script in JPerl      script with UTF8::R2
-----------------------------------------------------------------
chop                          mb::chop
index                         mb::index_byte
rindex                        mb::rindex_byte
-----------------------------------------------------------------
In practice, however, it is ...
-----------------------------------------------------------------
original script in JPerl      script with UTF8::R2
-----------------------------------------------------------------
chop                          95% to chomp, 4% to mb::chop, 1% to chop
index                         (already written in regular expression)
rindex                        (already written in regular expression)
-----------------------------------------------------------------
In short, JPerl users can write programs the same way they used to.
Porting from script with utf8 pragma
If you want to port existing scripts that use the utf8 pragma to the UTF8::R2 environment
As in the traditional style, Perl's built-in functions without package names provide octet-oriented functionality. Thus "length()" and "substr()" always work on octets. Whenever you need multibyte functionality, you must use the subroutines in the "mb::" package.
-----------------------------------------------------------------
original script with utf8 pragma    script with UTF8::R2
-----------------------------------------------------------------
chop                                mb::chop
chr                                 mb::chr
getc                                mb::getc
index                               mb::index
lc                                  ---
lcfirst                             ---
length                              mb::length
ord                                 mb::ord
reverse                             mb::reverse
rindex                              mb::rindex
substr                              mb::substr
uc                                  ---
ucfirst                             ---
-----------------------------------------------------------------
Porting from script with mb.pm modulino
You can call subroutines by mb.pm-like names using "use UTF8::R2 qw(*mb);".
Add this line first
Add $mb{...} (or "mb::" of mb::split) to UTF-8 regular expressions like this

    $_ =~ $mb{qr/ utf8_regex_here /imsxo}
    $_ =~ m<\G$mb{qr/ utf8_regex_here /imsxo}>gc
    $_ =~ s<$mb{qr/ before /imsxo}><after>egr
    mb::split(qr/ utf8_regex_here /imsxo, ...);   # *MUST* qr/.../, *NOT* /.../

Use mb::tr() subroutine for tr/// that supports UTF-8
Have to write like this

    mb::tr($_, 'ABC', 'XYZ', 'cdsr');

Instead of this

    $_ =~ tr/ABC/XYZ/cdsr;
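To see why the built-in tr/// is not enough for UTF-8, recall that it maps single octets, so a multi-octet codepoint in its search list would be taken apart. The following core-Perl sketch maps whole codepoints instead. It is a hypothetical illustration (`codepoint_tr` is not the module's mb::tr(), and it supports neither ranges nor the 'cdsr' modifiers), and it assumes well-formed UTF-8 arguments with search and replacement lists of equal length.

```perl
use strict;
use warnings;

# Hypothetical helper: one-to-one transliteration of whole codepoints.
sub codepoint_tr {
    my ($octets, $search, $replace) = @_;
    # One lead octet followed by its continuation octets (80..BF).
    my $codepoint = qr/[\x00-\x7F\xC2-\xF4][\x80-\xBF]*/;
    my @from = $search  =~ /$codepoint/g;
    my @to   = $replace =~ /$codepoint/g;
    die "search and replacement lists must have equal length\n" unless @from == @to;
    my %map;
    @map{@from} = @to;
    # Replace each codepoint that has a mapping; leave the others untouched.
    $octets =~ s/($codepoint)/exists $map{$1} ? $map{$1} : $1/ge;
    return $octets;
}

# Map one 3-octet codepoint to another and "b" to "B".
my $result = codepoint_tr("\xE3\x81\x82b\xE3\x81\x84", "\xE3\x81\x82b", "\xE3\x81\x84B");
```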
Use mb::* subroutines
You can use subroutines by mb.pm-like names.
subroutines to scripts born in mb.pm modulino
--------------------------------------------------
mb.pm                   script with UTF8::R2
--------------------------------------------------
mb::chop                mb::chop
mb::chr                 mb::chr
mb::do 'file'           mb::do 'file'
mb::eval 'string'       mb::eval 'string'
mb::getc                mb::getc
mb::index               mb::index
mb::index_byte          mb::index_byte
mb::length              mb::length
mb::ord                 mb::ord
mb::require 'file'      mb::require 'file'
mb::reverse             mb::reverse
mb::rindex              mb::rindex
mb::rindex_byte         mb::rindex_byte
mb::substr              mb::substr
--------------------------------------------------
However...
Use mb::* variables
You can use variables by mb.pm-like names.
variables to scripts born in mb.pm modulino
--------------------------------------------------
mb.pm                   script with UTF8::R2
--------------------------------------------------
$mb::PERL               $mb::PERL
$mb::ORIG_PROGRAM_NAME  $mb::ORIG_PROGRAM_NAME
--------------------------------------------------
DEPENDENCIES
This UTF8::R2 module requires perl 5.00503 or later. It also requires the 'strict' module and, on perl 5.6 or later, the 'warnings' module.
Our Goals
See p. 401, Chapter 15: Unicode, of Programming Perl, Third Edition (ISBN 0-596-00027-8).
Before the introduction of Unicode support in perl, the eq operator just compared the byte-strings represented by two scalars. Beginning with perl 5.8, eq compares two byte-strings with simultaneous consideration of the UTF8 flag.
"I/O flow" https://metacpan.org/pod/perlunitut#I/O-flow-(the-actual-5-minute-tutorial) shows us this
The typical input/output flow of a program is:
Receive and decode
Process
Encode and output
-- we have been taught so for a long time.
However,

    Every inside has its inside that has its inside that has its inside that has ...
    Every outside has its outside that has its outside that has its outside that has ...
We know that every inside has a further inside, and every outside has a further outside. There is never only one inside, and never only one outside. So the string model of Perl 5.8 does not fit our common way of thinking.
The spread of EMOJI in MBCS encodings today reminds us that this idea is not a bad one.
UTF8 flag is harmful.
Information processing model beginning with perl 5.8
+----------------------+---------------------+
| Text strings | |
+----------+-----------| Binary strings |
| UTF-8 | Latin-1 | |
+----------+-----------+---------------------+
| UTF8 | Not UTF8 |
| Flagged | Flagged |
+--------------------------------------------+
http://perl-users.jp/articles/advent-calendar/2010/casual/4
Because "Binary string" has a double meaning, Perl's string model is somewhat confusing.
It has the following two meanings:
Non-Text string
Digital octet string
Let's draw the model again using them.
+----------------------+---------------------+
| Text strings | |
+----------+-----------| Non-Text strings |
| UTF-8 | Latin-1 | |
+----------+-----------+---------------------+
| UTF8 | Not UTF8 |
| Flagged | Flagged |
+--------------------------------------------+
| Digital octet string |
+--------------------------------------------+
Perl 5.8's string model will not be accepted by common people.
Information processing model of UNIX/C-ism
Information processing model of perl3 or later
Information processing model of this software
+--------------------------------------------+
| Text string as Digital octet string |
| Digital octet string as Text string |
+--------------------------------------------+
| Not UTF8 Flagged, No MOJIBAKE |
+--------------------------------------------+
In UNIX Everything is a File
In UNIX everything is a stream of bytes
In UNIX the filesystem is used as a universal name space
Native Encoding Scripting is ...
native encoding of file contents
native encoding of file name on filesystem
native encoding of command line
native encoding of environment variable
native encoding of API
native encoding of network packet
native encoding of database
Ideally, we'd like to achieve these five Goals:
Goal #1:
Old byte-oriented programs should not spontaneously break on the old byte-oriented data they used to work on.
This software attempts to achieve this goal by keeping the built-in functions working stably, as they traditionally did.
Goal #2:
Old byte-oriented programs should magically start working on the new character-oriented data when appropriate.
This software is not a magician, so it cannot read your mind and act on it.
You must decide for yourself, case by case, whether to write octet semantics or codepoint semantics.
figure of Goal #1 and Goal #2.

                             Goal #1 Goal #2
                      (a)     (b)     (c)     (d)     (e)
    +--------------+-------+-------+-------+-------+-------+
    | data         |  Old  |  Old  |  New  |  Old  |  New  |
    +--------------+-------+-------+-------+-------+-------+
    | script       |  Old  |      Old      |      New      |
    +--------------+-------+---------------+---------------+
    | interpreter  |  Old  |              New              |
    +--------------+-------+-------------------------------+
    Old --- Old byte-oriented
    New --- New codepoint-oriented
Data, script, and interpreter, each old or new, combine in five ways, (a) to (e). Let's add JPerl, utf8 pragma, and this software.

                      (a)     (b)     (c)     (d)     (e)
                                     JPerl
                                    UTF8::R2          utf8
    +--------------+-------+-------+-------+-------+-------+
    | data         |  Old  |  Old  |  New  |  Old  |  New  |
    +--------------+-------+-------+-------+-------+-------+
    | script       |  Old  |      Old      |      New      |
    +--------------+-------+---------------+---------------+
    | interpreter  |  Old  |              New              |
    +--------------+-------+-------------------------------+
    Old --- Old byte-oriented
    New --- New codepoint-oriented

JPerl is excellent because it sits at position (c). That is, almost no special code needs to be written to process a new codepoint-oriented script.
Goal #3:
Programs should run just as fast in the new character-oriented mode as in the old byte-oriented mode.
This is impossible, because the following extra time is needed.
(1) Time to process codepoint classes in regular expressions
Goal #4:
Perl should remain one language, rather than forking into a byte-oriented Perl and a character-oriented Perl.
JPerl keeps Perl one "language" by forking into two "interpreters." However, the Perl core team did not want a fork of the "interpreter." As a result, the Perl "language" forked, contrary to goal #4.
There is no need to build a special codepoint-oriented perl, because a byte-oriented perl can already handle binary data. This software is just a Perl module for byte-oriented Perl.
And you can get support from the Perl community when you solve the problem with a Perl script.
UTF8::R2 module remains one "language" and one "interpreter."
Goal #5:
UTF8::R2 users will be able to maintain UTF8::R2 by Perl.
May the UTF8::R2 be with you, always.
Back when Programming Perl, 3rd Edition was written, the UTF8 flag had not yet been born and Perl was designed to make the easy jobs easy. This software provides a programming environment like the one of that time.
Perl's Motto
Some computer scientists (the reductionists, in particular) would like to deny it, but people have funny-shaped minds. Mental geography is not linear, and cannot be mapped onto a flat surface without severe distortion. But for the last score years or so, computer reductionists have been first bowing down at the Temple of Orthogonality, then rising up to preach their ideas of ascetic rectitude to any who would listen.
Their fervent but misguided desire was simply to squash your mind to fit their mindset, to smush your patterns of thought into some sort of Hyperdimensional Flatland. It's a joyless existence, being smushed.
--- Learning Perl on Win32 Systems
If you think this is a big headache, you're right. No one likes this situation, but Perl does the best it can with the input and encodings it has to deal with. If only we could reset history and not make so many mistakes next time.
--- Learning Perl 6th Edition
The most important thing for most people to know about handling Unicode data in Perl, however, is that if you don't ever use any Unicode data -- if none of your files are marked as UTF-8 and you don't use UTF-8 locales -- then you can happily pretend that you're back in Perl 5.005_03 land; the Unicode features will in no way interfere with your code unless you're explicitly using them. Sometimes the twin goals of embracing Unicode but not disturbing old-style byte-oriented scripts has led to compromise and confusion, but it's the Perl way to silently do the right thing, which is what Perl ends up doing.
--- Advanced Perl Programming, 2nd Edition
However, the ability to have any character in a string means you can create, scan, and manipulate raw binary data as string -- something with which many other utilities would have great difficulty.
--- Learning Perl 8th Edition
Combinations of UTF8::R2 Module and Other Modules
The following describes all the situations in which this software is used in Japan.
+-------------+--------------+---------------------------------------------------------------------+
| OS encoding | I/O encoding | script encoding |
| | |----------------------------------+----------------------------------+
| | | Sjis | UTF-8 |
+-------------+--------------+----------------------------------+----------------------------------+
| | | > perl mb.pm script.pl | |
| | Sjis | | |
| | | | |
| Sjis +--------------+----------------------------------+----------------------------------+
| | UTF-8 | | |
+-------------+--------------+----------------------------------+----------------------------------+
| | | $ perl mb.pm -e sjis script.pl | |
| | Sjis | | |
| UTF-8 +--------------+----------------------------------+----------------------------------+
| | UTF-8 | | |
| | | | |
+-------------+--------------+----------------------------------+----------------------------------+
Description of combinations
----------------------------------------------------------------------
encoding
O-I-S     description
----------------------------------------------------------------------
S-S-S     Best choice when I/O is Sjis encoding
S-S-U
S-U-S
S-U-U     Better choice when I/O is UTF-8 encoding, since not so slow
U-S-S     Better choice when I/O is Sjis encoding, since not so slow
U-S-U
U-U-S
U-U-U     Best choice when I/O is UTF-8 encoding
----------------------------------------------------------------------
When you use Encode::decode and Encode::encode on file contents, *you* and the operators lose two precious things. One is time. The other is the original data. Generally speaking, data conversion loses information unless it converts perfectly one to one. Moreover, if your script has a bug, you will learn of it too late. If you convert the encoding of a file path rather than of file contents, you will learn of the bug as soon as you test it.
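The loss of original data can be reproduced with the core Encode module. This is a minimal sketch; the helper name `round_trip` is hypothetical, and it relies on Encode's default behaviour of substituting U+FFFD for malformed input when no CHECK argument is given.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode octets to a text string, then encode them again.
sub round_trip {
    my ($octets) = @_;
    my $copy = $octets;                 # work on a copy of the input
    return encode('UTF-8', decode('UTF-8', $copy));
}

my $well_formed = "\xE3\x81\x82";       # one valid 3-octet codepoint
my $with_gaiji  = "abc\xFF";            # FF can never appear in UTF-8

print round_trip($well_formed) eq $well_formed ? "kept\n" : "changed\n";
print round_trip($with_gaiji)  eq $with_gaiji  ? "kept\n" : "changed\n";
```

The stray octet comes back as the three octets of U+FFFD, so the original value cannot be recovered from the output. Handling the data as raw octets avoids this conversion step altogether.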
Using mb.pm Modulino vs. Using UTF8::R2 Module
CPAN shows us that there are the mb.pm modulino and the UTF8::R2 module. The mb.pm modulino is a source code filter for MBCS encodings, and the UTF8::R2 module is a utility for UTF-8 support. The following hints help you use the advantages of each.
Advantages Of mb.pm Modulino
supports many MBCS encodings, Big5, Big5-HKSCS, EUC-JP, GB18030, GBK, Sjis(also CP932), UHC, UTF-8, and WTF-8
JPerl-like syntax that supports "easy jobs must be easy"
regexp ("m//", "qr//", and "s///") works as codepoint
"split()" works as codepoint
"tr///" works as codepoint
Disadvantages Of mb.pm Modulino
have to type "perl mb.pm your_script.pl ..." on command line every time
have obtrusive files(your_script.oo.pl)
Advantages Of UTF8::R2 Module
type only "perl your_script.pl ..." on command line
no obtrusive files(your_script.oo.pl)
Disadvantages Of UTF8::R2 Module
supports only UTF-8 encoding
have to write "$mb{qr/regexp/imsxo}" to do "m/regexp/imsxo" that works as codepoint
have to write "m<\G$mb{qr/regexp/imsxo}>gc" to do "m/regexp/imsxogc" that works as codepoint
have to write "s<$mb{qr/before/imsxo}><after>egr" to do "s/before/after/imsxoegr" that works as codepoint
have to write "mb::split(qr/regexp/, $_, 3)" to do "split(/regexp/, $_, 3)" that works as codepoint
have to write "mb::tr($_, 'A-C', 'X-Z', 'cdsr')" to do "$_ =~ tr/A-C/X-Z/cdsr" that works as codepoint
GIVE US BUG REPORT
We have tested and verified this software to the best of our ability. However, software containing this many regular expressions is bound to contain some bugs. Thus, if you happen to find a bug that's in this software and not in your own program, please try to reduce it to a minimal test case and then report it to the author's address. If you have an idea that could make this a more useful tool, please share it.
How To Update This Distribution
Someday all authors of UTF8::R2 module may get run over by a bus.
So we write here how to update this distribution for you.
We wish you good luck.
(MUST) update file "UTF8/R2.pm"
(MUST) update $VERSION of file "UTF8/R2.pm"
(MUST) append to change log to file "Changes"
(if you need) update file "README"
(if you need) update or add files "t/*.t"
(if you need) update file "MANIFEST"
repeat command: pmake test [Enter] until all tests PASS
type command: pmake dist [Enter]
upload *.tar.gz to PAUSE(The [Perl programming] Authors Upload Server)
AUTHOR
INABA Hitoshi <ina@cpan.org>
This project was originated by INABA Hitoshi.
LICENSE AND COPYRIGHT
This software is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See the LICENSE file for details.
This software is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
SEE ALSO
perlunicode, perlunifaq, perluniintro, perlunitut, utf8, bytes,
PERL PUROGURAMINGU
Larry Wall, Randal L.Schwartz, Yoshiyuki Kondo
December 1997
ISBN 4-89052-384-7
http://www.context.co.jp/~cond/books/old-books.html
Programming Perl, Second Edition
By Larry Wall, Tom Christiansen, Randal L. Schwartz
October 1996
Pages: 670
ISBN 10: 1-56592-149-6 | ISBN 13: 9781565921498
Programming Perl, Third Edition
By Larry Wall, Tom Christiansen, Jon Orwant
Third Edition July 2000
Pages: 1104
ISBN 10: 0-596-00027-8 | ISBN 13: 9780596000271
The Perl Language Reference Manual (for Perl version 5.12.1)
by Larry Wall and others
Paperback (6"x9"), 724 pages
Retail Price: $39.95 (pound 29.95 in UK)
ISBN-13: 978-1-906966-02-7
Perl Pocket Reference, 5th Edition
By Johan Vromans
Publisher: O'Reilly Media
Released: July 2011
Pages: 102
Programming Perl, 4th Edition
By: Tom Christiansen, brian d foy, Larry Wall, Jon Orwant
Publisher: O'Reilly Media
Formats: Print, Ebook, Safari Books Online
Released: March 2012
Pages: 1130
Print ISBN: 978-0-596-00492-7 | ISBN 10: 0-596-00492-3
Ebook ISBN: 978-1-4493-9890-3 | ISBN 10: 1-4493-9890-1
Perl Cookbook
By Tom Christiansen, Nathan Torkington
August 1998
Pages: 800
ISBN 10: 1-56592-243-3 | ISBN 13: 978-1-56592-243-3
Perl Cookbook, Second Edition
By Tom Christiansen, Nathan Torkington
Second Edition August 2003
Pages: 964
ISBN 10: 0-596-00313-7 | ISBN 13: 9780596003135
Perl in a Nutshell, Second Edition
By Stephen Spainhour, Ellen Siever, Nathan Patwardhan
Second Edition June 2002
Pages: 760
Series: In a Nutshell
ISBN 10: 0-596-00241-6 | ISBN 13: 9780596002411
Learning Perl on Win32 Systems
By Randal L. Schwartz, Erik Olson, Tom Christiansen
August 1997
Pages: 306
ISBN 10: 1-56592-324-3 | ISBN 13: 9781565923249
Learning Perl, Fifth Edition
By Randal L. Schwartz, Tom Phoenix, brian d foy
June 2008
Pages: 352
Print ISBN:978-0-596-52010-6 | ISBN 10: 0-596-52010-7
Ebook ISBN:978-0-596-10316-3 | ISBN 10: 0-596-10316-6
Learning Perl, 6th Edition
By Randal L. Schwartz, brian d foy, Tom Phoenix
June 2011
Pages: 390
ISBN-10: 1449303587 | ISBN-13: 978-1449303587
Learning Perl, 8th Edition
by Randal L. Schwartz, brian d foy, Tom Phoenix
Released August 2021
Publisher(s): O'Reilly Media, Inc.
ISBN: 9781492094951
Advanced Perl Programming, 2nd Edition
By Simon Cozens
June 2005
Pages: 300
ISBN-10: 0-596-00456-7 | ISBN-13: 978-0-596-00456-9
Perl RESOURCE KIT UNIX EDITION
Futato, Irving, Jepson, Patwardhan, Siever
ISBN 10: 1-56592-370-7
Perl Resource Kit -- Win32 Edition
Erik Olson, Brian Jepson, David Futato, Dick Hardt
ISBN 10:1-56592-409-6
MODAN Perl NYUMON
By Daisuke Maki
2009/2/10
Pages: 344
ISBN 10: 4798119172 | ISBN 13: 978-4798119175
Understanding Japanese Information Processing
By Ken Lunde
January 1900
Pages: 470
ISBN 10: 1-56592-043-0 | ISBN 13: 9781565920439
CJKV Information Processing Chinese, Japanese, Korean & Vietnamese Computing
By Ken Lunde
O'Reilly Media
Print: January 1999
Ebook: June 2009
Pages: 1128
Print ISBN:978-1-56592-224-2 | ISBN 10:1-56592-224-7
Ebook ISBN:978-0-596-55969-4 | ISBN 10:0-596-55969-0
CJKV Information Processing, 2nd Edition
By Ken Lunde
O'Reilly Media
Print: December 2008
Ebook: June 2009
Pages: 912
Print ISBN: 978-0-596-51447-1 | ISBN 10:0-596-51447-6
Ebook ISBN: 978-0-596-15782-1 | ISBN 10:0-596-15782-7
DB2 GIJUTSU ZENSHO
By BM Japan Systems Engineering Co.,Ltd. and IBM Japan, Ltd.
2004/05
Pages: 887
ISBN-10: 4756144659 | ISBN-13: 978-4756144652
Mastering Regular Expressions, Second Edition
By Jeffrey E. F. Friedl
Second Edition July 2002
Pages: 484
ISBN 10: 0-596-00289-0 | ISBN 13: 9780596002893
Mastering Regular Expressions, Third Edition
By Jeffrey E. F. Friedl
Third Edition August 2006
Pages: 542
ISBN 10: 0-596-52812-4 | ISBN 13:9780596528126
Regular Expressions Cookbook
By Jan Goyvaerts, Steven Levithan
May 2009
Pages: 512
ISBN 10:0-596-52068-9 | ISBN 13: 978-0-596-52068-7
Regular Expressions Cookbook, 2nd Edition
By Steven Levithan, Jan Goyvaerts
Released August 2012
Pages: 612
ISBN: 9781449327453
JIS KANJI JITEN
By Kouji Shibano
Pages: 1456
ISBN 4-542-20129-5
UNIX MAGAZINE
1993 Aug
Pages: 172
T1008901080816 ZASSHI 08901-8
Shell Script Magazine vol.41
2016 September
Pages: 64
LINUX NIHONGO KANKYO
By YAMAGATA Hiroo, Stephen J. Turnbull, Craig Oda, Robert J. Bickel
June, 2000
Pages: 376
ISBN 4-87311-016-5
Windows NT Shell Scripting
By Timothy Hill
April 27, 1998
Pages: 400
ISBN 10: 1578700477 | ISBN 13: 9781578700479
Windows(R) Command-Line Administrators Pocket Consultant, 2nd Edition
By William R. Stanek
February 2009
Pages: 594
ISBN 10: 0-7356-2262-0 | ISBN 13: 978-0-7356-2262-3
CPAN Directory INABA Hitoshi
Recent Perl packages by "INABA Hitoshi"
Tokyo-pm archive
Error: Runtime exception on jperl 5.005_03
TSNETWiki
ruby-list
Announcing Perl 7
Perl 7 is coming
A vision for Perl 7 and beyond
On Perl 7 and the Perl Steering Committee
Perl7 and the future of Perl
Perl 7: A Risk-Benefit Analysis
Perl 7 By Default
Perl 7: A Modest Proposal
Perl 7 FAQ
Perl 7, not quite getting better yet
Re: Announcing Perl 7
Changed defaults - Are they best for newbies?
A vision for Perl 7 and beyond
https://web.archive.org/web/20200927044106/https://xdg.me/archive/2020-a-vision-for-perl-7-and-beyond/
Sys::Binmode - A fix for Perl's system call character encoding
File::Glob::Windows - glob routine for Windows environment.
winja - dirty patch for handling pathname on MSWin32::Ja_JP.cp932
Win32::Symlink - Symlink support on Windows
Win32::NTFS::Symlink - Support for NTFS symlinks and junctions on Microsoft Windows
Win32::Symlinks - A maintained, working implementation of Perl symlink built in features for Windows.
TANABATA - The Star Festival - common legend of east asia
ACKNOWLEDGEMENTS
This software was made with reference to the software and documents that the following hackers and persons have made. I am thankful to all of them.
Larry Wall, Perl
Jesse Vincent, Compatibility is a virtue
Kazumasa Utashiro, jcode.pl: Perl library for Japanese character code conversion, Kazumasa Utashiro
Jeffrey E. F. Friedl, Mastering Regular Expressions
SADAHIRO Tomoyuki, Handling of Shift-JIS text correctly using bare Perl
Yukihiro "Matz" Matsumoto, YAPC::Asia2006 Ruby on Perl(s)
jscripter, For jperl users
Bruce., Unicode in Perl
chaichanPaPa, Matching Shift_JIS file name
SUZUKI Norio, Jperl
http://www.dennougedougakkai-ndd.org/alte/3tte/jperl-5.005_03@ap522/homepage2.nifty.com..kipp..perl..jperl..index.html
WATANABE Hirofumi, Jperl
Chuck Houpt, Michiko Nozu, MacJPerl
Kenichi Ishigaki, 31st about encoding; To JPerl users as old men
Fuji, Goro (gfx), Perl Hackers Hub No.16
Dan Kogai, Encode module
Takahashi Masatuyo, JPerl Wiki
Juerd, Perl Unicode Advice
daily dayflower, 2008-06-25 perluniadvice
Unicode issues in Perl
numa's Diary: CSI and UCS Normalization
https://srad.jp/~numa/journal/580177/
Unicode Processing on Windows with Perl
Kaoru Maeda, Perl's history Perl 1,2,3,4
nurse, What is "string"
https://naruse.hateblo.jp/entries/2014/11/07#1415355181
NISHIO Hirokazu, What's meant "string as a sequence of characters"?
Rick Yamashita, Shift_JIS
https://shino.tumblr.com/post/116166805/%E5%B1%B1%E4%B8%8B%E8%89%AF%E8%94%B5%E3%81%A8%E7%94%B3%E3%81%97%E3%81%BE%E3%81%99-%E7%A7%81%E3%81%AF1981%E5%B9%B4%E5%BD%93%E6%99%82us%E3%81%AE%E3%83%9E%E3%82%A4%E3%82%AF%E3%83%AD%E3%82%BD%E3%83%95%E3%83%88%E3%81%A7%E3%82%B7%E3%83%95%E3%83%88jis%E3%81%AE%E3%83%87%E3%82%B6%E3%82%A4%E3%83%B3%E3%82%92%E6%8B%85%E5%BD%93
nurse, History of Japanese EUC 22:00
Mike Whitaker, Perl And Unicode
Ricardo Signes, Perl 5.14 for Pragmatists
Ricardo Signes, What's New in Perl? v5.10 - v5.16
YAP(achimon)C::Asia Hachioji 2016 mid in Shinagawa
Kenichi Ishigaki (@charsbar) July 3, 2016 YAP(achimon)C::Asia Hachioji 2016mid
Causes and countermeasures for garbled Japanese characters in perl
Perl regular expression bug?
Impressions of talking of Larry Wall at LL Future
About Windows and Japanese text
About Windows diagnostic data