ShiftJIS::String - functions to manipulate Shift_JIS encoded strings
use ShiftJIS::String; ShiftJIS::String::substr($str, ShiftJIS::String::index($str, $substr));
This POD is written in Shift_JIS encoding.
Do you see 'あ' as HIRAGANA LETTER A, and '\' as YEN SIGN rather than REVERSE SOLIDUS? If not, change your font to an appropriate one (or the POD may have been badly converted).
This module provides functions that emulate the corresponding CORE functions and help you manipulate multibyte character sequences in Shift_JIS encoding.
* 'Hankaku' and 'Zenkaku' mean 'halfwidth' and 'fullwidth' characters in Japanese, respectively.
issjis(LIST)
Returns a boolean indicating whether all the strings in the parameter list are legally encoded in Shift_JIS.
length(STRING)
Returns the length in characters of the supplied string.
strrev(STRING)
Returns a reversed string (having all characters in the opposite order).
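As a rough illustration of what character semantics means for length and strrev, the following sketch splits a byte string into Shift_JIS characters using the legal-character pattern given later in this document. The helper names sjis_chars, sjis_length and sjis_reverse are hypothetical and are not part of this module; this is not how the module is implemented.

```perl
use strict;
use warnings;

# One legal Shift_JIS character: a single byte, or a lead byte plus a trailing byte.
my $SJIS_CHAR = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Hypothetical helpers, for illustration only.
sub sjis_chars   { my ($s) = @_; return $s =~ /\G($SJIS_CHAR)/g }
sub sjis_length  { return scalar(() = sjis_chars($_[0])) }
sub sjis_reverse { return join '', reverse sjis_chars($_[0]) }

# "Perl" followed by HIRAGANA LETTER A (0x82A0) and HIRAGANA LETTER I (0x82A2).
my $str = "Perl\x82\xA0\x82\xA2";

print CORE::length($str), "\n";   # 8 (bytes)
print sjis_length($str), "\n";    # 6 (characters)
print sjis_reverse($str) eq "\x82\xA2\x82\xA0lreP" ? "ok\n" : "not ok\n";
```

A plain CORE::reverse on the same string would swap the two bytes of each double-byte character and produce an illegal string.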
index(STRING, SUBSTR)
index(STRING, SUBSTR, POSITION)
Returns the position of the first occurrence of SUBSTR in STRING at or after POSITION. If POSITION is omitted, starts searching from the beginning of the string.
If the substring is not found, returns -1.
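To see why a byte-oriented search is not enough, consider a double-byte character whose trailing byte happens to be an ASCII letter. The sketch below compares CORE::index with a hypothetical character-aware helper, sjis_index, which only accepts matches that begin at a character boundary. It is an illustration under these assumptions, not the module's code.

```perl
use strict;
use warnings;

my $SJIS_CHAR = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Hypothetical helper: position (in characters) of $sub in $str
# at or after character position $pos, or -1 if not found.
sub sjis_index {
    my ($str, $sub, $pos) = @_;
    $pos = 0 if !defined $pos || $pos < 0;
    my @chars  = $str =~ /\G($SJIS_CHAR)/g;
    my $sublen = CORE::length($sub);
    my $offset = 0;                       # byte offset of the current character
    for my $i (0 .. $#chars) {
        return $i
            if $i >= $pos && CORE::substr($str, $offset, $sublen) eq $sub;
        $offset += CORE::length($chars[$i]);
    }
    return -1;
}

# KATAKANA LETTER A is 0x8341; its trailing byte 0x41 is also ASCII 'A'.
my $str = "\x83\x41BC";

print CORE::index($str, "A"), "\n";   # 1: a false match inside the double-byte character
print sjis_index($str, "A"), "\n";    # -1: the string contains no character 'A'
print sjis_index($str, "B"), "\n";    # 1: 'B' is the second character
```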
rindex(STRING, SUBSTR)
rindex(STRING, SUBSTR, POSITION)
Returns the position of the last occurrence of SUBSTR in STRING. If POSITION is specified, returns the last occurrence at or before that position.
strspn(STRING, SEARCHLIST)
Returns the position of the first occurrence of any character not contained in the search list.
strspn("+0.12345*12", "+-.0123456789"); # returns 8.
If the specified string does not contain any character in the search list, returns 0.
If the string consists only of characters in the search list, the returned value equals the length of the string.
strcspn(STRING, SEARCHLIST)
Returns the position of the first occurrence of any character contained in the search list.
strcspn("Perlは面白い。", "赤青黄白黒"); # returns 6.
If the specified string does not contain any character in the search list, the returned value equals the length of the string.
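Both functions amount to counting the leading characters that are (or are not) members of the search list. A minimal sketch of that idea follows; sjis_span is a hypothetical helper written for this illustration, and it does not support character ranges.

```perl
use strict;
use warnings;

my $SJIS_CHAR = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Hypothetical helper: count the leading characters of $str whose membership
# in $list equals $want (1 behaves like strspn, 0 like strcspn).
sub sjis_span {
    my ($str, $list, $want) = @_;
    my %in_list = map { ($_ => 1) } $list =~ /\G($SJIS_CHAR)/g;
    my $n = 0;
    for my $ch ($str =~ /\G($SJIS_CHAR)/g) {
        last if ($in_list{$ch} ? 1 : 0) != $want;
        $n++;
    }
    return $n;
}

print sjis_span("+0.12345*12", "+-.0123456789", 1), "\n";       # 8

# "Perl", HIRAGANA LETTER HA (0x82CD), then the kanji 0x96CA, which is in the list.
print sjis_span("Perl\x82\xCD\x96\xCA", "\x96\xCA", 0), "\n";   # 5
```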
substr(STRING or SCALAR REF, OFFSET)
substr(STRING or SCALAR REF, OFFSET, LENGTH)
substr(SCALAR, OFFSET, LENGTH, REPLACEMENT)
It works like CORE::substr, but using character semantics of Shift_JIS encoding.
If REPLACEMENT is specified as the fourth parameter, replaces that part of SCALAR and returns what was there before.
If a reference to a scalar variable is used as the first argument, an lvalue reference is returned, and you can assign through it.
${ &substr(\$str,$off,$len) } = $replace;
works like
CORE::substr($str,$off,$len) = $replace;
The returned lvalue is not Shift_JIS-oriented but byte-oriented, so successive assignments may cause unexpected results.
$str = "0123456789";
$lval = &substr(\$str,3,1);
$$lval = "あい";
$$lval = "a";
# $str is NOT "012aい456789", but an illegal string "012a\xA0い456789".
strsplit(SEPARATOR, STRING)
strsplit(SEPARATOR, STRING, LIMIT)
This function emulates CORE::split, but splits on the SEPARATOR string, not on a pattern. In scalar context, it only returns the number of fields found and does not split into the @_ array.
strsplit('||', '||あいうえお||パピプペポ||01234||'); # ('', 'あいうえお', 'パピプペポ', '01234')
strsplit('／', 'Perl／駱駝／Camel'); # ('Perl', '駱駝', 'Camel')
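The first example above is not trivial for a byte-oriented split, because the trailing byte of KATAKANA LETTER PO (0x837C) is '|'. The sketch below shows the difference with a hypothetical helper, sjis_split, which looks for the separator only at character boundaries. It assumes a non-empty separator and ignores LIMIT; it is not the module's implementation.

```perl
use strict;
use warnings;

my $SJIS_CHAR = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Hypothetical helper: split $str on the literal, non-empty string $sep.
sub sjis_split {
    my ($sep, $str) = @_;
    my @fields = ('');
    pos($str) = 0;
    while ($str =~ /\G(?:(\Q$sep\E)|($SJIS_CHAR))/gc) {
        if (defined $1) { push @fields, '' }     # separator: start a new field
        else            { $fields[-1] .= $2 }    # ordinary character
    }
    pop @fields while @fields && $fields[-1] eq '';   # drop trailing empty fields
    return @fields;
}

# 'A', KATAKANA LETTER PO (0x837C), the separator '||', then 'B'.
my $str = "A\x83\x7C||B";

my @bytewise = CORE::split /\|\|/, $str;   # ("A\x83", "|B"): the character is cut in half
my @charwise = sjis_split('||', $str);     # ("A\x83\x7C", "B")
print scalar(@charwise), "\n";             # 2
```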
If an empty string is specified as SEPARATOR, splits the specified string into characters (similarly to CORE::split //, STRING, LIMIT).
strsplit('', 'This is Perl.', 7); # ('T', 'h', 'i', 's', ' ', 'i', 's Perl.')
If an undefined value is specified as SEPARATOR, splits the specified string on whitespace characters (including IDEOGRAPHIC SPACE). Leading whitespace characters do not produce any field (similarly to CORE::split ' ', STRING, LIMIT).
strsplit(undef, ' 　 This is 　 Perl.'); # ('This', 'is', 'Perl.')
strcmp(LEFT-STRING, RIGHT-STRING)
Returns 1 (when LEFT-STRING is greater than RIGHT-STRING), 0 (when LEFT-STRING is equal to RIGHT-STRING), or -1 (when LEFT-STRING is less than RIGHT-STRING).
The order is roughly as follows:
JIS X 0201 Roman, JIS X 0201 Kana, then JIS X 0208 Kanji (Zenkaku).
For example, 0x41 ('A') is less than 0xB1 ('ｱ' HANKAKU KATAKANA A). 0xB1 ('ｱ') is less than 0x8341 ('ア' KATAKANA A). 0x8341 ('ア') is less than 0x8383 ('ャ' KATAKANA SMALL YA). 0x8383 ('ャ') is less than 0x83B1 ('Τ' GREEK CAPITAL TAU).
Caveat! Compare the 2nd and the 4th examples. Byte "\xB1" is less than byte "\x83" when "\x83" is a leading byte, but greater when "\x83" is a trailing byte. In short, plain binary ordering does not agree with the Shift_JIS code point order.
strEQ(LEFT-STRING, RIGHT-STRING)
Returns a boolean indicating whether LEFT-STRING is equal to RIGHT-STRING.
Note: strEQ is an expensive equivalent of CORE's eq operator.
strNE(LEFT-STRING, RIGHT-STRING)
Returns a boolean indicating whether LEFT-STRING is not equal to RIGHT-STRING.
Note: strNE is an expensive equivalent of CORE's ne operator.
strLT(LEFT-STRING, RIGHT-STRING)
Returns a boolean indicating whether LEFT-STRING is less than RIGHT-STRING.
strLE(LEFT-STRING, RIGHT-STRING)
Returns a boolean indicating whether LEFT-STRING is less than or equal to RIGHT-STRING.
strGT(LEFT-STRING, RIGHT-STRING)
Returns a boolean indicating whether LEFT-STRING is greater than RIGHT-STRING.
strGE(LEFT-STRING, RIGHT-STRING)
Returns a boolean indicating whether LEFT-STRING is greater than or equal to RIGHT-STRING.
strxfrm(STRING)
Returns a string transformed so that CORE's cmp can be used for binary comparisons (NOT the length of the transformed string).
I.e. strxfrm($a) cmp strxfrm($b) is equivalent to strcmp($a, $b), as long as your cmp doesn't use any locale other than that of Perl.
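One way such a transformation could work is to pad every single-byte character with a leading NUL byte, so that each character occupies two bytes in the key and a bytewise cmp then follows the code point order. The sketch below does exactly that. The helper sjis_sortkey is hypothetical, and whether the module's strxfrm uses this particular transformation is an assumption that the documentation does not confirm.

```perl
use strict;
use warnings;

my $SJIS_CHAR = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Hypothetical transform: every character becomes exactly two bytes in the key.
sub sjis_sortkey {
    my ($str) = @_;
    return join '', map { CORE::length($_) == 1 ? "\x00$_" : $_ }
                    $str =~ /\G($SJIS_CHAR)/g;
}

my $hankaku_a = "\xB1";        # HANKAKU KATAKANA A
my $small_ya  = "\x83\x83";    # KATAKANA SMALL YA

# Plain byte comparison puts 0xB1 after 0x8383, but the padded keys
# give the code point order described for strcmp.
my $byte_order = $hankaku_a cmp $small_ya;
my $key_order  = sjis_sortkey($hankaku_a) cmp sjis_sortkey($small_ya);
print "$byte_order $key_order\n";   # prints "1 -1"

# Sorting with precomputed keys: 'A', 0xB1, 0x8341, 0x8383, 0x83B1.
my @sorted = sort { sjis_sortkey($a) cmp sjis_sortkey($b) }
             ("\x83\xB1", "\x83\x83", "\xB1", "\x83\x41", "A");
```

Computing the key once per string and sorting on the keys avoids repeating the character-by-character comparison for every pair.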
mkrange(EXPR, EXPR)
Returns the list of characters obtained by parsing the specified character range (in scalar context, returns them as a concatenated string).
A character range is specified with a HYPHEN-MINUS, '-'. The backslashed combinations '\-' and '\\' are used instead of the characters '-' and '\', respectively. The hyphen at the beginning or end of the range is also evaluated as the hyphen itself.
For example, mkrange('+\-0-9a-fA-F') returns ('+', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f', 'A', 'B', 'C', 'D', 'E', 'F') and scalar mkrange('か-ご') returns 'かがきぎくぐけげこご'.
The order of Shift_JIS characters is: 0x00 .. 0x7F, 0xA1 .. 0xDF, 0x8140 .. 0x9FFC, 0xE040 .. 0xFCFC. So, mkrange('亜-腕') returns the list of all characters in level 1 Kanji.
If a true value is specified as the second parameter, reverse character ranges such as '9-0' and 'Z-A' can be used; otherwise, a reverse character range causes the function to croak.
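The expansion of a single ascending range can be sketched by walking the numeric code points between the two endpoints and keeping only the values that are legal Shift_JIS characters. The helper sjis_expand_range is hypothetical and handles neither escapes nor reverse ranges; it only illustrates the character order given above.

```perl
use strict;
use warnings;

# Hypothetical helper: expand the range between two Shift_JIS characters
# (ascending order only).
sub sjis_expand_range {
    my ($from, $to) = @_;
    # character -> numeric code point, e.g. "\x82\xA9" -> 0x82A9
    my ($lo, $hi) = map { hex(unpack('H*', $_)) } $from, $to;
    my @chars;
    for my $code ($lo .. $hi) {
        my $ch = $code > 0xFF ? pack('n', $code) : chr($code);
        # keep only values that form a legal Shift_JIS character
        push @chars, $ch
            if $ch =~ /\A(?:[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC])\z/;
    }
    return @chars;
}

my @ka_to_go = sjis_expand_range("\x82\xA9", "\x82\xB2");   # HIRAGANA KA .. GO
print scalar(@ka_to_go), "\n";                              # 10

my @katakana = sjis_expand_range("\x83\x40", "\x83\x96");   # 0x8340 .. 0x8396
print scalar(@katakana), "\n";                              # 86: 0x837F is skipped
```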
strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST)
strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST, MODIFIER)
strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST, MODIFIER, PATTERN)
strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST, MODIFIER, PATTERN, TOPATTERN)
Transliterates all occurrences of the characters found in the search list with the corresponding character in the replacement list.
If a reference to a scalar variable is specified as the first argument, returns the number of characters replaced or deleted; otherwise, returns the transliterated string and the specified string is unaffected.
$str = "なんといおうか";
print strtr(\$str,"あいうえお", "アイウエオ"), " ", $str; # output: 3 なんとイオウか
$str = "後門の狼。";
print strtr($str,"後狼。", "前虎、"), $str; # output: 前門の虎、後門の狼。
SEARCHLIST and REPLACEMENTLIST
Character ranges such as "ぁ-お" (internally utilizing mkrange()) are supported.
If the REPLACEMENTLIST is empty (specified as '', not undef, because the use of uninitialized value causes warning under -w option), the SEARCHLIST is replicated.
If the replacement list is shorter than the search list, the final character in the replacement list is replicated until it is long enough (but this works differently when the 'd' modifier is used).
strtr(\$str, 'ぁ-んァ-ヶｦ-ﾟ', '#'); # replaces all Kana letters by '#'.
MODIFIER
c   Complement the SEARCHLIST.
d   Delete found but unreplaced characters.
s   Squash duplicate replaced characters.
R   No use of character ranges.
r   Allow reverse character ranges.
o   Cache the conversion table internally.
strtr(\$str, 'ぁ-んァ-ヶｦ-ﾟ', ''); # counts all Kana letters in $str.
$onlykana = strtr($str, 'ぁ-んァ-ヶｦ-ﾟ', '', 'cd'); # deletes all characters except Kana letters.
strtr(\$str, " \x81\x40\n\r\t\f", '', 'd'); # deletes all whitespace characters including IDEOGRAPHIC SPACE.
strtr("おかかうめぼし　ちちとはは", 'ぁ-ん', '', 's'); # output: おかうめぼし　ちとは
strtr("条件演算子の使いすぎは見苦しい", 'ぁ-ん', '＃', 'cs'); # output: ＃の＃いすぎは＃しい
If the 'R' modifier is specified, '-' is not treated as a metacharacter but as HYPHEN-MINUS itself, as in tr'''. Compare:
strtr("90 - 32 = 58", "0-9", "A-J"); # output: "JA - DC = FI"
strtr("90 - 32 = 58", "0-9", "A-J", "R"); # output: "JA - 32 = 58"
# cf. ($str = "90 - 32 = 58") =~ tr'0-9'A-J'; # '0' to 'A', '-' to '-', and '9' to 'J'.
If 'r' modifier is specified, you are allowed to use reverse character ranges. For example, strtr($str, "0-9", "9-0", "r") is equivalent to strtr($str, "0123456789", "9876543210").
strtr($text, '亜-腕', '腕-亜', "r"); # Your text may seem to be clobbered.
PATTERN and TOPATTERN
By use of PATTERN and TOPATTERN, you can transliterate the string using lists containing some multi-character substrings.
If called with four arguments, SEARCHLIST, REPLACEMENTLIST and STRING are split character by character.
If called with five arguments, a multi-character substring that matches PATTERN in SEARCHLIST, REPLACEMENTLIST or STRING is regarded as a transliteration unit.
If both PATTERN and TOPATTERN are specified, a multi-character substring that either matches PATTERN in SEARCHLIST or STRING, or matches TOPATTERN in REPLACEMENTLIST, is regarded as a transliteration unit.
print strtr(
"Caesar Aether Goethe",
"aeoeueAeOeUe",
"&auml;&ouml;&uuml;&Auml;&Ouml;&Uuml;",
"",
"[aouAOU]e",
"&[aouAOU]uml;");
# output: C&auml;sar &Auml;ther G&ouml;the
LIST as Anonymous Array
Instead of specifying PATTERN and TOPATTERN, you can use anonymous arrays as SEARCHLIST and/or REPLACEMENTLIST as follows.
print strtr( "Caesar Aether Goethe", [qw/ae oe ue Ae Oe Ue/], [qw/&auml; &ouml; &uuml; &Auml; &Ouml; &Uuml;/] );
Caching the conversion table
If 'o' modifier is specified, the conversion table is cached internally. e.g.
foreach (@hiragana_strings) {
print strtr($_, 'ぁ-ん', 'ァ-ン', 'o');
}
# katakana strings are printed
will be almost as efficient as this:
$hiragana_to_katakana = trclosure('ぁ-ん', 'ァ-ン');
foreach (@hiragana_strings) {
print &$hiragana_to_katakana($_);
}
You can use whichever you like.
Without 'o',
foreach (@hiragana_strings) {
print strtr($_, 'ぁ-ん', 'ァ-ン');
}
will be very slow, since the conversion table is rebuilt every time the function is called.
trclosure(SEARCHLIST, REPLACEMENTLIST)
trclosure(SEARCHLIST, REPLACEMENTLIST, MODIFIER)
trclosure(SEARCHLIST, REPLACEMENTLIST, MODIFIER, PATTERN)
trclosure(SEARCHLIST, REPLACEMENTLIST, MODIFIER, PATTERN, TOPATTERN)
Returns a closure that transliterates the specified string. The return value is just a code reference, not a blessed object. Using this code reference saves time, as you need not specify the parameter list every time.
my $digit_tr = trclosure("1234567890-", "一二三四五六七八九〇－");
print &$digit_tr ("TEL ：0124-45-6789\n"); # ok to perl 5.003
print $digit_tr->("FAX ：0124-51-5368\n"); # perl 5.004 or better
# output:
# TEL ：〇一二四－四五－六七八九
# FAX ：〇一二四－五一－五三六八
The functionality of the closure made by trclosure() is equivalent to that of strtr(). In fact, strtr() calls trclosure() internally and uses the returned closure.
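The benefit of a closure is that the conversion table is built once and then reused for every call. A minimal sketch of that pattern is shown below. The helper make_translator is hypothetical; it supports neither ranges nor modifiers, and it assumes both lists contain the same number of characters.

```perl
use strict;
use warnings;

my $SJIS_CHAR = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Hypothetical helper: build the table once and return a closure over it.
sub make_translator {
    my ($search, $replace) = @_;
    my @from = $search  =~ /\G($SJIS_CHAR)/g;
    my @to   = $replace =~ /\G($SJIS_CHAR)/g;
    my %table;
    @table{@from} = @to;
    return sub {
        my ($str) = @_;
        return join '', map { exists $table{$_} ? $table{$_} : $_ }
                        $str =~ /\G($SJIS_CHAR)/g;
    };
}

# Halfwidth digits to fullwidth digits (0x824F .. 0x8258).
my $to_zenkaku = make_translator(
    "0123456789",
    "\x82\x4F\x82\x50\x82\x51\x82\x52\x82\x53\x82\x54\x82\x55\x82\x56\x82\x57\x82\x58",
);

print $to_zenkaku->("A1") eq "A\x82\x50" ? "ok\n" : "not ok\n";
```

Because the string is walked character by character, the trailing byte 0x50 ('P') of a fullwidth digit is never mistaken for a separate character.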
toupper(STRING or SCALAR REF)
Returns an uppercased string of STRING. Converts only half-width Latin characters a-z to A-Z.
If a reference to a scalar variable is specified as the first argument, the string it refers to is uppercased and the number of characters replaced is returned.
tolower(STRING or SCALAR REF)
Returns a lowercased string of STRING. Converts only half-width Latin characters A-Z to a-z.
If a reference to a scalar variable is specified as the first argument, the string it refers to is lowercased and the number of characters replaced is returned.
For the conversion functions below, if a reference to a scalar variable is specified as the first argument, the string it refers to is converted and the number of characters replaced is returned. Otherwise, the converted string is returned and the specified string is unaffected.
Note: The conversion between a voiced (or semi-voiced) katakana (or hiragana), such as 'ガ', 'パ', and hankaku katakana with a voiced mark or a semi-voiced mark, such as 'ｶﾞ', 'ﾊﾟ', is counted as 1. Similarly, the conversion between zenkaku hiragana 'う゛' and zenkaku katakana 'ヴ' is counted as 1.
kanaH2Z(STRING or SCALAR REF)
kataH2Z(STRING or SCALAR REF)
Converts Hankaku Katakana to Zenkaku Katakana
Note: kataH2Z is an alias of kanaH2Z.
kataZ2H(STRING or SCALAR REF)
Converts Zenkaku Katakana to Hankaku Katakana
kanaZ2H(STRING or SCALAR REF)
Converts Zenkaku Hiragana and Katakana to Hankaku Katakana
hiXka(STRING or SCALAR REF)
Converts Zenkaku Hiragana to Zenkaku Katakana and Zenkaku Katakana to Zenkaku Hiragana at once.
hi2ka(STRING or SCALAR REF)
Converts Zenkaku Hiragana to Zenkaku Katakana
ka2hi(STRING or SCALAR REF)
Converts Zenkaku Katakana to Zenkaku Hiragana
spaceH2Z(STRING or SCALAR REF)
Converts space (half-width) to ideographic space (full-width) in the specified string and returns the converted string.
spaceZ2H(STRING or SCALAR REF)
Converts ideographic space (full-width) to space (half-width) in the specified string and returns the converted string.
A legal Shift_JIS character in this module must match the following regular expression:
[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]
Any string from an external source should be checked with the issjis() function, unless you know for certain that it is encoded in Shift_JIS.
Use of an illegal Shift_JIS string may lead to odd results.
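A whole-string check can be built directly from the pattern above by requiring that the string consist of zero or more legal characters. The function name looks_like_sjis is hypothetical; the sketch only illustrates the rule and is not the module's issjis().

```perl
use strict;
use warnings;

# Hypothetical check: the whole string must be a sequence of legal characters.
sub looks_like_sjis {
    my ($str) = @_;
    return $str =~ /\A(?:[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC])*\z/
        ? 1 : 0;
}

print looks_like_sjis("Perl\x82\xA0"), "\n";   # 1: ASCII plus a complete double-byte character
print looks_like_sjis("Perl\x82"),     "\n";   # 0: leading byte without a trailing byte
print looks_like_sjis("\x82\x7F"),     "\n";   # 0: 0x7F is not a legal trailing byte
```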
Some Shift_JIS double-byte characters have a trailing byte in the range of [\x40-\x7E], viz.,
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
The Perl lexer (perhaps) takes no special care of these bytes, so they sometimes cause trouble. For example, the quoted literal "表" (0x955C) causes a fatal error, since its trailing byte 0x5C escapes the closing quote.
Such a problem doesn't arise when the string is obtained from an external source. But writing a script that contains Shift_JIS double-byte characters requires great care.
The use of single-quoted heredoc, << '', or \xhh meta characters is recommended in order to define a Shift_JIS string literal.
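Both recommendations can be shown in a script that contains only ASCII bytes. This is a minimal sketch; the variable names are chosen for illustration only.

```perl
use strict;
use warnings;

# 0x955C is a double-byte character whose trailing byte is 0x5C (backslash).
# Written with \xhh escapes, the source contains only ASCII, so nothing
# inside the literal can escape the closing quote.
my $hyou = "\x95\x5C";

# A single-quoted heredoc takes its body literally: backslashes and '$'
# have no special meaning there, so a raw 0x5C byte would be harmless.
my $from_heredoc = <<'EOT';
raw text: \x95\x5C $not_a_variable
EOT

print CORE::length($hyou), "\n";                                # 2 (bytes)
print CORE::substr($hyou, 1, 1) eq "\\" ? "ok\n" : "not ok\n";  # trailing byte is a backslash
```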
The safe ASCII-graphic characters, [\x21-\x3F], are:
!"#$%&'()*+,-./0123456789:;<=>?
They are preferred as the delimiter of quote-like operators.
This library supposes $[ is always equal to 0, never 1.
The functions provided by this library use many regexp operations. Therefore, the values of $1 etc. may be changed or discarded unexpectedly. Save them in variables of your own before calling these functions.
Tomoyuki SADAHIRO
bqw10602@nifty.com
http://homepage1.nifty.com/nomenclator/perl/
Copyright(C) 2001-2002, SADAHIRO Tomoyuki. Japan. All rights reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.