Lingua::JA::NormalizeText - Text Normalizer
    use Lingua::JA::NormalizeText;
    use utf8;

    my @options = ( qw/nfkc decode_entities/, \&dearinsu_to_desu );
    my $normalizer = Lingua::JA::NormalizeText->new(@options);

    print $normalizer->normalize('鳥が㌧㌦でありんす&hearts;');
    # -> 鳥がトンドルです♥

    sub dearinsu_to_desu {
        my $text = shift;
        $text =~ s/でありんす/です/g;

        return $text;
    }
# or
    use Lingua::JA::NormalizeText qw/old2new_kanji/;
    use utf8;

    print old2new_kanji('惡の華');
    # -> 悪の華
Lingua::JA::NormalizeText normalizes Japanese text by applying the normalization options you select.
new(@options) creates a new Lingua::JA::NormalizeText instance.
The following options are available:
    OPTION                 SAMPLE INPUT                 OUTPUT FOR SAMPLE INPUT
    ---------------------  ---------------------------  -----------------------
    lc                     DdD                          ddd
    uc                     DdD                          DDD
    nfkc                   ㌦                           ドル (length: 2)
    nfkd                   ㌦                           ドル (length: 3)
    nfc
    nfd
    decode_entities        &hearts;                     ♥
    strip_html             <em>あ</em>                  あ
    alnum_z2h              ＡＢＣ１２３                 ABC123
    alnum_h2z              ABC123                       ＡＢＣ１２３
    space_z2h
    space_h2z
    katakana_z2h           ハァハァ                     ﾊｧﾊｧ
    katakana_h2z           ｽｰﾊｰｽｰﾊｰ                     スーハースーハー
    katakana2hiragana      パンツ                       ぱんつ
    hiragana2katakana      ぱんつ                       パンツ
    wave2tilde             〜                           ～
    tilde2wave             ～                           〜
    wavetilde2long         〜, ～                       ー
    wave2long              〜                           ー
    tilde2long             ～                           ー
    fullminus2long         −                            ー
    dashes2long            —                            ー
    drawing_lines2long     ─                            ー
    unify_long_repeats     ヴァーーー                   ヴァー
    nl2space               (LF)(CR)(CRLF)               (space)(space)(space)
    unify_nl               (LF)(CR)(CRLF)               \n\n\n
    unify_long_spaces      あ(space)(space)あ           あ(space)あ
    unify_whitespaces      \x{00A0}                     (space)
    trim                   (space)あ(space)あ(space)    あ(space)あ
    ltrim                  (space)あ(space)             あ(space)
    rtrim                  ああ(space)(space)           ああ
    old2new_kana           ゐヰゑヱヸヹ                 いイえエイ゙エ゙
    old2new_kanji          亞逸鬭                       亜逸闘
    tab2space              (tab)(tab)                   (space)(space)
    remove_controls        あ\x{0000}あ                 ああ
    remove_spaces          (space)あ(space)あ(space)    ああ
    dakuon_normalize       さ\x{3099}                   ざ
    handakuon_normalize    は\x{309A}                   ぱ
    all_dakuon_normalize   さ\x{3099}は\x{309A}         ざぱ
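The nfkc and nfkd rows look identical when printed; they differ only in the length of the result. The following is a minimal sketch of that difference. It assumes the two options behave like the NFKC and NFKD functions of the core Unicode::Normalize module, which is listed among the related modules but is not confirmed here as the implementation.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw/NFKC NFKD/;

binmode STDOUT, ':encoding(UTF-8)';

my $square_doru = '㌦';    # U+3326 SQUARE DORU

# NFKC expands the square form and keeps ド as a single composed code point.
my $nfkc = NFKC($square_doru);

# NFKD leaves ド decomposed as ト followed by U+3099 (combining dakuten).
my $nfkd = NFKD($square_doru);

printf "nfkc: %s (length: %d)\n", $nfkc, length($nfkc);
printf "nfkd: %s (length: %d)\n", $nfkd, length($nfkd);
```

The extra character in the NFKD result is the combining voiced sound mark, which is why a later dakuon_normalize step may matter when text has been decomposed.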
Options are applied in the order in which they appear in @options: the first element is applied first, and the last element is applied last.
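Because the order is significant, the same options can give different results when rearranged. A small sketch using only the lc and uc options from the table (the variable names are illustrative):

```perl
use strict;
use warnings;
use Lingua::JA::NormalizeText;

# Same two options, opposite order.
my $lower_then_upper = Lingua::JA::NormalizeText->new(qw/lc uc/);
my $upper_then_lower = Lingua::JA::NormalizeText->new(qw/uc lc/);

my $input = 'DdD';

# uc is applied last here, so the result is upper case.
my $upper = $lower_then_upper->normalize($input);

# lc is applied last here, so the result is lower case.
my $lower = $upper_then_lower->normalize($input);

print "$upper\n";
print "$lower\n";
```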
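You can also pass references to your own functions. Each function receives the text as its first argument and returns the modified text. (See the dearinsu_to_desu function in the SYNOPSIS section.)
normalize($text) normalizes $text with the options given to new and returns the result.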
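As a further sketch, a custom step can be mixed with built-in options. The example assumes that an anonymous code reference is accepted in the same way as the \&dearinsu_to_desu reference in the synopsis; $squeeze_bangs is a hypothetical helper, not part of the module.

```perl
use strict;
use warnings;
use utf8;
use Lingua::JA::NormalizeText;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: collapse runs of exclamation marks into one.
my $squeeze_bangs = sub {
    my $text = shift;
    $text =~ s/!{2,}/!/g;
    return $text;
};

# trim is applied first, then the custom function.
my $normalizer = Lingua::JA::NormalizeText->new('trim', $squeeze_bangs);

my $result = $normalizer->normalize('  すごい!!!  ');
print "$result\n";
```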
Note that dashes2long does not convert hyphens into the long vowel mark (ー).
Note that unify_long_spaces unifies only SPACE (U+0020) and IDEOGRAPHIC SPACE (U+3000).
Note that remove_controls does not remove the following characters:

    CHARACTER TABULATION
    LINE FEED
    CARRIAGE RETURN
Note that remove_spaces removes only SPACE (U+0020) and IDEOGRAPHIC SPACE (U+3000).
The unify_whitespaces option converts the following characters into SPACE (U+0020):

    LINE TABULATION
    FORM FEED
    NEXT LINE
    NO-BREAK SPACE
    OGHAM SPACE MARK
    MONGOLIAN VOWEL SEPARATOR
    EN QUAD
    EM QUAD
    EN SPACE
    EM SPACE
    THREE-PER-EM SPACE
    FOUR-PER-EM SPACE
    SIX-PER-EM SPACE
    FIGURE SPACE
    PUNCTUATION SPACE
    THIN SPACE
    HAIR SPACE
    LINE SEPARATOR
    PARAGRAPH SEPARATOR
    NARROW NO-BREAK SPACE
    MEDIUM MATHEMATICAL SPACE
Note that unify_whitespaces does not convert the following characters:

    CHARACTER TABULATION
    LINE FEED
    CARRIAGE RETURN
    IDEOGRAPHIC SPACE
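The two character lists above can be illustrated with a single transliteration. This is only a stand-in written in core Perl under the assumption that the lists describe the unify_whitespaces option (the table maps \x{00A0} to a space for it). It is not the module's own code, and unify_whitespaces_sketch is a hypothetical name.

```perl
use strict;
use warnings;

# Illustrative stand-in: map each listed character to SPACE (U+0020).
# U+2000 to U+200A covers EN QUAD through HAIR SPACE.
sub unify_whitespaces_sketch {
    my $text = shift;

    $text =~ tr/\x{000B}\x{000C}\x{0085}\x{00A0}\x{1680}\x{180E}\x{2000}-\x{200A}\x{2028}\x{2029}\x{202F}\x{205F}/ /;

    return $text;
}

# NO-BREAK SPACE and EM SPACE are converted;
# CHARACTER TABULATION and IDEOGRAPHIC SPACE are left alone.
my $input  = "a\x{00A0}b\x{2003}c\td\x{3000}e";
my $output = unify_whitespaces_sketch($input);

# Print code points so the invisible characters can be inspected.
printf 'U+%04X ', ord($_) for split //, $output;
print "\n";
```

To handle the excluded characters as well, the table suggests combining this option with others such as tab2space, nl2space, or space_z2h.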
pawa <pawapawa@cpan.org>
Table of old and new kanji forms (新旧字体表): http://www.asahi-net.or.jp/~ax2s-kmtn/ref/old_chara.html
Lingua::JA::Regular::Unicode
Lingua::JA::Dakuon
Lingua::JA::Moji
Unicode::Normalize
HTML::Entities
HTML::Scrubber
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.