NAME

Lingua::JA::NormalizeText - Text Normalizer

SYNOPSIS

  use Lingua::JA::NormalizeText;
  use utf8;

  my @options = ( qw/nfkc decode_entities/, \&dearinsu_to_desu );
  my $normalizer = Lingua::JA::NormalizeText->new(@options);

  print $normalizer->normalize('鳥が㌧㌦でありんす♥');
  # -> 鳥がトンドルです♥

  sub dearinsu_to_desu
  {
      my $text = shift;
      $text =~ s/でありんす/です/g;

      return $text;
  }

# or

  use Lingua::JA::NormalizeText qw/old2new_kanji/;
  use utf8;

  print old2new_kanji('惡の華');
  # -> 悪の華

DESCRIPTION

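Lingua::JA::NormalizeText normalizes Japanese text by applying a sequence of normalization options that you choose, such as Unicode normalization, full-width/half-width conversion, and whitespace cleanup.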

METHODS

new(@options)

Creates a new Lingua::JA::NormalizeText instance.

The following options are available:

  OPTION                 SAMPLE INPUT           OUTPUT FOR SAMPLE INPUT
  ---------------------  ---------------------  -----------------------
  lc                     DdD                    ddd
  uc                     DdD                    DDD
  nfkc                   ㌦                     ドル (length: 2)
  nfkd                   ㌦                     ドル (length: 3)
  nfc
  nfd
  decode_entities        &hearts;               ♥
  strip_html             <em>あ</em>            あ
  alnum_z2h              ＡＢＣ１２３           ABC123
  alnum_h2z              ABC123                 ＡＢＣ１２３
  space_z2h
  space_h2z
  katakana_z2h           ハァハァ               ﾊｧﾊｧ
  katakana_h2z           ｽｰﾊｰｽｰﾊｰ               スーハースーハー
  katakana2hiragana      パンツ                 ぱんつ
  hiragana2katakana      ぱんつ                 パンツ
  wave2tilde             〜                     ～
  tilde2wave             ～                     〜
  wavetilde2long         〜, ～                 ー
  wave2long              〜                     ー
  tilde2long             ～                     ー
  fullminus2long         −                      ー
  dashes2long            —                      ー
  drawing_lines2long     ─                      ー
  unify_long_repeats     ヴァーーー             ヴァー
  nl2space               (LF)(CR)(CRLF)         (space)(space)(space)
  unify_nl               (LF)(CR)(CRLF)         \n\n\n
  unify_long_spaces      あ(space)(space)あ     あ(space)あ
  unify_whitespaces      \x{00A0}               (space)
  trim                   (space)あ(space)あ(space)  あ(space)あ
  ltrim                  (space)あ(space)       あ(space)
  rtrim                  ああ(space)(space)     ああ
  old2new_kana           ゐヰゑヱヸヹ           いイえエイ゙エ゙
  old2new_kanji          亞逸鬭                 亜逸闘
  tab2space              (tab)(tab)             (space)(space)
  remove_controls        あ\x{0000}あ           ああ
  dakuon_normalize       さ\x{3099}             ざ
  handakuon_normalize    は\x{309A}             ぱ
  all_dakuon_normalize   さ\x{3099}は\x{309A}   ざぱ
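
The length difference between the nfkc and nfkd rows comes from composed versus decomposed output. The following sketch reproduces it with the core Unicode::Normalize module directly, on the assumption that these two options behave like its NFKC and NFKD functions:

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw/NFKC NFKD/;

my $input = '㌦';   # U+3326 SQUARE DORU

# NFKC: compatibility decomposition followed by composition.
my $nfkc = NFKC($input);   # U+30C9 (ド), U+30EB (ル)

# NFKD: compatibility decomposition only, so ド stays split into
# U+30C8 (ト) plus the combining voiced sound mark U+3099.
my $nfkd = NFKD($input);   # U+30C8, U+3099, U+30EB

binmode STDOUT, ':encoding(UTF-8)';
printf "nfkc: %s (length: %d)\n", $nfkc, length($nfkc);   # length: 2
printf "nfkd: %s (length: %d)\n", $nfkd, length($nfkd);   # length: 3
```

Both strings look the same when printed; only the underlying code points differ.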

Options are applied in the order in which they appear in @options: the first element is applied first, and the last element is applied last.
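
To see why the order matters, here is a plain-Perl sketch of an ordered pipeline of code references. It is not the module's implementation: apply_in_order and the two sample steps are hypothetical names used only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: apply each code reference to the text, in list order.
sub apply_in_order
{
    my ($text, @steps) = @_;
    $text = $_->($text) for @steps;

    return $text;
}

# Two sample steps whose combined result depends on which one runs first.
my $to_lower   = sub { lc $_[0] };
my $abc_to_xyz = sub { (my $t = $_[0]) =~ s/abc/xyz/g; $t };

# Lowercasing first turns 'ABC' into 'abc', so the substitution matches.
my $lower_first = apply_in_order('ABC', $to_lower, $abc_to_xyz);

# Substituting first sees 'ABC', matches nothing, and only lc takes effect.
my $lower_last = apply_in_order('ABC', $abc_to_xyz, $to_lower);

print "$lower_first\n";   # xyz
print "$lower_last\n";    # abc
```

The same reasoning applies to @options: a step that expects already-normalized input should come after the option that produces it.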

You can also pass references to your own functions as options. (See the dearinsu_to_desu function in the SYNOPSIS section.)

normalize($text)

Normalizes $text by applying the options passed to new() in order, and returns the normalized text.

OPTIONS

dashes2long

Note that this option does not convert hyphens into the long vowel mark (ー).

unify_long_spaces

Note that this option unifies only SPACE (U+0020) and IDEOGRAPHIC SPACE (U+3000).

remove_controls

Note that this option does not remove the following characters:

  CHARACTER TABULATION
  LINE FEED
  CARRIAGE RETURN
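
As a rough illustration of this exclusion, the sketch below deletes control characters while keeping the three listed above. It assumes that "control characters" means the Unicode general category Cc; the module's exact definition may differ, and remove_controls_sketch is a hypothetical name, not part of the module.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the option, assuming "controls" = category Cc.
sub remove_controls_sketch
{
    my $text = shift;

    # Delete any Cc character unless it is TAB, LF, or CR.
    $text =~ s/(?![\t\n\r])\p{Cc}//g;

    return $text;
}

my $cleaned = remove_controls_sketch("a\x{0000}b\tc\r\n");

# NUL is removed; TAB, CR, and LF are kept.
print $cleaned eq "ab\tc\r\n" ? "ok\n" : "not ok\n";   # ok
```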

unify_whitespaces

This option converts the following characters into SPACE (U+0020):

  LINE TABULATION
  FORM FEED
  NEXT LINE
  NO-BREAK SPACE
  OGHAM SPACE MARK
  MONGOLIAN VOWEL SEPARATOR
  EN QUAD
  EM QUAD
  EN SPACE
  EM SPACE
  THREE-PER-EM SPACE
  FOUR-PER-EM SPACE
  SIX-PER-EM SPACE
  FIGURE SPACE
  PUNCTUATION SPACE
  THIN SPACE
  HAIR SPACE
  LINE SEPARATOR
  PARAGRAPH SEPARATOR
  NARROW NO-BREAK SPACE
  MEDIUM MATHEMATICAL SPACE

Note that this option does not convert the following characters:

  CHARACTER TABULATION
  LINE FEED
  CARRIAGE RETURN
  IDEOGRAPHIC SPACE
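
The two lists above can be expressed as a single character class. The sketch below is a plain-Perl approximation built from the Unicode code points of the listed characters; unify_whitespaces_sketch is a hypothetical name and not the module's own code.

```perl
use strict;
use warnings;

# Code points of the characters listed as converted to SPACE (U+0020):
# U+000B, U+000C, U+0085, U+00A0, U+1680, U+180E, U+2000 to U+200A,
# U+2028, U+2029, U+202F, U+205F.
my $converted = qr/[\x{000B}\x{000C}\x{0085}\x{00A0}\x{1680}\x{180E}\x{2000}-\x{200A}\x{2028}\x{2029}\x{202F}\x{205F}]/;

# Hypothetical stand-in for the option.
sub unify_whitespaces_sketch
{
    my $text = shift;
    $text =~ s/$converted/ /g;

    return $text;
}

# NO-BREAK SPACE and EM SPACE become SPACE; TAB and IDEOGRAPHIC SPACE stay.
my $out = unify_whitespaces_sketch("a\x{00A0}b\x{2003}c\td\x{3000}e");

print $out eq "a b c\td\x{3000}e" ? "ok\n" : "not ok\n";   # ok
```

Because CHARACTER TABULATION, LINE FEED, CARRIAGE RETURN, and IDEOGRAPHIC SPACE are left alone, they need other options from the table (for example tab2space, nl2space, or space_z2h) if they should also be normalized.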

AUTHOR

pawa <pawapawa@cpan.org>

SEE ALSO

Table of old and new kanji forms (新旧字体表): http://www.asahi-net.or.jp/~ax2s-kmtn/ref/old_chara.html

Lingua::JA::Regular::Unicode

Lingua::JA::Dakuon

Lingua::JA::Moji

Unicode::Normalize

HTML::Entities

HTML::Scrubber

LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.