NAME

Lingua::JA::NormalizeText - Text Normalizer

SYNOPSIS

  use Lingua::JA::NormalizeText;
  use utf8;
  binmode STDOUT, ':encoding(UTF-8)';  # avoids "Wide character" warnings on print

  my @options = ( qw/nfkc decode_entities/, \&dearinsu_to_desu );
  my $normalizer = Lingua::JA::NormalizeText->new(@options);

  print $normalizer->normalize('鳥が㌧㌦でありんす&hearts;');
  # -> 鳥がトンドルです♥

  sub dearinsu_to_desu
  {
      my $text = shift;
      $text =~ s/でありんす/です/g;

      return $text;
  }

# or

  use Lingua::JA::NormalizeText qw/old2new_kanji/;
  use utf8;

  print old2new_kanji('惡の華');
  # -> 悪の華

DESCRIPTION

Lingua::JA::NormalizeText normalizes Japanese text by applying a sequence of normalization options that you choose.

METHODS

new(@options)

Creates a new Lingua::JA::NormalizeText instance.

The following options are available:

  OPTION                 SAMPLE INPUT           OUTPUT FOR SAMPLE INPUT
  ---------------------  ---------------------  -----------------------
  lc                     DdD                    ddd
  uc                     DdD                    DDD
  nfkc                   ㌦                     ドル (length: 2)
  nfkd                   ㌦                     ドル (length: 3)
  nfc
  nfd
  decode_entities        &hearts;               ♥
  strip_html             <em>あ</em>            あ
  alnum_z2h              ＡＢＣ１２３           ABC123
  alnum_h2z              ABC123                 ＡＢＣ１２３
  space_z2h
  space_h2z
  katakana_z2h           ハァハァ               ﾊｧﾊｧ
  katakana_h2z           ｽｰﾊｰｽｰﾊｰ               スーハースーハー
  katakana2hiragana      パンツ                 ぱんつ
  hiragana2katakana      ぱんつ                 パンツ
  wave2tilde             〜                     ～
  tilde2wave             ～                     〜
  wavetilde2long         〜, ～                 ー
  wave2long              〜                     ー
  tilde2long             ～                     ー
  fullminus2long         −                      ー
  dashes2long            —                      ー
  drawing_lines2long     ─                      ー
  unify_long_repeats     ヴァーーー             ヴァー
  nl2space               (LF)(CR)(CRLF)         (space)(space)(space)
  unify_nl               (LF)(CR)(CRLF)         \n\n\n
  unify_long_spaces      あ(space)(space)あ     あ(space)あ
  unify_whitespaces      \x{00A0}               (space)
  trim                   (space)あ(space)あ(space)  あ(space)あ
  ltrim                  (space)あ(space)       あ(space)
  rtrim                  ああ(space)(space)     ああ
  old2new_kana           ゐヰゑヱヸヹ           いイえエイ゙エ゙
  old2new_kanji          亞逸鬭                 亜逸闘
  tab2space              (tab)(tab)             (space)(space)
  remove_controls        あ\x{0000}あ           ああ
  remove_spaces          (space)あ(space)あ(space)  ああ
  dakuon_normalize       さ\x{3099}             ざ
  handakuon_normalize    は\x{309A}             ぱ
  all_dakuon_normalize   さ\x{3099}は\x{309A}   ざぱ
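
The length annotations on the nfkc and nfkd rows come from Unicode composition. The source does not say how these options are implemented, so the following is only a sketch that assumes they behave like the NFKC and NFKD functions of the core Unicode::Normalize module. Under that assumption the difference can be checked directly:

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw/NFKC NFKD/;

my $square_doru = '㌦';                 # U+3326 SQUARE DORU

# NFKC expands the square character and keeps ド precomposed.
my $composed   = NFKC($square_doru);    # ド + ル

# NFKD also splits ド into ト plus U+3099 (combining voiced sound mark).
my $decomposed = NFKD($square_doru);    # ト + U+3099 + ル

my $nfkc_length = length $composed;     # 2
my $nfkd_length = length $decomposed;   # 3
```

Both results look the same when rendered, which is why the table shows the character count next to each.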

The options are applied in the order in which they appear in @options: the first element is applied first, and the last element is applied last.

You can also pass references to your own functions. (See the dearinsu_to_desu function in the SYNOPSIS section.)
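
The effect of ordering can be seen in a small sketch that mimics the pipeline with plain code references. It does not use the module itself: apply_in_order and doru_to_dollar are hypothetical names introduced for illustration, and NFKC from Unicode::Normalize stands in for the nfkc option.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw/NFKC/;

# Hypothetical helper: applies each code reference to the text in list order.
sub apply_in_order
{
    my ($text, @steps) = @_;
    $text = $_->($text) for @steps;
    return $text;
}

# A user-defined step, analogous to dearinsu_to_desu in the SYNOPSIS.
sub doru_to_dollar
{
    my $text = shift;
    $text =~ s/ドル/dollar/g;
    return $text;
}

# Stand-in for the nfkc option.
my $nfkc = sub { my $text = shift; return NFKC($text) };

# ㌦ is expanded to ドル first, so the substitution matches.
my $expanded_first = apply_in_order('㌦', $nfkc, \&doru_to_dollar);  # 'dollar'

# The substitution runs while the text is still ㌦, so nothing is replaced.
my $replaced_first = apply_in_order('㌦', \&doru_to_dollar, $nfkc);  # 'ドル'
```

A custom function that matches on normalized text should therefore come after the option that produces that form.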

normalize($text)

Normalizes $text and returns the normalized text.

OPTIONS

dashes2long

Note that this option does not convert hyphens into the long vowel mark (ー).

unify_long_spaces

Note that this option unifies only SPACE (U+0020) and IDEOGRAPHIC SPACE (U+3000).

remove_controls

Note that this option does not remove the following characters:

  CHARACTER TABULATION
  LINE FEED
  CARRIAGE RETURN

remove_spaces

Note that this option removes only SPACE (U+0020) and IDEOGRAPHIC SPACE (U+3000).

unify_whitespaces

This option converts the following characters into SPACE (U+0020):

  LINE TABULATION
  FORM FEED
  NEXT LINE
  NO-BREAK SPACE
  OGHAM SPACE MARK
  MONGOLIAN VOWEL SEPARATOR
  EN QUAD
  EM QUAD
  EN SPACE
  EM SPACE
  THREE-PER-EM SPACE
  FOUR-PER-EM SPACE
  SIX-PER-EM SPACE
  FIGURE SPACE
  PUNCTUATION SPACE
  THIN SPACE
  HAIR SPACE
  LINE SEPARATOR
  PARAGRAPH SEPARATOR
  NARROW NO-BREAK SPACE
  MEDIUM MATHEMATICAL SPACE

Note that this option does not convert the following characters:

  CHARACTER TABULATION
  LINE FEED
  CARRIAGE RETURN
  IDEOGRAPHIC SPACE
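
The two lists above can be expressed as a single character class. The following is a sketch of equivalent behavior, not the module's actual code; unify_whitespaces_sketch is a hypothetical name, and the code points are the standard Unicode values for the character names listed.

```perl
use strict;
use warnings;

# Code points of the characters listed as converted:
# U+000B LINE TABULATION, U+000C FORM FEED, U+0085 NEXT LINE,
# U+00A0 NO-BREAK SPACE, U+1680 OGHAM SPACE MARK,
# U+180E MONGOLIAN VOWEL SEPARATOR, U+2000..U+200A (EN QUAD to HAIR SPACE),
# U+2028 LINE SEPARATOR, U+2029 PARAGRAPH SEPARATOR,
# U+202F NARROW NO-BREAK SPACE, U+205F MEDIUM MATHEMATICAL SPACE.
my $WHITESPACES = qr/[\x{000B}\x{000C}\x{0085}\x{00A0}\x{1680}\x{180E}\x{2000}-\x{200A}\x{2028}\x{2029}\x{202F}\x{205F}]/;

# Hypothetical stand-in for the unify_whitespaces option.
sub unify_whitespaces_sketch
{
    my $text = shift;
    $text =~ s/$WHITESPACES/ /g;   # each listed character becomes U+0020
    return $text;
}

# NO-BREAK SPACE and EM SPACE are converted; IDEOGRAPHIC SPACE (U+3000),
# the tab, and the line feed are left untouched.
my $unified = unify_whitespaces_sketch("a\x{00A0}b\x{2003}c\x{3000}d\te\n");
```

To also convert tabs, line breaks, or IDEOGRAPHIC SPACE, combine this option with tab2space, nl2space, or space_z2h as needed.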

AUTHOR

pawa <pawapawa@cpan.org>

LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
