The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Lingua::JA::TFWebIDF - TF*WebIDF calculator

SYNOPSIS

  use Lingua::JA::TFWebIDF;
  use utf8;
  use feature qw/say/;
  use Data::Printer;

  my $tfidf = Lingua::JA::TFWebIDF->new(
      pos1_filter => [qw/非自立 代名詞 ナイ形容詞語幹 副詞可能 サ変接続/],
      ng_word     => [qw/編集 本人 自身 自分 たち さん 年 月 日/],
  );

  my %tf = (
      '自然言語処理' => 9,
      '自然言語'     => 6,
      '自然言語理解' => 4,
      '処理'         => 5,
      '解析'         => 4,
  );

  p $tfidf->tfidf(\%tf)->dump;

  p $tfidf->tfidf($text)->dump;
  p $tfidf->tf($text)->dump;

  for my $result (@{ $tfidf->tfidf($text)->list(20) })
  {
      my ($word, $weight) = each %{$result};

      say "$word: $weight";
  }

DESCRIPTION

Lingua::JA::TFWebIDF calculates TF*WebIDF weight.

Compared with Lingua::JA::TFIDF, this module has the following advantages.

  • supports Tokyo Cabinet and many options.

  • tfidf method accepts \%tf. (This facilitates the use of other morphological analyzers.)

METHODS

new( %config || \%config )

Creates a new Lingua::JA::TFWebIDF instance.

The following configuration is used if you don't set %config.

  KEY                 DEFAULT VALUE
  -----------         ---------------
  pos1_filter         [qw/非自立 代名詞 ナイ形容詞語幹 副詞可能/]
  pos2_filter         []
  pos3_filter         []
  ng_word             []
  term_length_min     2
  term_length_max     30
  concat_max          30
  tf_min              1
  df_min              0
  df_max              250_0000_0000
  fetch_unk_word_df   0
  db_auto             1
  guess_df            1

  idf_type            1
  api                 'YahooPremium'
  appid               undef
  driver              'TokyoCabinet'
  df_file             './df.tch'
  fetch_df            0
  expires_in          365
  documents           250_0000_0000
  Furl_HTTP           undef
  verbose             0
pos(1|2|3)_filter => \@pos

The filters of '品詞細分類'.

concat_max => $num

The maximum value of the number of term concatenations.

If 2 is specified, 2 consecutive nouns are concatenated. I recommend that you specify a large value or 0.

fetch_df => 0 || 1

If df_file has the (Web)DF and it has not expired yet, it is used.

If it is not so,

1: fetches the DF of the word which exists in the dictionary of MeCab.

0: if guess_df is 1, guesses the DF.

if guess_df is 0, does not give TF*WebIDF weight.

fetch_unk_word_df => 0 || 1

'unk word' is a word which not exists in the dictionary of MeCab.

If df_file has the (Web)DF and it has not expired yet, it is used.

If it is not so,

1: fetches the DF of the unk word if fetch_df is 1.

0: if guess_df is 1, guesses the DF.

if guess_df is 0, does not give TF*WebIDF weight.

db_auto => 0 || 1

If 1 is specified, (open|close)s the WebDF(Document Frequency) database automatically. This option works when tfidf method is called.

guess_df => 0 || 1

If df_file has the (Web)DF and it has not expired yet, it is used.

If it is not so,

1: guesses the DF based on the string length of $word.

0: doesn't give TF*WebIDF weight.

idf_type, api, appid, driver, df_file, expires_in, documents, Furl_HTTP, verbose

See Lingua::JA::WebIDF.

tfidf( $text || \%tf )

Calculates TF*WebIDF weight. If scalar value is set, MeCab separates the value into appropriate morphemes. If you want to use other morphological analyzers, you have to set a hash reference which contains terms and their TF(Term Frequency).

tf($text)

Calculates TF(Term Frequency) via MeCab.

idf, df, db_open, db_close, purge

See Lingua::JA::WebIDF.

AUTHOR

pawa <pawapawa@cpan.org>

SEE ALSO

Lingua::JA::TermExtractor

Lingua::JA::WebIDF

Lingua::JA::WebIDF::Driver::TokyoTyrant

LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.