The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Lingua::JA::WebIDF - WebIDF calculator

SYNOPSIS

  use Lingua::JA::WebIDF;

  my $webidf = Lingua::JA::WebIDF->new(%config);

  print $webidf->idf("東京"); # low
  print $webidf->idf("スリジャヤワルダナプラコッテ"); # high

DESCRIPTION

Lingua::JA::WebIDF calculates WebIDF weight.

WebIDF(Inverse Document Frequency) weight represents the rarity of a word on the Web. The WebIDF weight of a rare word is high. Conversely, the WebIDF weight of a common word is low.

IDF is based on the intuition that a query term which occurs in many documents is not a good discriminator and should be given less weight than one which occurs in few documents.

METHODS

new( %config || \%config )

Creates a new Lingua::JA::WebIDF instance.

The following configuration is used if you don't set %config.

  KEY                 DEFAULT VALUE
  -----------         ---------------
  idf_type            1
  api                 'YahooPremium'
  appid               undef
  driver              'TokyoCabinet'
  df_file             './df.tch'
  fetch_df            0
  expires_in          365
  documents           250_0000_0000
  Furl_HTTP           undef
  verbose             1
idf_type => 1 || 2 || 3

The type1 is the most commonly cited form of IDF.

                   N
  idf(t_i) = log -----  (1)
                  n_i

  N  : the number of documents
  n_i: the number of documents which contain term t_i
  t_i: term

The type2 is a simple version of the RSJ weight.

                   N - n_i + 0.5
  idf(t_i) = log ----------------  (2)
                    n_i + 0.5

The type3 is a modification of (2).

                   N + 0.5
  idf(t_i) = log -----------  (3)
                  n_i + 0.5
api => 'Yahoo' || 'YahooPremium'

Uses the specified Web API when fetches WebDF(Document Frequency).

driver => 'Storable' || 'TokyoCabinet'

Fetches and saves WebDF with the specified driver.

df_file => $path

Saves WebDF to the specified path.

In order to reduce access to Web API, please download a big df file from http://misc.pawafuru.com/webidf/.

I recommend that you change the file depending on the type of Web API you specifies because WebDF may be different depending on it.

fech_df => 0

Never fetches WebDF from the Web if 0 is specified.

If the WebDF you want to know has already saved, it is used. If it is not so, returns undef.

expires_in => $days

If 365 is specified, WebDF expires in 365 days after fetches it.

Furl_HTTP => \%option

Sets the options of Furl::HTTP->new.

If you want to use proxy server, you have to use this option.

verbose => 1 || 0

If 1 is specified, shows verbose error messages.

idf($word)

Calculates the WebIDF weight of $word via df($word) method.

df($word)

Fetches the WebDF of $word.

If the WebDF of $word has not been saved yet or has expired, fetches it by using the Web API you specified and saves it.

If the WebDF of $word has expired and fetch_df is 0, the expired WebDF is used.

db_open($mode)

Opens the database file which is located in $path.

If you use TokyoCabinet, you have to open the database file via this method before idf|df|db_close|purge method is called.

$mode is 'read' or 'write'.

db_close

Closes the database file which is located in $path.

This method is called automatically when the object is destroyed, so you might not need to use this method explicitly.

purge($expires_in)

Purges old data in df_file.

If 365 is specified, the data which 365 days elapsed are purged.

AUTHOR

pawa <pawapawa@cpan.org>

SEE ALSO

Lingua::JA::TFWebIDF

Lingua::JA::WebIDF::Driver::TokyoTyrant

Yahoo API: http://developer.yahoo.co.jp/

Tokyo Cabinet: http://fallabs.com/tokyocabinet/

S. Robertson, Understanding inverse document frequency: on theoretical arguments for IDF. Journal of Documentation 60, 503-520, 2004.

LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.