NAME
Lingua::JA::WebIDF - WebIDF calculator
SYNOPSIS
use Lingua::JA::WebIDF;
my $webidf = Lingua::JA::WebIDF->new(%config);
print $webidf->idf("東京"); # low
print $webidf->idf("スリジャヤワルダナプラコッテ"); # high
DESCRIPTION
Lingua::JA::WebIDF calculates WebIDF weight.
WebIDF(Inverse Document Frequency) weight represents the rarity of a word on the Web. The WebIDF weight of a rare word is high. Conversely, the WebIDF weight of a common word is low.
IDF is based on the intuition that a query term which occurs in many documents is not a good discriminator and should be given less weight than one which occurs in few documents.
METHODS
new( %config || \%config )
Creates a new Lingua::JA::WebIDF instance.
The following configuration is used if you don't set %config.
KEY DEFAULT VALUE
----------- ---------------
idf_type 1
api 'YahooPremium'
appid undef
driver 'TokyoCabinet'
df_file './df.tch'
fetch_df 0
expires_in 365
documents 250_0000_0000
Furl_HTTP undef
verbose 1
- idf_type => 1 || 2 || 3
-
The type1 is the most commonly cited form of IDF.
N idf(t_i) = log ----- (1) n_i N : the number of documents n_i: the number of documents which contain term t_i t_i: term
The type2 is a simple version of the RSJ weight.
N - n_i + 0.5 idf(t_i) = log ---------------- (2) n_i + 0.5
The type3 is a modification of (2).
N + 0.5 idf(t_i) = log ----------- (3) n_i + 0.5
- api => 'Yahoo' || 'YahooPremium'
-
Uses the specified Web API when fetches WebDF(Document Frequency).
- driver => 'Storable' || 'TokyoCabinet'
-
Fetches and saves WebDF with the specified driver.
- df_file => $path
-
Saves WebDF to the specified path.
In order to reduce access to Web API, please download a big df file from http://misc.pawafuru.com/webidf/.
I recommend that you change the file depending on the type of Web API you specifies because WebDF may be different depending on it.
- fech_df => 0
-
Never fetches WebDF from the Web if 0 is specified.
If the WebDF you want to know has already saved, it is used. If it is not so, returns undef.
- expires_in => $days
-
If 365 is specified, WebDF expires in 365 days after fetches it.
- Furl_HTTP => \%option
-
Sets the options of Furl::HTTP->new.
If you want to use proxy server, you have to use this option.
- verbose => 1 || 0
-
If 1 is specified, shows verbose error messages.
idf($word)
Calculates the WebIDF weight of $word via df($word) method.
df($word)
Fetches the WebDF of $word.
If the WebDF of $word has not been saved yet or has expired, fetches it by using the Web API you specified and saves it.
If the WebDF of $word has expired and fetch_df is 0, the expired WebDF is used.
db_open($mode)
Opens the database file which is located in $path.
If you use TokyoCabinet, you have to open the database file via this method before idf|df|db_close|purge method is called.
$mode is 'read' or 'write'.
db_close
Closes the database file which is located in $path.
This method is called automatically when the object is destroyed, so you might not need to use this method explicitly.
purge($expires_in)
Purges old data in df_file.
If 365 is specified, the data which 365 days elapsed are purged.
AUTHOR
pawa <pawapawa@cpan.org>
SEE ALSO
Lingua::JA::WebIDF::Driver::TokyoTyrant
Yahoo API: http://developer.yahoo.co.jp/
Tokyo Cabinet: http://fallabs.com/tokyocabinet/
S. Robertson, Understanding inverse document frequency: on theoretical arguments for IDF. Journal of Documentation 60, 503-520, 2004.
LICENSE
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.