Lingua::JA::WebIDF - WebIDF calculator
    use Lingua::JA::WebIDF;

    my $webidf = Lingua::JA::WebIDF->new(%config);

    print $webidf->idf("東京");                          # low
    print $webidf->idf("スリジャヤワルダナプラコッテ");  # high
Lingua::JA::WebIDF calculates WebIDF weight.
WebIDF (Web Inverse Document Frequency) weight represents the rarity of a word on the Web. A rare word has a high WebIDF weight; a common word has a low one.
IDF is based on the intuition that a query term which occurs in many documents is not a good discriminator and should be given less weight than one which occurs in few documents.
Creates a new Lingua::JA::WebIDF instance.
The following configuration is used if you don't set %config.

    KEY          DEFAULT VALUE
    -----------  ---------------
    idf_type     1
    api          'YahooPremium'
    appid        undef
    driver       'TokyoCabinet'
    df_file      './df.tch'
    fetch_df     0
    expires_in   365
    documents    250_0000_0000
    Furl_HTTP    undef
    verbose      1
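As an illustration, a few of the defaults above could be overridden as follows. This is a minimal sketch: the key names come from the table, but the appid value is a placeholder and the chosen values are only examples.

```perl
use strict;
use warnings;
use Lingua::JA::WebIDF;

# Keys not listed here keep the defaults from the table above.
my $webidf = Lingua::JA::WebIDF->new(
    api        => 'YahooPremium',
    appid      => 'YOUR_APPID',   # placeholder; supply your own application ID
    driver     => 'TokyoCabinet',
    df_file    => './df.tch',
    fetch_df   => 1,              # example value: allow fetching WebDF from the Web
    expires_in => 180,            # example value, in days
    verbose    => 1,
);
```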
Type 1 (idf_type 1) is the most commonly cited form of IDF.
$$\mathrm{idf}(t_i) = \log \frac{N}{n_i} \tag{1}$$

    N   : the number of documents
    n_i : the number of documents which contain term t_i
    t_i : term
Type 2 (idf_type 2) is a simple version of the RSJ (Robertson/Sparck Jones) weight.
$$\mathrm{idf}(t_i) = \log \frac{N - n_i + 0.5}{n_i + 0.5} \tag{2}$$
Type 3 (idf_type 3) is a modification of (2).
$$\mathrm{idf}(t_i) = \log \frac{N + 0.5}{n_i + 0.5} \tag{3}$$
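To make the three formulas concrete, the sketch below evaluates them for a rare and a common term. The helper idf_sketch is hypothetical and not part of Lingua::JA::WebIDF; it uses Perl's natural-log log(), and the base the module itself uses is an assumption. The document frequencies are made-up inputs.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's API): evaluates formulas (1)-(3)
# for a document count $N and a document frequency $df.
sub idf_sketch {
    my ($type, $N, $df) = @_;

    return log( $N / $df )                       if $type == 1;
    return log( ($N - $df + 0.5) / ($df + 0.5) ) if $type == 2;
    return log( ($N + 0.5) / ($df + 0.5) )       if $type == 3;

    die "unknown idf type: $type";
}

my $N = 250_0000_0000;    # default value of 'documents' (25 billion)

# Made-up document frequencies: a rare term and a common term.
for my $df (1_000, 1_000_000_000) {
    printf "df=%-11d type1=%.3f type2=%.3f type3=%.3f\n",
        $df, map { idf_sketch($_, $N, $df) } 1 .. 3;
}
```

Note that formula (1) is undefined when n_i is 0, whereas the 0.5 terms keep (2) and (3) finite in that case.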
The api option selects the Web API used to fetch WebDF (Web Document Frequency).
The driver option selects the driver used to fetch and save WebDF.
The df_file option sets the path where WebDF is saved.
To reduce the number of Web API requests, please download a large df file from http://misc.pawafuru.com/webidf/.
I recommend using a different file for each type of Web API you specify, because WebDF may differ between APIs.
If fetch_df is 0, WebDF is never fetched from the Web.
If the WebDF you want has already been saved, it is used. Otherwise, undef is returned.
If expires_in is 365, WebDF expires 365 days after it is fetched.
Furl_HTTP sets the options passed to Furl::HTTP->new.
If you want to use a proxy server, you have to use this option.
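A proxy setup might look like the sketch below. It assumes that Furl_HTTP takes a hash reference whose contents are handed to Furl::HTTP->new; proxy and timeout are options of Furl::HTTP, while the proxy URL and appid are placeholders.

```perl
use strict;
use warnings;
use Lingua::JA::WebIDF;

my $webidf = Lingua::JA::WebIDF->new(
    appid     => 'YOUR_APPID',    # placeholder
    fetch_df  => 1,
    # Assumption: a hash reference of Furl::HTTP->new options.
    Furl_HTTP => {
        proxy   => 'http://proxy.example.com:8080',   # hypothetical proxy URL
        timeout => 5,                                 # seconds
    },
);
```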
If verbose is 1, verbose error messages are shown.
idf($word) calculates the WebIDF weight of $word via the df($word) method.
df($word) fetches the WebDF of $word.
If the WebDF of $word has not been saved yet or has expired, it is fetched via the specified Web API and saved.
If the WebDF of $word has expired and fetch_df is 0, the expired WebDF is used.
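The lookup rules described for fetch_df, expires_in, and df can be sketched in plain Perl as follows. This is not the module's implementation: lookup_df, the in-memory %store, and the fetch callback are hypothetical stand-ins for the df file and the Web API call.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-in for the df file: word => [df, fetched_at].
my %store;

# Options: fetch_df, expires_in (days), now (epoch seconds, optional),
# fetch (code reference standing in for the Web API call).
sub lookup_df {
    my ($word, %opt) = @_;
    my $now     = $opt{now} // time;
    my $max_age = $opt{expires_in} * 24 * 60 * 60;    # days -> seconds

    my $entry   = $store{$word};
    my $expired = $entry && ($now - $entry->[1] > $max_age);

    if ($opt{fetch_df}) {
        # Fetch and save when the value is missing or has expired.
        if (!$entry || $expired) {
            $entry = $store{$word} = [ $opt{fetch}->($word), $now ];
        }
        return $entry->[0];
    }

    # fetch_df is 0: use whatever is saved (even if expired), else undef.
    return $entry ? $entry->[0] : undef;
}
```

Passing now explicitly keeps the sketch deterministic; in real use it would default to the current time.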
Opens the database file located at $path.
If you use TokyoCabinet, you have to open the database file via this method before calling the idf, df, db_close, or purge method.
$mode is 'read' or 'write'.
Closes the database file located at $path.
This method is called automatically when the object is destroyed, so you usually do not need to call it explicitly.
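A typical open, query, close sequence might look like the sketch below. The method name db_open is an assumption (the text above names only idf, df, db_close, and purge); the mode values are the ones listed above.

```perl
use strict;
use warnings;
use utf8;
use Lingua::JA::WebIDF;

my $webidf = Lingua::JA::WebIDF->new(
    driver  => 'TokyoCabinet',
    df_file => './df.tch',
);

$webidf->db_open('read');        # assumed method name; mode is 'read' or 'write'

my $df = $webidf->df('東京');    # undef if nothing is saved and fetch_df is 0
print defined $df ? "df: $df\n" : "no saved WebDF\n";

$webidf->db_close;               # optional; also happens on object destruction
```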
Purges old data in df_file.
If 365 is specified, data older than 365 days is purged.
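The purge rule can be illustrated with a plain hash. This is a sketch only: purge_sketch and the word => [df, fetched_at] record layout are hypothetical and do not reflect how the module stores data.

```perl
use strict;
use warnings;

# Hypothetical helper: deletes entries older than $expires_in days from a
# hash reference of word => [df, fetched_at] and returns how many were removed.
sub purge_sketch {
    my ($store, $expires_in, $now) = @_;
    $now //= time;
    my $max_age = $expires_in * 24 * 60 * 60;    # days -> seconds

    my @old = grep { $now - $store->{$_}[1] > $max_age } keys %$store;
    delete @{$store}{@old};

    return scalar @old;
}
```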
pawa <pawapawa@cpan.org>
Lingua::JA::TFWebIDF
Lingua::JA::WebIDF::Driver::TokyoTyrant
Yahoo API: http://developer.yahoo.co.jp/
Tokyo Cabinet: http://fallabs.com/tokyocabinet/
S. Robertson, Understanding inverse document frequency: on theoretical arguments for IDF. Journal of Documentation 60, 503-520, 2004.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.