Lingua::JA::TFWebIDF - TF*WebIDF calculator
    use Lingua::JA::TFWebIDF;
    use utf8;
    use feature qw/say/;
    use Data::Printer;

    my $tfidf = Lingua::JA::TFWebIDF->new(
        api               => 'YahooPremium',
        appid             => $appid,
        fetch_df          => 1,
        Furl_HTTP         => { timeout => 3 },
        driver            => 'TokyoCabinet',
        df_file           => './yahoo.tch',
        pos1_filter       => [qw/非自立 代名詞 数 ナイ形容詞語幹 副詞可能 サ変接続/],
        term_length_min   => 2,
        tf_min            => 2,
        df_min            => 1_0000,
        df_max            => 500_0000,
        ng_word           => [qw/編集 本人 自身 自分 たち さん/],
        fetch_unk_word_df => 0,
        concatenation_max => 100,
    );

    my %tf = (
        '自然言語処理' => 9,
        '自然言語'     => 6,
        '自然言語理解' => 4,
        '処理'         => 5,
        '解析'         => 4,
    );

    p $tfidf->tfidf(\%tf)->dump;
    p $tfidf->tfidf($text)->dump;
    p $tfidf->tf($text)->dump;

    for my $result (@{ $tfidf->tfidf($text)->list(20) })
    {
        my ($word, $score) = each %{$result};
        say "$word: $score";
    }
Lingua::JA::TFWebIDF calculates TF*WebIDF scores.
Compared with Lingua::JA::TFIDF, this module has the following advantages.
It supports Tokyo Cabinet, the Bing API, and many options.
The tfidf method accepts a hash reference (\%tf), which makes it easier to use other morphological analyzers.
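As a minimal sketch of the second point, the TF hash can be built from the output of any tokenizer. The helper name count_terms and the sample tokens are hypothetical and not part of the module; only the tfidf(\%tf) call mentioned in the final comment comes from the synopsis.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper (not part of Lingua::JA::TFWebIDF): counts how often
# each term occurs in a list of tokens produced by any analyzer.
sub count_terms {
    my ($tokens) = @_;
    my %tf;
    $tf{$_}++ for @{$tokens};
    return \%tf;
}

# Sample tokens, assumed to come from some other morphological analyzer.
my @tokens = qw/自然言語 処理 自然言語 解析 処理 自然言語/;
my $tf = count_terms(\@tokens);

# $tf now maps each distinct term to its count and could be passed to an
# existing instance as $tfidf->tfidf($tf).
```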
The new method creates a new Lingua::JA::TFWebIDF instance from %config.
The following configuration is used if you don't set %config.
    KEY                 DEFAULT VALUE
    -----------         ---------------
    pos1_filter         [qw/非自立 代名詞 数 ナイ形容詞語幹 副詞可能 接尾/]
    pos2_filter         []
    pos3_filter         []
    ng_word             []
    term_length_min     2
    term_length_max     30
    concatenation_max   30
    tf_min              1
    df_min              0
    df_max              250_0000_0000
    fetch_unk_word_df   0
    idf_type            1
    api                 'Yahoo'
    appid               undef
    driver              'Storable'
    df_file             undef
    fetch_df            1
    expires_in          365
    documents           250_0000_0000
    Furl_HTTP           undef
pos1_filter, pos2_filter, and pos3_filter are filters on the part-of-speech subcategories ('品詞細分類').
concatenation_max is the maximum number of terms that may be concatenated.
If 2 is specified, 2 consecutive nouns are concatenated. I recommend specifying a large value or 0.
If half-width spaces or tabs in the text are being ignored, replace them with full-width spaces.
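The replacement mentioned above can be sketched as follows. widen_spaces is a hypothetical helper name, not something the module provides, and the sample text is made up.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper (not part of Lingua::JA::TFWebIDF): replaces each
# half-width space or tab with a full-width space (U+3000).
sub widen_spaces {
    my ($text) = @_;
    $text =~ s/[ \t]/\x{3000}/g;
    return $text;
}

# The converted text can then be passed to tfidf() or tf().
my $text = widen_spaces("自然言語 処理\tと解析");
```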
With fetch_df set to 1, the DF score of a word that exists in the MeCab dictionary is fetched if it has not been fetched yet.
With fetch_df set to 0, the average DF score is used.
An 'unk word' (unknown word) is a word that does not exist in the MeCab dictionary.
With fetch_unk_word_df set to 1 and fetch_df set to 1, the DF scores of unk words are fetched as well.
For the other options, see Lingua::JA::WebIDF.
The tfidf method calculates TF*WebIDF scores. If a scalar value is passed, MeCab splits it into morphemes. To use another morphological analyzer, pass a hash reference that maps terms to their TF scores.
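To make the scoring idea concrete, here is a minimal sketch that assumes the textbook formula tf * log(N / df). This is an assumption for illustration: the formula the module actually applies depends on idf_type, which is documented in Lingua::JA::WebIDF. The helper tf_idf and the DF values are hypothetical.

```perl
use strict;
use warnings;
use utf8;

# Textbook TF*IDF, assumed here for illustration only:
#   score = tf * log(N / df)
# The module's own formula depends on idf_type and may differ.
sub tf_idf {
    my ($tf, $df, $documents) = @_;
    return 0 if $df <= 0;    # avoid division by zero for unseen terms
    return $tf * log($documents / $df);
}

my $documents = 25_000_000_000;    # the default 'documents' value, 250_0000_0000

my %tf = ('自然言語処理' => 9,         '解析' => 4);
my %df = ('自然言語処理' => 1_000_000, '解析' => 50_000_000);    # made-up DF values

# Score every term, then rank terms from highest to lowest score.
my %score  = map { $_ => tf_idf($tf{$_}, $df{$_}, $documents) } keys %tf;
my @ranked = sort { $score{$b} <=> $score{$a} } keys %score;
```

A term that is frequent in the text but rare on the web gets a high score, which is why the rarer term ranks first in this made-up data.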
The tf method calculates TF scores using MeCab.
pawa <pawapawa@cpan.org>
Lingua::JA::WebIDF
Lingua::JA::WebIDF::Driver::TokyoTyrant
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.