NAME

Text::TFIDF::Ngram - Compute the TF-IDF measure for ngram phrases

VERSION

version 0.0401

SYNOPSIS

  use Data::Dumper;
  use Text::TFIDF::Ngram;
  my $obj = Text::TFIDF::Ngram->new(
    files => [qw( foo.txt bar.txt )],
    size  => 3,
  );
  my $w = $obj->tf( 'foo.txt', 'foo bar baz' );
  my $x = $obj->idf('foo bar baz');
  my $y = $obj->tfidf( 'foo.txt', 'foo bar baz' );
  printf "TF: %.3f, IDF: %.3f, TFIDF: %.3f\n", $w, $x, $y;
  my $z = $obj->tfidf_by_file;
  print Dumper $z;

DESCRIPTION

This module computes the TF-IDF ("term frequency-inverse document frequency") measure for a corpus of text documents.

For a working example program, please see the eg/analyze file in the distribution.

ATTRIBUTES

files

ArrayRef of filenames to use in the ngram processing.

size

Integer ngram phrase size. Default is 1.
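
To illustrate what the size value means, here is a minimal sketch that splits text into word ngrams of a given size. The ngrams helper is hypothetical and written only for this example; it is not the module's implementation, and the module's own tokenization may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: return every run of $size consecutive words in $text.
sub ngrams {
    my ( $text, $size ) = @_;
    my @words = split ' ', $text;
    my @phrases;
    # The range is empty when the text has fewer than $size words.
    for my $i ( 0 .. @words - $size ) {
        push @phrases, join ' ', @words[ $i .. $i + $size - 1 ];
    }
    return @phrases;
}

print "$_\n" for ngrams( 'the quick brown fox', 2 );
```

With a size of 2, the four-word input yields three phrases: "the quick", "quick brown" and "brown fox".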

stopwords

Boolean indicating that phrases with stopwords will be ignored. Default is 1.

punctuation

Regular expression used to strip unwanted punctuation. Passing '' or 0 to the constructor overrides the default, so that no characters are excluded from the results.

Default: qr/[-!"#$%&()*+,.\/\\:;<=>?@\[\]^_`{|}~]/

Note that the default does not exclude the single quote.
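
As a rough illustration of the note above, the following sketch applies the default pattern to a sample string with a plain substitution. The sample text and variable names are made up for this example, and how the module applies the pattern internally may differ.

```perl
use strict;
use warnings;

# The default pattern, copied from the documentation above.
my $punctuation = qr/[-!"#$%&()*+,.\/\\:;<=>?@\[\]^_`{|}~]/;

# Sample text containing a colon, double quotes, an exclamation mark
# and two single quotes.
my $text = q{Don't panic: it's "mostly" harmless!};

# Remove every character matched by the pattern; single quotes are not matched.
( my $clean = $text ) =~ s/$punctuation//g;

print "$clean\n";
```

The colon, double quotes and exclamation mark are removed, while the apostrophes in "Don't" and "it's" survive.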

lowercase

Boolean to render the ngrams in lowercase. Default is 0.

digits

Boolean to exclude digits from the ngram results. Default is 0.

counts

HashRef of the ngram counts of each processed file. This is a computed attribute; any value passed to the constructor is ignored.

file_tfidf

HashRef of the TF-IDF values in each processed file. This is a computed attribute; any value passed to the constructor is ignored.

METHODS

new

  $obj = Text::TFIDF::Ngram->new(
    files       => \@files,
    size        => $size,
    stopwords   => $stopwords,
    punctuation => $punctuation,
    lowercase   => $lowercase,
    digits      => $digits,
  );

Create a new Text::TFIDF::Ngram object. If the files argument is passed in, the ngrams of each file are stored in the counts attribute.

BUILD

Load the phrase counts of the given files.

tf

  $tf = $obj->tf( $file, $phrase );

Returns the frequency of the given phrase in the document file. This is not the raw count of the phrase, but rather its relative frequency: the proportion of times it is seen.
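
The difference between a raw count and a relative frequency can be sketched as follows. The %counts data and the relative_frequency helper are hypothetical, and the sketch assumes the frequency is the phrase count divided by the total of all phrase counts in the file; the module's exact calculation may differ.

```perl
use strict;
use warnings;
use List::Util qw( sum );

# Hypothetical phrase counts for a single file (phrase => raw count).
my %counts = ( 'foo bar' => 3, 'bar baz' => 1 );

# Assumed definition: the phrase count divided by the total of all counts.
sub relative_frequency {
    my ( $counts, $phrase ) = @_;
    return 0 unless exists $counts->{$phrase};
    return $counts->{$phrase} / sum( values %$counts );
}

printf "%.2f\n", relative_frequency( \%counts, 'foo bar' );    # 3 / 4
```

Here the raw count of 'foo bar' is 3, but its relative frequency is 0.75 because the file holds four phrase occurrences in total.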

idf

  $idf = $obj->idf($phrase);

Returns the inverse document frequency of a phrase.
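
For readers unfamiliar with the measure, a textbook form of inverse document frequency is the logarithm of the number of documents divided by the number of documents containing the phrase. The sketch below uses that form with a base-10 logarithm; the %corpus data, the helper name and the choice of base are assumptions made for this example, not a statement of the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical corpus: file => { phrase => count }.
my %corpus = (
    'foo.txt' => { 'foo bar' => 3, 'bar baz' => 1 },
    'bar.txt' => { 'bar baz' => 2 },
);

# Textbook IDF: log10( number of documents / documents containing the phrase ).
sub inverse_document_frequency {
    my ( $corpus, $phrase ) = @_;
    my $containing = grep { exists $_->{$phrase} } values %$corpus;
    return unless $containing;    # phrase is absent from every document
    return log( keys(%$corpus) / $containing ) / log(10);
}

printf "%.3f\n", inverse_document_frequency( \%corpus, 'foo bar' );    # log10(2 / 1)
printf "%.3f\n", inverse_document_frequency( \%corpus, 'bar baz' );    # log10(2 / 2)
```

Under this definition a phrase found in every document scores 0, so common phrases contribute nothing to the combined TF-IDF weight.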

tfidf

  $tfidf = $obj->tfidf( $file, $phrase );

Computes the TF-IDF weight for the given file and phrase. If the phrase is not in the corpus, a warning is issued and undef is returned.

tfidf_by_file

  $tfidf = $obj->tfidf_by_file;

Constructs a HashRef of all files with all phrases and their TF-IDF values.
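
One common use of such a structure is ranking the phrases of each file by weight. The sketch below assumes the returned HashRef has the shape { file => { phrase => tfidf } }; the sample data and the ranked_phrases helper are hypothetical and stand in for a real call to tfidf_by_file.

```perl
use strict;
use warnings;

# Hypothetical result, assuming the shape { file => { phrase => tfidf } }.
my $tfidf = {
    'foo.txt' => { 'foo bar' => 0.226, 'bar baz' => 0 },
    'bar.txt' => { 'bar baz' => 0 },
};

# Return the phrases of one file, highest TF-IDF first (ties sorted alphabetically).
sub ranked_phrases {
    my ($scores) = @_;
    return sort { $scores->{$b} <=> $scores->{$a} || $a cmp $b } keys %$scores;
}

for my $file ( sort keys %$tfidf ) {
    print "$file\n";
    printf "  %-10s %.3f\n", $_, $tfidf->{$file}{$_}
        for ranked_phrases( $tfidf->{$file} );
}
```

Breaking ties alphabetically keeps the output stable between runs, since Perl hash key order is not guaranteed.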

AUTHOR

Gene Boggs <gene@cpan.org>

COPYRIGHT AND LICENSE

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
