NAME

Lingua::EN::Ngram - Extract n-grams from texts and list them according to frequency and/or T-Score

SYNOPSIS

  # initialize
  use Lingua::EN::Ngram;
  $ngram = Lingua::EN::Ngram->new( file => './etc/walden.txt' );

  # calculate t-score; t-score is only available for bigrams
  $tscore = $ngram->tscore;
  foreach ( sort { $$tscore{ $b } <=> $$tscore{ $a } } keys %$tscore ) {

    print "$$tscore{ $_ }\t" . "$_\n";

  }

  # list trigrams according to frequency
  $trigrams = $ngram->ngram( 3 );
  foreach my $trigram ( sort { $$trigrams{ $b } <=> $$trigrams{ $a } } keys %$trigrams ) {

    print $$trigrams{ $trigram }, "\t$trigram\n";

  }

DESCRIPTION

This module is designed to extract n-grams from texts and list them according to frequency and/or T-Score.

To elaborate, the purpose of Lingua::EN::Ngram is to: 1) pull out all of the ngrams (multi-word phrases) in a given text, and 2) list these phrases according to their frequency. Using this module it is possible to create lists of the most common phrases in a text as well as order them by their probable occurrence, thus implying significance. This process is useful for the purposes of textual analysis and "distant reading".

The two-word phrases (bigrams) are also listable by their T-Score. The T-Score, as well as a number of the module's other methods, is calculated as per Nugues, P. M. (2006). An introduction to language processing with Perl and Prolog: An outline of theories, implementation, and application with special consideration of English, French, and German. Cognitive technologies. Berlin: Springer.

Finally, the intersection method enables the developer to find ngrams common to an arbitrary number of texts. Use this to look for common themes across a corpus.

METHODS

new

Create a new Lingua::EN::Ngram object:

  # initialize
  $ngram = Lingua::EN::Ngram->new;

new( text => $scalar )

Create a new Lingua::EN::Ngram object whose contents equal the content of a scalar:

  # initialize with scalar
  $ngram = Lingua::EN::Ngram->new( text => 'All good things must come to an end...' );

new( file => $scalar )

Create a new Lingua::EN::Ngram object whose contents equal the content of a file:

  # initialize with file
  $ngram = Lingua::EN::Ngram->new( file => './etc/rivers.txt' );

text

Set or get the text to be analyzed:

  # fill Lingua::EN::Ngram object with content 
  $ngram->text( 'All good things must come to an end...' );

  # get the Lingua::EN::Ngram object's content
  $text = $ngram->text;

tscore

Return a reference to a hash whose keys are bigrams and whose values are T-Scores -- a probabilistic calculation determining the significance of the bigram occurring in the text:

  # get t-score
  $tscore = $ngram->tscore;

  # list bigrams according to t-score
  foreach ( sort { $$tscore{ $b } <=> $$tscore{ $a } } keys %$tscore ) {

    print "$$tscore{ $_ }\t" . "$_\n";

  }

T-Score can only be computed against bigrams.
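
As a rough illustration of what a bigram T-Score measures, the sketch below uses one common textbook formulation: observed bigram frequency minus the frequency expected if the two words were independent, divided by the square root of the observed frequency. The helper name bigram_tscores and the whitespace tokenization are assumptions made for this example; the module's own calculation may differ in its details.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::EN::Ngram). Computes
#   t = ( f(w1 w2) - f(w1) * f(w2) / N ) / sqrt( f(w1 w2) )
# where f() is a raw frequency and N is the number of tokens.
# Assumption: tokens are separated by whitespace.
sub bigram_tscores {
    my ( $text ) = @_;
    my @words = split ' ', $text;
    my $total = scalar @words;

    # count single words and adjacent word pairs
    my ( %word_count, %bigram_count );
    $word_count{ $_ }++ for @words;
    $bigram_count{ join ' ', @words[ $_, $_ + 1 ] }++ for 0 .. $#words - 1;

    # compare observed bigram frequency with the frequency expected by chance
    my %tscore;
    foreach my $bigram ( keys %bigram_count ) {
        my ( $first, $second ) = split ' ', $bigram;
        my $expected = $word_count{ $first } * $word_count{ $second } / $total;
        $tscore{ $bigram } = ( $bigram_count{ $bigram } - $expected ) / sqrt( $bigram_count{ $bigram } );
    }
    return \%tscore;
}

# list bigrams according to t-score
my $scores = bigram_tscores( 'new york is big and new york is old' );
foreach ( sort { $$scores{ $b } <=> $$scores{ $a } || $a cmp $b } keys %$scores ) {
    print "$$scores{ $_ }\t$_\n";
}
```

Bigrams whose words co-occur more often than their individual frequencies would predict receive higher scores, which is why ordering by T-Score can differ from ordering by raw frequency.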

ngram( $scalar )

Return a hash reference whose keys are ngrams of length $scalar and whose values are the number of times the ngrams appear in the text:

  # create a list of trigrams
  $trigrams = $ngram->ngram( 3 );

  # display frequency
  foreach ( sort { $$trigrams{ $b } <=> $$trigrams{ $a } } keys %$trigrams ) {

    print $$trigrams{ $_ }, "\t$_\n";

  }

This method requires a single parameter and that parameter must be an integer. For example, to get a list of bigrams, pass 2 to ngram. To get a list of quadgrams, pass 4.
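
To make the idea of an ngram count concrete, here is a minimal sketch in plain Perl that slides a window of a given length over a list of tokens and tallies each phrase. The helper count_ngrams is hypothetical and splits on whitespace only, so punctuation stays attached to words; the module's own tokenization may well behave differently.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::EN::Ngram): count ngrams of
# length $n in $text by sliding a window over whitespace-separated tokens.
sub count_ngrams {
    my ( $text, $n ) = @_;
    die "n must be a positive integer\n" unless defined $n && $n =~ /^[1-9]\d*$/;

    my @tokens = split ' ', $text;
    my %count;

    # the range is empty when the text has fewer than $n tokens
    for my $i ( 0 .. @tokens - $n ) {
        $count{ join ' ', @tokens[ $i .. $i + $n - 1 ] }++;
    }
    return \%count;
}

# display bigram frequency
my $bigrams = count_ngrams( 'to be or not to be', 2 );
foreach ( sort { $$bigrams{ $b } <=> $$bigrams{ $a } || $a cmp $b } keys %$bigrams ) {
    print $$bigrams{ $_ }, "\t$_\n";
}
```

The returned hash reference has the same shape as the one described above (phrase as key, count as value), so the same sort-and-print loop applies.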

intersection( corpus => [ @array ], length => $scalar )

Return a hash reference whose keys are ngrams of length $scalar and whose values are the number of times the ngrams appear in a corpus of texts:

  # build corpus
  $walden = Lingua::EN::Ngram->new( file => './etc/walden.txt' );
  $rivers = Lingua::EN::Ngram->new( file => './etc/rivers.txt' );
  $corpus = Lingua::EN::Ngram->new;

  # compute intersections
  $intersections = $corpus->intersection( corpus => [ ( $walden, $rivers ) ], length => 5 );

  # display frequency
  foreach ( sort { $$intersections{ $b } <=> $$intersections{ $a }} keys %$intersections ) {

    print $$intersections{ $_ }, "\t$_\n";

  }

The value of corpus must be an array reference, and each element must be a Lingua::EN::Ngram object. The value of length must be an integer.
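
One way to picture an intersection is as a filter over several ngram-count hashes: keep only the phrases found in every text, then combine their counts. The sketch below does this with plain hash references shaped like the ones returned by ngram. The helper common_ngrams is hypothetical, and summing the counts is an assumption made for illustration, not a statement of how the module combines them.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::EN::Ngram): given hash references
# of ngram => count, keep only the ngrams present in every hash.
# Assumption: the combined value is the sum of the individual counts.
sub common_ngrams {
    my @counts = @_;
    return {} unless @counts;

    my %common;
    foreach my $ngram ( keys %{ $counts[ 0 ] } ) {

        # skip phrases missing from at least one text
        next if grep { !exists $_->{ $ngram } } @counts;

        $common{ $ngram } += $_->{ $ngram } for @counts;
    }
    return \%common;
}

# two small, made-up count hashes standing in for two texts
my $first  = { 'a b' => 2, 'b c' => 1 };
my $second = { 'a b' => 3, 'c d' => 4 };
my $shared = common_ngrams( $first, $second );
print $$shared{ $_ }, "\t$_\n" foreach sort keys %$shared;
```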

DISCUSSION

Given the increasing availability of full text materials, this module is intended to help "digital humanists" apply mathematical methods to the analysis of texts. For example, the developer can extract the high-frequency words using the ngram method and allow the user to search for those words in a concordance. The use of ngram( 2 ) simply returns the frequency of bigrams in a text, but the tscore method can order them in a more finely tuned manner.

Consider using T-Score-weighted bigrams as classification terms to supplement the "aboutness" of texts. Concatenate many texts together and look for common phrases written by the author. Compare these commonly used phrases to the commonly used phrases of other authors.

All ngrams returned by the ngram method include punctuation. This is intentional. Developers may want to remove ngrams containing such values from the output. Similarly, no effort has been made to remove commonly used words -- stop words -- from the methods. Consider the use of Lingua::StopWords, Lingua::EN::StopWords, or the creation of your own stop word list to make output more meaningful. The distribution comes with a script (bin/ngrams.pl) demonstrating how to remove punctuation from the displayed output. Another script (bin/intesections.pl) demonstrates how to extract and count ngrams across two texts.
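
A minimal sketch of such post-processing follows. It takes a hash reference of the kind returned by ngram and drops phrases that contain punctuation or consist entirely of stop words. The helper filter_ngrams and the short stop word list are hypothetical and are not taken from the scripts in the distribution. Note that the POSIX punctuation class also matches apostrophes and hyphens, so phrases containing words such as "don't" are dropped as well.

```perl
use strict;
use warnings;

# Hypothetical stop word list; substitute a list from Lingua::StopWords
# or one of your own.
my %stopwords = map { $_ => 1 } qw( a an and of the to in is it );

# Hypothetical helper (not part of Lingua::EN::Ngram): return a copy of an
# ngram-count hash without ngrams that contain punctuation or that are made
# up solely of stop words.
sub filter_ngrams {
    my ( $ngrams ) = @_;
    my %kept;
    foreach my $ngram ( keys %$ngrams ) {

        # drop phrases containing any punctuation character
        next if $ngram =~ /[[:punct:]]/;

        # keep the phrase only if at least one word is not a stop word
        my @words = split ' ', $ngram;
        next unless grep { !$stopwords{ lc $_ } } @words;

        $kept{ $ngram } = $$ngrams{ $ngram };
    }
    return \%kept;
}

# made-up counts standing in for the output of ngram( 2 )
my $bigrams  = { 'of the' => 9, 'walden pond' => 4, 'pond ,' => 3, 'the pond' => 2 };
my $filtered = filter_ngrams( $bigrams );
foreach ( sort { $$filtered{ $b } <=> $$filtered{ $a } } keys %$filtered ) {
    print $$filtered{ $_ }, "\t$_\n";
}
```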

Finally, this is not the only module supporting ngram extraction. See also Text::NSP.

TODO

There are probably a number of ways the module can be improved:

    * the distribution's license should probably be changed to the Perl Artistic License

    * the addition of alternative T-Score calculations would be nice

    * make sure the module works with character sets beyond ASCII (done, I think, as of version 0.02)

CHANGES

    * March 28, 2018 (version 0.03) - removed lower casing of letters and install ngrams script

    * November 25, 2010 (version 0.02) - added non-Latin characters

    * September 12, 2010 (version 0.01) - initial release but an almost complete rewrite of Lingua::EN::Bigram

ACKNOWLEDGEMENTS

T-Score, as well as a number of the module's methods, is calculated as per Nugues, P. M. (2006). An introduction to language processing with Perl and Prolog: An outline of theories, implementation, and application with special consideration of English, French, and German. Cognitive technologies. Berlin: Springer.

AUTHOR

Eric Lease Morgan <eric_morgan@infomotions.com>