
NAME

Text::NGrammer - Pure Perl extraction of n-grams and skip-grams

SYNOPSIS

 use Text::NGrammer;
 my $n = Text::NGrammer->new;
 
 # prints [ (a,rose) (rose,is) (is,a) (a,flower) ]
 my @ngrams = $n->ngrams_text(2, "a rose is a flower");
 print "[ ";
 for my $ngram (@ngrams) {
   print "(",$ngram->[0],",",$ngram->[1],") ";
 }
 print "]\n";
 
 # prints [ (a,is) (rose,a) (is,flower) ]
 my @skipgrams = $n->skipgrams_text(2, 1, "a rose is a flower");
 print "[ ";
 for my $skipgram (@skipgrams) {
   print "(",$skipgram->[0],",",$skipgram->[1],") ";
 }
 print "]\n";

DESCRIPTION

The module provides a way to extract both n-grams and skip-grams from a text, a sentence or an array of tokens.

An n-gram is defined as an ordered sequence of n consecutive tokens in a piece of text. Some frequently used n-grams have their own names: 2-grams, for example, are also called bigrams, and they represent all the ordered pairs of adjacent words in a text. For instance, the text "a rose is a flower" is composed of 4 bigrams: "a rose", "rose is", "is a", "a flower".

A skip-gram is defined as an ordered sequence of n tokens from a text with a predetermined interval k between them. For instance, the skip-grams with n=2 and k=1 for a piece of text are all the sequences of 2 tokens with 1 token skipped between them. The text "a rose is a flower" is composed of 3 skip-grams with n=2 and k=1: "a is", "rose a", "is flower". A skip-gram with k=0 is the same as an n-gram of the same size, e.g., a 2-skip-gram with k=0 is the same as a bigram.
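The definition above can be sketched in a few lines of plain Perl. This is only an illustration and not the module's actual implementation: skipgrams_sketch is a hypothetical helper, and it assumes that "interval k" means exactly k tokens are skipped between consecutive elements of a skip-gram, which is the reading that matches the examples above.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::NGrammer: returns every sequence of
# $n tokens in which consecutive elements are separated by $k skipped tokens.
sub skipgrams_sketch {
    my ($n, $k, @tokens) = @_;
    return () if $n < 1 || $k < 0;
    my $step = $k + 1;                  # distance between two chosen tokens
    my $span = ($n - 1) * $step + 1;    # tokens covered by one skip-gram
    my @grams;
    # The range is empty when there are fewer than $span tokens.
    for my $start (0 .. @tokens - $span) {
        push @grams, [ map { $tokens[$start + $_ * $step] } 0 .. $n - 1 ];
    }
    return @grams;
}

my @tokens = split ' ', "a rose is a flower";
my @skip = skipgrams_sketch(2, 1, @tokens);
# prints (a,is) (rose,a) (is,flower)
print join(' ', map { '(' . join(',', @$_) . ')' } @skip), "\n";
```

Calling the same helper with k=0 yields the four bigrams listed earlier, which illustrates why an n-gram can be treated as a skip-gram with no skipped tokens.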

A broader, and better, discussion on n-grams and skip-grams can be found at https://en.wikipedia.org/wiki/N-gram.

Behind the scenes, the module uses the Lingua::Sentence module to split the text into sentences, so that the n-grams and skip-grams never cross sentence boundaries. The module also provides ways to extract the n-grams and skip-grams from single sentences, i.e., without invoking Lingua::Sentence, or from an array of tokens if the application wants to use a custom tokenization for the text. The language to be used for the sentencer must be specified in the constructor; if not present, English is used by default.
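To illustrate why sentence splitting matters, the following sketch extracts bigrams sentence by sentence. It does not use Lingua::Sentence: the regular expression is a naive stand-in that breaks on ".", "!" and "?", and both ngrams_sketch and ngrams_per_sentence are hypothetical helpers that assume whitespace-separated tokens.

```perl
use strict;
use warnings;

# Hypothetical helper: plain n-grams over a list of tokens.
sub ngrams_sketch {
    my ($n, @tokens) = @_;
    return () if $n < 1;
    # The range is empty when there are fewer than $n tokens.
    return map { [ @tokens[$_ .. $_ + $n - 1] ] } 0 .. @tokens - $n;
}

# Hypothetical helper: split the text into sentences first, then extract
# the n-grams of each sentence separately.
sub ngrams_per_sentence {
    my ($n, $text) = @_;
    # Naive splitter, an assumption made only for this illustration.
    my @sentences = grep { /\S/ } split /[.!?]+/, $text;
    return map { ngrams_sketch($n, split ' ', $_) } @sentences;
}

my @bigrams = ngrams_per_sentence(2, "I like roses. They smell nice.");
# prints (I,like) (like,roses) (They,smell) (smell,nice)
print join(' ', map { '(' . join(',', @$_) . ')' } @bigrams), "\n";
```

Note that the pair (roses,They) is absent from the output because it would span two sentences. A naive splitter like this one also breaks on abbreviations, which is one reason to rely on a dedicated sentence splitter instead.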

All the methods return the n-grams and skip-grams as arrays of references to arrays of length n, where n is specified as a parameter of the method. Sentences or, more generally, pieces of text are not divided into n-grams or skip-grams if they are not long enough to perform the operation. For instance, asking for all the n-grams of length 4 for the sentence "I am Francesco" returns an empty array because there are only 3 tokens in the sentence.

 my $ngrammer = Text::NGrammer->new();
 
 my @ngrams = $ngrammer->ngrams_array(3, ("a", "b", "c", "d"));
 my $ngram = $ngrams[0]; # the first ngram
 print $ngram->[1]; # prints "b"
 
 my @empty = $ngrammer->ngrams_array(5, ("a", "b", "c", "d"));
 print "empty!" if (@empty == 0); # prints "empty!"

METHODS

new(%)

Creates a new Text::NGrammer object and returns it. The only parameter accepted by the constructor is the language for the sentencer. For instance, to create an NGrammer for German, the syntax is the following:

 my $german_ngrammer = Text::NGrammer->new(lang => 'de');

If no language is specified, English is assumed. The supported languages are the ones supported by Lingua::Sentence.

skipgrams_text($n, $k, $text)

Extracts all the skip-grams of length $n with interval $k from the $text. $text is broken into sentences by the module Lingua::Sentence in such a way that the skip-grams do not cross the sentence boundaries.

skipgrams_sentence($n, $k, $sentence)

Extracts all the skip-grams of length $n with interval $k from the $sentence. $sentence is not broken into sub-sentences but only into tokens representing single words.

skipgrams_array($n, $k, @array)

Extracts all the skip-grams of length $n with interval $k from the @array. Exactly as in the case of skipgrams_sentence, the module Lingua::Sentence is not used.

ngrams_text($n, $text)

Extracts all the n-grams of length $n from the $text. $text is broken into sentences by the module Lingua::Sentence in such a way that the n-grams do not cross the sentence boundaries. This is equivalent to skipgrams_text($n, 0, $text).

ngrams_sentence($n, $sentence)

Extracts all the n-grams of length $n from the $sentence. $sentence is not broken into sub-sentences but only into tokens representing single words. This is equivalent to skipgrams_sentence($n, 0, $sentence).

ngrams_array($n, @array)

Extracts all the n-grams of length $n from the @array. Exactly as in the case of ngrams_sentence, the module Lingua::Sentence is not used. This is equivalent to skipgrams_array($n, 0, @array).

HISTORY

0.01

Initial version of the module

0.02

Fixed dependencies

0.03

Fixed dependencies in Makefile.PL

AUTHOR

Francesco Nidito

COPYRIGHT

Copyright 2018 Francesco Nidito. All rights reserved.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

Lingua::Sentence