NAME

Text::NGrammer - Pure Perl extraction of n-grams and skip-grams

SYNOPSIS

 use Text::NGrammer;
 my $n = Text::NGrammer->new;
 
 # prints [ (a,rose) (rose,is) (is,a) (a,flower) ]
 my @ngrams = $n->ngrams_text(2, "a rose is a flower");
 print "[ ";
 for my $ngram (@ngrams) {
   print "(",$ngram->[0],",",$ngram->[1],") ";
 }
 print "]\n";
 
 # prints [ (a,is) (rose,a) (is,flower) ]
 my @skipgrams = $n->skipgrams_text(2, 1, "a rose is a flower");
 print "[ ";
 for my $skipgram (@skipgrams) {
   print "(",$skipgram->[0],",",$skipgram->[1],") ";
 }
 print "]\n";

DESCRIPTION

The module provides a way to extract both n-grams and skip-grams from a text, a sentence, or an array of tokens.

An n-gram is defined as an ordered sequence of n consecutive tokens in a piece of text. Some common n-grams have their own names: 2-grams, for example, are also called bigrams, and they represent all the ordered pairs of adjacent words in a text. For instance, the text "a rose is a flower" is composed of 4 bigrams: "a rose", "rose is", "is a", "a flower".

A skip-gram is defined as an ordered sequence of n tokens from a text with a predetermined interval k between them. For instance, the skip-grams with n=2 and k=1 for a piece of text are all the sequences of 2 tokens with 1 token skipped between them. The text "a rose is a flower" is composed of 3 skip-grams with n=2 and k=1: "a is", "rose a", "is flower". A skip-gram with k=0 is the same as an n-gram of the same size, e.g., a 2-skip-gram with k=0 is the same as a bigram.
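
The definitions above can be illustrated with a minimal, dependency-free sketch. The helpers skipgrams_sketch and show are hypothetical names and are not part of Text::NGrammer; the sketch also assumes that, for n > 2, the same interval k applies between every pair of consecutive tokens in a skip-gram, which the text above does not state explicitly.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT the module's implementation: returns a list of
# array references, each holding $n tokens picked every ($k + 1) positions.
# Assumption: for $n > 2 the interval $k is applied between all neighbours.
sub skipgrams_sketch {
    my ($n, $k, @tokens) = @_;
    return () if $n < 1 || $k < 0;

    my $step = $k + 1;                  # distance between picked tokens
    my $span = ($n - 1) * $step + 1;    # tokens covered by one skip-gram
    my @grams;
    # The range is empty when there are fewer than $span tokens.
    for my $start (0 .. @tokens - $span) {
        push @grams, [ map { $tokens[$start + $_ * $step] } 0 .. $n - 1 ];
    }
    return @grams;
}

# Hypothetical helper: formats a list of grams as "(x,y) (y,z) ...".
sub show {
    my @grams = @_;
    return join " ", map { sprintf "(%s)", join(",", @$_) } @grams;
}

my @tokens = split ' ', "a rose is a flower";

# n=2, k=1: prints (a,is) (rose,a) (is,flower)
print show(skipgrams_sketch(2, 1, @tokens)), "\n";

# n=2, k=0 gives the bigrams: prints (a,rose) (rose,is) (is,a) (a,flower)
print show(skipgrams_sketch(2, 0, @tokens)), "\n";
```

With k=0 the step between tokens is 1, so the sketch degenerates to a plain sliding window, which matches the statement that a skip-gram with k=0 is an n-gram.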

A broader, and better, discussion on n-grams and skip-grams can be found at https://en.wikipedia.org/wiki/N-gram.

Behind the scenes, the module uses the Lingua::Sentence module to split the text into sentences, so that the n-grams and skip-grams never cross sentence boundaries. The module also provides ways to extract the n-grams and skip-grams from a single sentence, i.e., without invoking Lingua::Sentence, or from an array of tokens if the application wants to use a custom tokenization for the text. The language to be used for the sentencer must be specified in the constructor; if not present, English is used by default.
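
The effect of respecting sentence boundaries can be sketched without any dependency. In the example below, naive_sentences is a hypothetical and deliberately simplistic stand-in for Lingua::Sentence, and bigrams_per_sentence is a hypothetical helper as well; neither reflects how Text::NGrammer actually tokenizes or handles punctuation.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Lingua::Sentence: a naive splitter that breaks
# after '.', '!' or '?' followed by whitespace. Real sentence splitting is
# far more careful (abbreviations, quotes, and so on).
sub naive_sentences {
    my ($text) = @_;
    return grep { length } split /(?<=[.!?])\s+/, $text;
}

# Hypothetical helper: bigrams are computed sentence by sentence, so no
# pair ever spans two sentences.
sub bigrams_per_sentence {
    my ($text) = @_;
    my @grams;
    for my $sentence (naive_sentences($text)) {
        (my $clean = $sentence) =~ s/[.!?]+$//;    # drop trailing punctuation
        my @tokens = split ' ', $clean;
        push @grams, [ @tokens[$_, $_ + 1] ] for 0 .. $#tokens - 1;
    }
    return @grams;
}

my @grams = bigrams_per_sentence("A rose is a flower. I like roses.");

# prints (A,rose) (rose,is) (is,a) (a,flower) (I,like) (like,roses)
print join(" ", map { sprintf "(%s)", join(",", @$_) } @grams), "\n";
```

The pair (flower,I) does not appear in the output because it would cross the boundary between the two sentences.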

All the methods return the n-grams and skip-grams as arrays of references to arrays of length n, where n is specified as a parameter of the method. Sentences or, more generally, pieces of text are not divided into n-grams or skip-grams if they are not long enough to perform the operation. For instance, asking for all the n-grams of length 4 for the sentence "I am Francesco" returns an empty array of 4-grams because there are only 3 tokens in the sentence.

 my $ngrammer = Text::NGrammer->new();
 
 my @ngrams = $ngrammer->ngrams_array(3, ("a", "b", "c", "d"));
 my $ngram = $ngrams[0]; # the first ngram
 print $ngram->[1]; # prints "b"
 
 my @empty = $ngrammer->ngrams_array(5, ("a", "b", "c", "d"));
 print "empty!" if (@empty == 0); # prints "empty!"

METHODS

new(%)

Creates a new Text::NGrammer object and returns it. The only parameter accepted by the constructor is the language for the sentencer. For instance, to create an NGrammer for German, the syntax is the following:

 my $german_ngrammer = Text::NGrammer->new(lang => 'de');

If no language is specified, English is assumed. The supported languages are the ones supported by Lingua::Sentence.

skipgrams_text($n, $k, $text)

Extracts all the skip-grams of length $n with interval $k from the $text. $text is broken into sentences by the module Lingua::Sentence in such a way that the skip-grams do not cross the sentence boundaries.

skipgrams_sentence($n, $k, $sentence)

Extracts all the skip-grams of length $n with interval $k from the $sentence. $sentence is not broken into sub-sentences but only into tokens representing single words.

skipgrams_array($n, $k, @array)

Extracts all the skip-grams of length $n with interval $k from the @array. Exactly as in the case of skipgrams_sentence, the module Lingua::Sentence is not used.

ngrams_text($n, $text)

Extracts all the n-grams of length $n from the $text. $text is broken into sentences by the module Lingua::Sentence in such a way that the n-grams do not cross the sentence boundaries. This is equivalent to skipgrams_text($n, 0, $text).

ngrams_sentence($n, $sentence)

Extracts all the n-grams of length $n from the $sentence. $sentence is not broken into sub-sentences but only into tokens representing single words. This is equivalent to skipgrams_sentence($n, 0, $sentence).

ngrams_array($n, @array)

Extracts all the n-grams of length $n from the @array. Exactly as in the case of ngrams_sentence, the module Lingua::Sentence is not used. This is equivalent to skipgrams_array($n, 0, @array).

HISTORY

0.01

Initial version of the module

0.02

Fixed dependencies

0.03

Fixed dependencies in Makefile.PL

0.04

Fixed a bug for n-grams and skip-grams with n > 2

0.05

Fixed test

0.06

Fixed meta.yml

AUTHOR

Francesco Nidito

COPYRIGHT

Copyright 2018 Francesco Nidito. All rights reserved.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

Lingua::Sentence