Text::NGrammer - Pure Perl extraction of n-grams and skip-grams
    use Text::NGrammer;

    my $n = Text::NGrammer->new;

    # prints [ (a,rose) (rose,is) (is,a) (a,flower) ]
    my @ngrams = $n->ngrams_text(2, "a rose is a flower");
    print "[ ";
    for my $ngram (@ngrams) {
        print "(", $ngram->[0], ",", $ngram->[1], ") ";
    }
    print "]\n";

    # prints [ (a,is) (rose,a) (is,flower) ]
    my @skipgrams = $n->skipgrams_text(2, 1, "a rose is a flower");
    print "[ ";
    for my $skipgram (@skipgrams) {
        print "(", $skipgram->[0], ",", $skipgram->[1], ") ";
    }
    print "]\n";
The module provides a way to extract both n-grams and skip-grams from a text, a sentence or from an array of tokens.
An n-gram is defined as an ordered sequence of n consecutive tokens in a piece of text. Some frequently used n-grams have their own names: 2-grams, for example, are also called bigrams, and they represent all the ordered pairs of adjacent words in a text. For instance, the text "a rose is a flower" is composed of 4 bigrams: "a rose", "rose is", "is a", "a flower".
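The definition above can be sketched as a sliding window over a list of tokens. This is only an illustration of the idea, not the module's implementation: sliding_ngrams is a hypothetical helper, and the whitespace split is an assumed, simplistic tokenizer.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::NGrammer): returns every run of
# $n consecutive tokens as an array reference.
sub sliding_ngrams {
    my ($n, @tokens) = @_;
    my @ngrams;
    # The range is empty when there are fewer than $n tokens.
    for my $start (0 .. @tokens - $n) {
        push @ngrams, [ @tokens[$start .. $start + $n - 1] ];
    }
    return @ngrams;
}

my @bigrams = sliding_ngrams(2, split ' ', "a rose is a flower");
# prints (a,rose), (rose,is), (is,a), (a,flower)
print join(", ", map { "(" . join(",", @$_) . ")" } @bigrams), "\n";
```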
A skip-gram is defined as an ordered sequence of n tokens from a text in which the tokens are separated by a predetermined interval k. For instance, the skip-grams with n=2 and k=1 for a piece of text are all the sequences of 2 tokens with exactly 1 token skipped between them. The text "a rose is a flower" is composed of 3 skip-grams with n=2 and k=1: "a is", "rose a", "is flower". A skip-gram with k=0 is the same as an n-gram of the same size, e.g., a 2-skip-gram with k=0 is the same as a bigram.
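A minimal sketch of the skip-gram definition, again independent of the module's actual code. sliding_skipgrams is a hypothetical helper, and for n greater than 2 it assumes that the interval k applies between every pair of neighbouring tokens in the skip-gram.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::NGrammer): picks $n tokens that are
# $k + 1 positions apart, i.e. with $k tokens skipped between neighbours.
sub sliding_skipgrams {
    my ($n, $k, @tokens) = @_;
    my $step = $k + 1;
    my $span = ($n - 1) * $step;    # distance from the first to the last token
    my @grams;
    for my $start (0 .. $#tokens - $span) {
        push @grams, [ map { $tokens[$start + $_ * $step] } 0 .. $n - 1 ];
    }
    return @grams;
}

my @tokens = split ' ', "a rose is a flower";
for my $k (0, 1) {
    my @grams = sliding_skipgrams(2, $k, @tokens);
    print "k=$k: ", join(" ", map { "(" . join(",", @$_) . ")" } @grams), "\n";
}
# prints:
# k=0: (a,rose) (rose,is) (is,a) (a,flower)
# k=1: (a,is) (rose,a) (is,flower)
```

With k=0 the helper yields the same pairs as the bigrams listed earlier, which mirrors the equivalence stated in the text.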
A broader, and better, discussion on n-grams and skip-grams can be found at https://en.wikipedia.org/wiki/N-gram.
Behind the scenes, the module uses the Lingua::Sentence module to split the text into sentences, so that the n-grams and skip-grams never cross sentence boundaries. The module also provides ways to extract the n-grams and skip-grams from a single sentence, i.e., without invoking Lingua::Sentence, or from an array of tokens if the application wants to use a custom tokenization for the text. The language to be used for the sentencer must be specified in the constructor; if not present, English is used by default.
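To illustrate why sentence splitting matters, the sketch below computes bigrams one sentence at a time. It does not use Lingua::Sentence: naive_sentences is a deliberately crude, hypothetical stand-in that breaks after ".", "!" or "?", and the \w+ tokenizer is likewise an assumption made for the example.

```perl
use strict;
use warnings;

# Naive stand-in for a real sentence splitter such as Lingua::Sentence:
# break after '.', '!' or '?' when followed by whitespace.
sub naive_sentences {
    my ($text) = @_;
    return grep { length } split /(?<=[.!?])\s+/, $text;
}

# Bigrams computed one sentence at a time, so no pair spans two sentences.
sub bigrams_per_sentence {
    my ($text) = @_;
    my @bigrams;
    for my $sentence (naive_sentences($text)) {
        my @tokens = $sentence =~ /\w+/g;    # crude word tokenizer (assumption)
        push @bigrams, map { [ @tokens[$_, $_ + 1] ] } 0 .. $#tokens - 1;
    }
    return @bigrams;
}

my @bigrams = bigrams_per_sentence("A rose is a flower. It smells nice.");
# prints (A,rose) (rose,is) (is,a) (a,flower) (It,smells) (smells,nice)
print join(" ", map { "(" . join(",", @$_) . ")" } @bigrams), "\n";
```

Note that the pair (flower,It) never appears, because the two sentences are processed separately.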
All the methods return the n-grams and skip-grams as arrays of references to arrays of length n, where n is specified as a parameter of the method. Sentences or, more generally, pieces of text are not divided into n-grams or skip-grams if they are not long enough to perform the operation. For instance, asking for all the n-grams of length 4 for the sentence "I am Francesco" returns an empty array of 4-grams because there are only 3 tokens in the sentence.
    my $ngrammer = Text::NGrammer->new();

    my @ngrams = $ngrammer->ngrams_array(3, ("a", "b", "c", "d"));
    my $ngram = $ngrams[0];              # the first ngram
    print $ngram->[1];                   # prints "b"

    my @empty = $ngrammer->ngrams_array(5, ("a", "b", "c", "d"));
    print "empty!" if (@empty == 0);     # prints "empty!"
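The "not long enough" rule can also be expressed as simple arithmetic. The sketch below is an assumption-based illustration, not code from the module: expected_gram_count is a hypothetical helper, and for n greater than 2 it assumes that k tokens are skipped between every pair of neighbouring tokens.

```perl
use strict;
use warnings;

# Hypothetical helper: how many n-grams or skip-grams a list of $m tokens
# can yield for a given length $n and interval $k.
sub expected_gram_count {
    my ($n, $k, $m) = @_;
    my $window = ($n - 1) * ($k + 1) + 1;    # tokens covered by one gram
    my $count  = $m - $window + 1;
    return $count > 0 ? $count : 0;
}

print expected_gram_count(2, 0, 5), "\n";    # bigrams of "a rose is a flower": 4
print expected_gram_count(2, 1, 5), "\n";    # skip-grams with k=1 of the same text: 3
print expected_gram_count(4, 0, 3), "\n";    # 4-grams of "I am Francesco": 0
```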
new
Creates a new Text::NGrammer object and returns it. The only parameter accepted by the constructor is the language for the sentencer. For instance, to create an NGrammer for German, the syntax is the following:
my $german_ngrammer = Text::NGrammer->new(lang => 'de');
If no language is specified, English is assumed. The supported languages are the ones supported by Lingua::Sentence.
skipgrams_text($n, $k, $text)
Extracts all the skip-grams of length $n with interval $k from the $text. $text is broken into sentences by the module Lingua::Sentence in such a way that the skip-grams do not cross the sentence boundaries.
skipgrams_sentence($n, $k, $sentence)
Extracts all the skip-grams of length $n with interval $k from the $sentence. $sentence is not broken into sub-sentences but only into tokens representing single words.
skipgrams_array($n, $k, @array)
Extracts all the skip-grams of length $n with interval $k from the @array. Exactly as in the case of skipgrams_sentence, the module Lingua::Sentence is not used.
ngrams_text($n, $text)
Extracts all the n-grams of length $n from the $text. $text is broken into sentences by the module Lingua::Sentence in such a way that the n-grams do not cross the sentence boundaries. This is equivalent to skipgrams_text($n, 0, $text).
ngrams_sentence($n, $sentence)
Extracts all the n-grams of length $n from the $sentence. $sentence is not broken into sub-sentences but only into tokens representing single words. This is equivalent to skipgrams_sentence($n, 0, $sentence).
ngrams_array($n, @array)
Extracts all the n-grams of length $n from the @array. Exactly as in the case of ngrams_sentence, the module Lingua::Sentence is not used. This is equivalent to skipgrams_array($n, 0, @array).
Initial version of the module
Fixed dependencies
Fixed dependencies in Makefile.PL
Fixed a bug for n-grams and skip-grams with n > 2
Fixed test
Fixed meta.yml
Francesco Nidito
Copyright 2018 Francesco Nidito. All rights reserved.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
Lingua::Sentence