Ngram Statistics Package Todo list


The following list describes some of the features that we'd like to include in NSP in future. No particular priority is assigned to these items - they are all things we've discussed amongst ourselves or with users and agree would be good to add.

If you have additional ideas, or would like to comment on something on the current list, please let me know at


Right now all the ngrams being counted are stored in memory. Each ngram is an element in a hash. This is fine for corpora of up to a few million words, but beyond that things really slow down. We would like to pursue the idea of using suffix arrays (a compact alternative to suffix trees), which would greatly improve space utilization.

The use of suffix arrays for counting term frequencies is based on:

Yamamoto, M. and Church, K. (2001) “Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus,” Computational Linguistics, 27(1), pp. 1-30, MIT Press.

Find the article at:

In fact, they even provide a C implementation:

However, we would convert this into Perl and may need to modify it somewhat to fit into NSP.
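To make the idea concrete, here is a minimal sketch of counting ngram frequencies with a token-level suffix array. It is written in Python purely for illustration (not Perl, and not the C implementation mentioned above). The function names are hypothetical, and the construction is a naive sort, not the efficient algorithm from the paper.

```python
from bisect import bisect_left, bisect_right

def build_suffix_array(tokens):
    """Return token start positions, sorted by the suffix that starts there."""
    # Naive construction (assumption: small input). Sorting slices is simple
    # but far slower than the construction described by Yamamoto and Church.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def ngram_frequency(tokens, suffix_array, ngram):
    """Count occurrences of ngram (a sequence of tokens) by binary search."""
    n = len(ngram)
    target = list(ngram)
    # Truncating sorted suffixes to n tokens keeps them in sorted order,
    # so all occurrences of the ngram form one contiguous run.
    # Materializing the prefixes costs memory; a real implementation would
    # compare against the corpus in place.
    prefixes = [tokens[i:i + n] for i in suffix_array]
    return bisect_right(prefixes, target) - bisect_left(prefixes, target)

corpus = "to be or not to be".split()
suffixes = build_suffix_array(corpus)
to_be_count = ngram_frequency(corpus, suffixes, ["to", "be"])
```

The point of the structure is that one sorted array of positions answers frequency queries for ngrams of any length, without storing each ngram as a separate hash key.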

Another alternative would be to simply modify the program such that rather than using memory it used disk space to accumulate counts. This would be very slow but might suffice for certain situations.
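As a rough sketch of the disk-based alternative, counts could be accumulated in an on-disk table instead of an in-memory hash. The example below uses Python and SQLite only for illustration. The table layout, the function name, and the use of "<>" as a key separator are assumptions, not an NSP design.

```python
import sqlite3

def count_bigrams_on_disk(tokens, db_path):
    """Accumulate bigram counts in an SQLite table stored at db_path."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS counts ("
        "ngram TEXT PRIMARY KEY, freq INTEGER NOT NULL)"
    )
    for left, right in zip(tokens, tokens[1:]):
        key = left + "<>" + right  # hypothetical key format
        # Create the row on first sight, then increment it.
        conn.execute("INSERT OR IGNORE INTO counts VALUES (?, 0)", (key,))
        conn.execute(
            "UPDATE counts SET freq = freq + 1 WHERE ngram = ?", (key,)
        )
    conn.commit()
    return conn
```

Passing a file name as db_path keeps the counts on disk. Every update touches the database, which is the source of the slowdown mentioned above.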

Regardless of the changes we make to counting, we would continue to support counting in memory, which is perfectly adequate for smaller corpora.


The web is a huge source of text, and we could get counts for words or ngrams from the web (probably using something like the Perl LWP module).

Rather than running on a particular body of text (as is the case now), we'd probably have to run count such that it looked up counts for a specific set of words as found on the web. Simply running on the entire web wouldn't really make sense. So perhaps we would run count on one sample to get a list of the word types/ngrams that we are interested in, and then run count on the web to find out their respective counts.

[Our interest in this has been inspired by both Peter Turney (ACL-02 paper) and Frank Keller (EMNLP-02 paper).]


NSP is geared for the Roman alphabet. Perl has increasingly better Unicode support with each passing release, and we will incorporate Unicode support in future. We attempted to use the Unicode features in Perl 5.6, but found them to be incomplete. We have not yet attempted this with Perl 5.8 (the now current version) but it is said to be considerably better.

Perl support for Unicode will include language / alphabet specific definitions of regular expression character classes like \d+ or \w+ (digits and word characters). So, in theory, you should be able to use the same regular expression definitions with any alphabet and have them match in a way that makes sense for that language.
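The behaviour described here can be illustrated with an analogous example in Python, whose regular expressions treat \w and \d as Unicode-aware for text strings. This is only an analogy. It says nothing about how any particular Perl release behaves, and the sample text is made up.

```python
import re

# Latin, Cyrillic, ASCII digits, and Devanagari digits in one string.
text = "word слово 42 १२३"

# The same two definitions are applied regardless of alphabet.
words = re.findall(r"\w+", text)   # runs of word characters
digits = re.findall(r"\d+", text)  # runs of decimal digits
```

Here the Cyrillic word is matched by \w+ and the Devanagari number by \d+ without changing the token definitions.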

Our expertise in this area is fairly limited, so please let us know if we are missing something obvious or misunderstanding what Perl is attempting to do.


When processing large files, the program gives no indication of how much of the file has been processed, or even whether it is still making progress. A "progress meter" could show how much of the file has been processed, or how many ngrams have been counted, or something else to indicate that progress is being made.
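One possible shape for such a progress meter is sketched below in Python, for illustration only. The function name, the reporting interval, and the message wording are all assumptions.

```python
import sys

def count_tokens_with_progress(lines, total_bytes, report_every=100000,
                               report=None):
    """Count whitespace-separated tokens, reporting progress periodically.

    lines is an iterable of byte strings (for example a file opened in
    binary mode) and total_bytes is the size of the whole input.
    """
    if report is None:
        report = lambda message: print(message, file=sys.stderr)
    bytes_seen = 0
    tokens_seen = 0
    next_report = report_every
    for line in lines:
        bytes_seen += len(line)
        tokens_seen += len(line.split())
        if tokens_seen >= next_report:
            percent = 100.0 * bytes_seen / total_bytes if total_bytes else 100.0
            report("%d tokens counted, about %.0f%% of input read"
                   % (tokens_seen, percent))
            # Schedule the next message relative to the current position.
            next_report = tokens_seen + report_every
    return tokens_seen
```

Writing the messages to STDERR by default keeps them out of any count output written to STDOUT.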


If the program encounters a very long line of text (with literally thousands and thousands of words on a single line) it may operate very slowly. It would be good to let a user know that an overly long line is being processed (we'd need to define more precisely what "overly long" is). This fits into the progress meter mentioned above, and would let a user decide whether to continue, or to terminate processing and reformat the input file.

GENERALIZE --newLine in

The --newLine switch tells the program that Ngrams may not cross over end of line markers. Presumably this would be used when each line of text consists of a sentence (thus the end of a line also marks the end of a sentence). However, if the text is not formatted this way, and there may be multiple sentences per line or sentences that extend across several lines, we may want to generalize --newLine to include other characters that Ngrams would not be allowed to cross.

For example, we could have the switch --dontCross "\n\.,;\?" which would prevent ngrams from crossing the newline, the full stop, the comma, the semicolon and the question mark.
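A minimal sketch of the proposed behaviour, in Python for illustration: split the text at every boundary character first, then form ngrams only inside each piece. The function name and the whitespace tokenization are assumptions. Note that this naive version also treats the full stop in abbreviations and decimals as a boundary.

```python
import re

def ngrams_within_boundaries(text, n=2, dont_cross="\n.,;?"):
    """Return ngrams (as tuples) that never span a character in dont_cross."""
    # Build a character class from the boundary characters.
    boundary = re.compile("[" + re.escape(dont_cross) + "]")
    ngrams = []
    for segment in boundary.split(text):
        tokens = segment.split()
        # Ngrams are formed only from tokens inside one segment.
        for i in range(len(tokens) - n + 1):
            ngrams.append(tuple(tokens[i:i + n]))
    return ngrams

sample = "the cat sat. the dog, it ran\nfast now"
bigrams = ngrams_within_boundaries(sample)
```

In the sample, "sat the" and "dog it" are not counted because a full stop and a comma lie between the two words.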


Our current --recurse option creates a single count output file for all the words in all the texts found in a directory structure. We might want to be able to process all the files in a directory structure such that each file is treated separately and a separate count file is created for it.

For example, suppose we have the directory txts that contains the files text1 and text2. Running with

 --recurse output txts

produces a single file, output, that consists of the combined counts from txts/text1 and txts/text2.

This new option would count these files separately and produce separate count output files.
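The per-file behaviour might look something like the following Python sketch, which walks a directory tree and writes one count file next to each input file. The ".cnt" suffix, the function name, the whitespace tokenization, and the simplified output format are all hypothetical.

```python
import os
from collections import Counter

def count_files_separately(root, n=2, suffix=".cnt"):
    """Write a separate ngram count file for every file under root."""
    written = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                continue  # skip count files left by an earlier run
            source = os.path.join(dirpath, name)
            with open(source, encoding="utf-8") as handle:
                tokens = handle.read().split()
            # zip over n shifted copies of the token list yields the ngrams.
            counts = Counter(zip(*(tokens[i:] for i in range(n))))
            target = source + suffix
            with open(target, "w", encoding="utf-8") as out:
                for ngram, freq in counts.most_common():
                    out.write("<>".join(ngram) + "<>" + str(freq) + "\n")
            written.append(target)
    return written
```

With the directory from the example above, this would produce txts/text1.cnt and txts/text2.cnt instead of one combined file.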


What about having a frequency cutoff that removed any ngrams that occur more than some number of times? The idea here would be to eliminate high frequency ngrams not through the use of a stoplist but rather through a frequency cutoff, based on the presumption that most very high frequency ngrams will be made up of stop words.

What about a percentage cutoff? In other words, eliminate some percentage of the least (or most) frequent ngrams?
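Both cutoffs could be applied to a table of counts along the following lines. This is a Python sketch for illustration only. The function and parameter names are hypothetical, and only the "most frequent" direction of the percentage cutoff is shown.

```python
def apply_frequency_cutoffs(counts, max_freq=None, drop_top_percent=0.0):
    """Return a copy of counts with high frequency ngrams removed.

    max_freq removes ngrams occurring more than max_freq times.
    drop_top_percent then removes that percentage of the remaining
    ngram types, starting from the most frequent.
    """
    kept = {ngram: freq for ngram, freq in counts.items()
            if max_freq is None or freq <= max_freq}
    if drop_top_percent > 0:
        ranked = sorted(kept, key=kept.get, reverse=True)
        n_drop = int(len(ranked) * drop_top_percent / 100.0)
        for ngram in ranked[:n_drop]:
            del kept[ngram]
    return kept

example_counts = {"of the": 50, "in the": 40, "red car": 3, "blue sky": 2}
```

The percentage here is taken over ngram types, not over total occurrences. Which of the two is more useful would need to be decided.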


It would be useful to allow NSP to automatically create a stoplist based on a combination of frequency counts and/or scores like tf/idf. While tf/idf depends on the idea of a document, we could simply split a large corpus into 100-token pieces, treat each piece as a document, and treat as stop words those words that occur in at least some number of these chunks.
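The chunking idea can be sketched as follows, in Python for illustration. The threshold is expressed here as a fraction of chunks. That choice, like the function and parameter names, is an assumption.

```python
from collections import Counter

def stoplist_from_chunks(tokens, chunk_size=100, min_fraction=0.5):
    """Return the words that occur in at least min_fraction of the chunks."""
    chunks = [tokens[i:i + chunk_size]
              for i in range(0, len(tokens), chunk_size)]
    doc_freq = Counter()
    for chunk in chunks:
        # Each chunk plays the role of a document: count a word once per chunk.
        doc_freq.update(set(chunk))
    threshold = min_fraction * len(chunks)
    return {word for word, df in doc_freq.items() if df >= threshold}

tiny_corpus = ["the", "cat", "the", "dog", "the", "bird", "a", "fish"]
stop_words = stoplist_from_chunks(tiny_corpus, chunk_size=2, min_fraction=0.75)
```

In the tiny example, "the" appears in three of the four chunks and is the only word that reaches the threshold.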


Right now and operate such that the output file is designated first, followed by the input file.

For example, outputfile inputfile

However, there are advantages to allowing a user to redirect input and output, particularly in the Unix and MS-DOS world. (As Derek Jones pointed out to us, if we have Windows users they are probably looking for a GUI, and they won't find much!) Redirection would enable the use of syntax such as

 input > out

 cat input | > outfile

which would help in building scripts, etc.
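A filter-style interface of this kind could be structured as in the Python sketch below, which reads from STDIN and writes to STDOUT unless other streams are supplied. The function name and the simplified output format are hypothetical.

```python
import sys
from collections import Counter

def count_stream(source=None, sink=None, n=2):
    """Read text from source, write ngram counts to sink.

    Defaults to STDIN and STDOUT so the program can sit in a pipeline.
    """
    source = sys.stdin if source is None else source
    sink = sys.stdout if sink is None else sink
    tokens = source.read().split()
    counts = Counter(zip(*(tokens[i:] for i in range(n))))
    # Most frequent first, ties broken alphabetically for stable output.
    for ngram, freq in sorted(counts.items(),
                              key=lambda item: (-item[1], item[0])):
        sink.write("<>".join(ngram) + "<>%d\n" % freq)
```

A script that calls count_stream() with no arguments can then be used with shell redirection and pipes, while the "old" style could still be supported by opening the named files and passing them in.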


Rather than have the user set paths by hand, have a script that would ask the user questions to set things up properly. This might be especially useful if we want to maintain the "old" style of output and input file specification (see the point above) as well as STDIN and STDOUT. (Maybe a user could pick which one?) In addition, there may be other options that a user could specify this way (such as a default token definition, home directory, etc.)

EXTEND to Ngrams

At present is only able to count bigrams. It would be very useful to extend it so that it could count Ngrams in general. Also, there is no support for windowing provided at present, so the bigrams it counts must be adjacent. It would be desirable to support windowing for bigrams and Ngrams generally.
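Windowing for Ngrams in general can be sketched as below, in Python for illustration. The assumption is that an ngram starts at a given token and draws its remaining tokens, in order, from the rest of a window of the given size. The function name is hypothetical.

```python
from itertools import combinations

def windowed_ngrams(tokens, n=2, window=None):
    """Return ngrams whose tokens all fall inside a window of given size.

    With window equal to n (the default) only adjacent ngrams are produced.
    """
    window = n if window is None else window
    if window < n:
        raise ValueError("window must be at least n")
    result = []
    for start in range(len(tokens)):
        # Tokens that may follow the first token inside this window.
        rest = tokens[start + 1:start + window]
        # combinations preserves the original token order.
        for tail in combinations(rest, n - 1):
            result.append((tokens[start],) + tail)
    return result

pairs = windowed_ngrams(["a", "b", "c"], n=2, window=3)
```

With a window of 3, the bigram ("a", "c") is counted even though one token intervenes.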


At present all programs simply exit when they encounter an error. We will return an error code that can be detected by the calling program, so that abnormal termination is clear. This affects and particularly, but will also be changed in, and
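The general pattern is sketched below in Python, for illustration only. The specific code values and the argument order (output first, then input) are assumptions for the example.

```python
import sys

# Hypothetical exit codes; any nonzero value signals abnormal termination.
EXIT_OK = 0
EXIT_IO_ERROR = 1
EXIT_USAGE = 2

def main(argv):
    """Count tokens in INPUT and write the total to OUTPUT."""
    if len(argv) != 2:
        print("usage: OUTPUT INPUT", file=sys.stderr)
        return EXIT_USAGE
    output_path, input_path = argv
    try:
        with open(input_path, encoding="utf-8") as source:
            token_total = len(source.read().split())
        with open(output_path, "w", encoding="utf-8") as sink:
            sink.write("%d\n" % token_total)
    except OSError as error:
        print("error: %s" % error, file=sys.stderr)
        return EXIT_IO_ERROR
    return EXIT_OK

# A script would end with: sys.exit(main(sys.argv[1:]))
```

A calling program or shell script can then test the exit status and stop a pipeline when a step fails.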


There is a certain amount of redundant code in, and It would be useful to make these more modular, to allow for inheritance and code sharing, as well as the use of objects (potentially).

make program distribution ready (Amruta). This program takes marginal totals as input, and produces the internal cell counts that led to these marginals (but that are not shown in NSP output). This is useful for converting output for testing in another system (SAS, for example).
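For bigrams the conversion is simple arithmetic on a 2x2 contingency table, sketched here in Python. The names n11, n1p, np1 and npp (joint count, the two marginal totals, and the total number of bigrams) are assumed for this example. The function is hypothetical and is not the program referred to above.

```python
def cell_counts_from_marginals(n11, n1p, np1, npp):
    """Recover the four internal cells of a 2x2 table.

    n11: times word1 and word2 occur together
    n1p: times word1 occurs in the first position
    np1: times word2 occurs in the second position
    npp: total number of bigrams
    """
    n12 = n1p - n11              # word1 followed by something else
    n21 = np1 - n11              # something else followed by word2
    n22 = npp - n1p - np1 + n11  # neither word in its position
    if min(n11, n12, n21, n22) < 0:
        raise ValueError("marginal totals are inconsistent")
    return ((n11, n12), (n21, n22))

table = cell_counts_from_marginals(10, 30, 20, 100)
```

The four cells always sum to npp, which gives a quick consistency check before passing the table to another system.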


Ted Pedersen,

Last updated 06/21/06 by TDP



 home page:

 mailing list:


Copyright (C) 2000-2006 Ted Pedersen

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.

Note: a copy of the GNU Free Documentation License is available on the web at and is included in this distribution as FDL.txt.
