Text::Ngram::LanguageDetermine - Guess the language of text using ngrams
use Text::Ngram::LanguageDetermine;
NOTE: First build some language profiles from source text. Such text is easily obtained from places like Wikipedia or elsewhere. The subject matter doesn't matter, but it should all be in the target language and saved in UTF-8 format.
my %lang_profiles = (
    english => create_language_profile(source_filename => 'english.txt'),
    french  => create_language_profile(source_filename => 'french.txt'),
    german  => create_language_profile(source_filename => 'german.txt'),
);
Get the profile of the text we're wondering about.
my $text_profile = create_text_profile(source_filename => 'query.txt');
Score the text profile against all of the language profiles.
my %scores = map {
    $_ => compare_profiles(language_profile   => $lang_profiles{$_},
                           comparison_profile => $text_profile)
} keys %lang_profiles;
The smallest score is the most likely answer. The score values themselves aren't relevant, only their ordering: lowest score = most likely, highest score = least likely.
print "Language is: " . (sort { $scores{$a} <=> $scores{$b} } keys %scores)[0] . "\n";
This module guesses what language a document is written in by comparing an ngram profile of the query text against ngram profiles built from a large sample text of each language.

It does this by calculating the most frequent ngrams of the sample text for a language, ranking them by frequency, then keeping only the most popular ngrams, which removes most subject-specific ngrams. It then compares the rank positions of the ngrams in the language sample to the rank positions of the ngrams in the query text to produce a score known as the "Out of Place" measure. This measure indicates how far the query text's ngrams are out of place with regard to a language's ngrams.

The language that produces the lowest "Out of Place" measure is most likely the language the text is written in.
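The rank comparison described above can be sketched in a few lines of Perl. This is a minimal illustration, not the module's own code: out_of_place() is a hypothetical helper, the tiny profiles are made up, ranks are assumed to start at 0, and the penalty for a missing ngram is assumed to be twice the size of the language profile.

```perl
use strict;
use warnings;

# Hypothetical helper: sum of rank displacements between a ranked list
# of text ngrams and a language profile (ngram => rank).
sub out_of_place {
    my ($text_ngrams, $lang_ranks, $not_found) = @_;

    # Assumed fallback penalty: twice the size of the language profile.
    $not_found = 2 * scalar(keys %$lang_ranks) unless defined $not_found;

    my $score = 0;
    for my $rank (0 .. $#$text_ngrams) {
        my $ngram = $text_ngrams->[$rank];
        $score += exists $lang_ranks->{$ngram}
            ? abs($lang_ranks->{$ngram} - $rank)   # how far out of place
            : $not_found;                          # ngram unknown to the language
    }
    return $score;
}

# Made-up toy profiles, ranks starting at 0 (an assumption).
my %english = (th => 0, he => 1, in => 2, er => 3);
my %german  = (en => 0, er => 1, ch => 2, de => 3);
my @query   = qw(th er he);   # query ngrams, most frequent first

my $english_score = out_of_place(\@query, \%english);
my $german_score  = out_of_place(\@query, \%german);
```

With these toy profiles the English score comes out lower than the German one, so English would be the best guess.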
This module was written after reading the paper "N-Gram-Based Text Categorization" by William B. Cavnar and John M. Trenkle, see: http://citeseer.csail.mit.edu/68861.html
Might be some.
Please contact the author via email with any patches or bug reports.
Rusty Conover CPAN ID: RCONOVER InfoGears Inc. rconover@infogears.com http://www.infogears.com
Copyright 2005 InfoGears Inc. http://www.infogears.com All Rights Reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The full text of the license can be found in the LICENSE file included with this module.
perl(1). Text::Ngram
Usage     : create_language_profile(source_filename => 'english.sample',
                                    destination_filename => 'english.profile',
                                    frequency_cutoff => 300,
                                    ngram_max_length => 5)
Purpose   : This function creates a language profile for future comparison
            to text. It is best to pass in a sample of the language of
            roughly 10k to 20k bytes. It reads that data, creates ngrams
            of every length from 1 to ngram_max_length, and calculates the
            frequency of each ngram throughout the entire text. After all
            of the ngrams have been created and sorted by frequency, it
            truncates the list to frequency_cutoff entries.

            The frequency_cutoff keeps only the ngrams that aren't subject
            specific to the text. 300 seems to be a good default, but it
            is open to tuning. ngram_max_length is the maximum length of
            an ngram to be generated. Again, this length is open to tuning,
            but 5 characters seems to be a good number.
Returns   : A hash reference with ngrams as the keys and their frequency
            rank as the values.
Argument  : source_filename - the filename from which to read the source text
            source_data - a scalar that contains the source text
            destination_filename - the filename in which to store the
                profile using Storable
            frequency_cutoff - the number of most frequent ngrams to keep
            ngram_max_length - the maximum length of the ngrams
Throws    : Uses Carp::confess to complain about errors.
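The ranking, cutoff, and storage steps described above can be sketched as follows. This is a hedged illustration only: the counts are made up, the cutoff of 3 is chosen to keep the example small, ranks are assumed to start at 0, and storing the rank hash directly with Storable's store() is an assumption about the on-disk layout, not something the documentation confirms.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempfile);

# Made-up ngram counts, standing in for counts gathered from a sample text.
my %count = (th => 9, he => 7, in => 5, er => 4, zq => 1);
my $frequency_cutoff = 3;    # tiny value for illustration only

# Rank by descending frequency (ties broken alphabetically),
# then keep only the top $frequency_cutoff entries.
my @ranked = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
splice @ranked, $frequency_cutoff if @ranked > $frequency_cutoff;

# Map each surviving ngram to its rank (0 = most frequent, an assumption).
my %profile = map { $ranked[$_] => $_ } 0 .. $#ranked;

# Round-trip the profile through Storable using a temporary file.
my ($fh, $path) = tempfile(UNLINK => 1);
close $fh;
store(\%profile, $path);
my $loaded = retrieve($path);
```

After the round trip, $loaded holds the same three ngram ranks as %profile, and the rare ngram 'zq' has been dropped by the cutoff.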
Usage     : create_text_profile(source_filename => 'interesting.txt',
                                ngram_max_length => 5)
Purpose   : This function creates the comparison profile for an arbitrary
            piece of text. It calculates all of the ngrams for the text
            and then sorts them by frequency.
Returns   : An array reference containing the ngrams sorted by frequency
            of occurrence in the passed text.
Argument  : source_filename - the filename from which to read the source text
            source_data - the data to use as the source, passed as a scalar
            ngram_max_length - the maximum ngram length
Throws    : Uses Carp::confess for errors.
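As a rough picture of what such a text profile contains, the sketch below counts character ngrams with a sliding window and returns them most frequent first. ranked_ngrams() is a hypothetical helper; the module's own tokenisation (for example, how it treats whitespace or word boundaries) is not documented here and may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: every character ngram of length 1..$max_length,
# returned as an array reference sorted by descending frequency.
sub ranked_ngrams {
    my ($text, $max_length) = @_;
    my %count;
    for my $n (1 .. $max_length) {
        # Slide a window of width $n across the text.
        for my $start (0 .. length($text) - $n) {
            $count{ substr($text, $start, $n) }++;
        }
    }
    # Most frequent first; ties broken alphabetically for a stable order.
    return [ sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count ];
}

my $profile = ranked_ngrams('banana', 2);
print join(' ', @$profile), "\n";    # prints "a an n na b ba"
```

For real input the text would first need to be decoded from UTF-8 so that substr() works on characters rather than bytes.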
Usage     : compare_profiles(comparison_profile => $compare_profile,
                             language_profile => $lang_profile)
Purpose   : This function compares a language profile to a text profile
            and calculates a score indicating how well the text's ngram
            frequencies match the language's. This is called the
            "Out-of-Place" measure.
Returns   : An integer score measuring how much the ngrams in the text
            profile are out of place with the ngrams in the language
            profile.
Argument  : comparison_profile - the comparison profile created by
                create_text_profile()
            language_profile - the language profile created by
                create_language_profile()
            ngram_not_found_distance - the distance value used for ngrams
                not found in the language profile. By default this is
                2 * the total number of ngrams in the language profile.
Throws    : Uses Carp::confess for bad arguments.
Comments  : To determine which language the text is written in, the best
            guess of this algorithm is the language with the lowest score
            returned by this function.