NAME

Text::Ngram::LanguageDetermine - Guess the language of text using ngrams

SYNOPSIS

  use Text::Ngram::LanguageDetermine;

NOTE: First build some language profiles from source text. Sample text is easily obtained from places like Wikipedia or elsewhere. The subject matter doesn't matter, but it should all be in the target language and saved in UTF-8 format.

  my %lang_profiles = (
      english => create_language_profile(source_filename => 'english.txt'),
      french  => create_language_profile(source_filename => 'french.txt'),
      german  => create_language_profile(source_filename => 'german.txt'),
  );

Get the profile of the text we're wondering about.

  my $text_profile = create_text_profile(source_filename => 'query.txt');

Score the text profile against all of the language profiles.

  my %scores = map {
      $_ => compare_profiles(language_profile   => $lang_profiles{$_},
                             comparison_profile => $text_profile)
  } keys %lang_profiles;

The smallest score is the most likely answer. The score values themselves aren't relevant, only their ordering: lowest score = most likely, highest score = least likely.

  print "Language is: " . (sort { $scores{$a} <=> $scores{$b} } keys %scores)[0] . "\n";

DESCRIPTION

This module guesses what language a document is written in by comparing an ngram profile of the query text against ngram profiles built from a large sample text of each language.

It does this by calculating the most frequent ngrams of the sample text for a language, ranking them by frequency, and keeping only the most popular ngrams, which removes most subject-specific ngrams. It then compares the positions of the ngrams in the language profile to the positions of the ngrams from the query text, also ranked by frequency, to produce a score called the "Out of Place" measure. This measure indicates how far the query text's ngrams are out of place with regard to a language's ngrams.

The language that produces the lowest "Out of Place" measure is most likely the language the text is written in.

This module was written after reading the paper "N-Gram-Based Text Categorization" by William B. Cavnar and John M. Trenkle, see: http://citeseer.csail.mit.edu/68861.html

BUGS

Might be some.

SUPPORT

Please contact the author by email with any patches or bug reports.

AUTHOR

        Rusty Conover
        CPAN ID: RCONOVER
        InfoGears Inc.
        rconover@infogears.com
        http://www.infogears.com

COPYRIGHT

Copyright 2005 InfoGears Inc. http://www.infogears.com All Rights Reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

The full text of the license can be found in the LICENSE file included with this module.

SEE ALSO

perl(1), Text::Ngram

create_language_profile

 Usage : create_language_profile(source_filename => 'english.sample',
 destination_filename => 'english.profile', frequency_cutoff => 300,
 ngram_max_length => 5).

 Purpose : This function creates a language profile for future
 comparison to text. It is best to pass in a good 10k to 20k byte
 sample of the language.  It reads that data, creates ngrams of
 lengths from 1 to ngram_max_length and calculates the frequency of
 each ngram throughout the entire text.  After all of the ngrams have
 been created and sorted by their frequency, it truncates the list to
 frequency_cutoff entries.

 The frequency_cutoff serves the purpose of only keeping the ngrams
 that aren't specific to the subject of the text. 300 seems to be a
 good default, but it's open to tuning.

 ngram_max_length is the maximum length of an ngram to be generated.
 Again this length is open to tuning, but 5 characters seems to be a
 good number.


 Returns : This function returns a hash reference of ngrams with
 ngrams as the key and their frequency rank as the value.

 Argument  :

 source_filename - the name of the file to read the source text from

 source_data - a scalar that contains the source text

 destination_filename - the filename where to store the profile using
 Storable.

 frequency_cutoff - the number of most frequent ngrams to keep in the
 profile

 ngram_max_length - the maximum length of the ngrams.

 Throws    : uses Carp::confess to complain about errors.
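
The steps described under Purpose can be sketched in plain Perl. This is a minimal illustration, not the module's actual implementation: the helper name sketch_language_profile, the sliding-window ngram extraction over the raw string, the alphabetical tie-break, and ranks starting at 1 are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::Ngram::LanguageDetermine.
# Counts every ngram of length 1..$max_len, keeps the $cutoff most
# frequent ones, and returns { ngram => rank } with rank 1 = most frequent.
sub sketch_language_profile {
    my ($text, $max_len, $cutoff) = @_;

    my %count;
    for my $len (1 .. $max_len) {
        # Slide a window of $len characters across the text.
        for my $pos (0 .. length($text) - $len) {
            $count{ substr($text, $pos, $len) }++;
        }
    }

    # Most frequent first; ties broken alphabetically (an assumption)
    # so that the result is deterministic.
    my @ranked = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;

    # Keep only the top $cutoff ngrams.
    splice(@ranked, $cutoff) if @ranked > $cutoff;

    my %profile;
    @profile{@ranked} = (1 .. scalar @ranked);
    return \%profile;
}

my $profile = sketch_language_profile('banana', 2, 3);
print "$_ => $profile->{$_}\n"
    for sort { $profile->{$a} <=> $profile->{$b} } keys %$profile;
```

For the toy input 'banana' with a cutoff of 3, the sketch keeps "a", "an" and "n" with ranks 1, 2 and 3. Real profiles would use a much larger sample and a cutoff near the suggested 300.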

create_text_profile

 Usage : create_text_profile(source_filename => 'interesting.txt',
 ngram_max_length => 5)

 Purpose : This function creates the comparison profile for an
 arbitrary piece of text. It calculates all of the ngrams for the text
 and then sorts them by frequency.

 Returns : An array reference containing the ngrams sorted by
 frequency of occurrence in the passed text.


 Argument  : 

 source_filename - The name of the file to read the source text from

 source_data - The data to use as the source passed as a scalar

 ngram_max_length - The maximum ngram length

 Throws    : uses Carp::confess for errors.
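
As an illustration of the arguments listed above, the following sketch accepts either source_data or source_filename and returns an array reference ordered by frequency. It is not the module's code: the helper names, the UTF-8 read layer, the fallback ngram_max_length of 5, and the alphabetical tie-break are assumptions.

```perl
use strict;
use warnings;
use Carp qw(confess);

# Hypothetical helper, not the module's code: returns the text given by
# source_data or read from source_filename, and confesses when neither
# argument is supplied.
sub sketch_read_source {
    my %args = @_;
    return $args{source_data} if defined $args{source_data};

    confess "need source_filename or source_data"
        unless defined $args{source_filename};

    # Assumption: the file is UTF-8, as the SYNOPSIS note recommends.
    open(my $fh, '<:encoding(UTF-8)', $args{source_filename})
        or confess "cannot open $args{source_filename}: $!";
    local $/;    # slurp the whole file
    my $text = <$fh>;
    close($fh);
    return $text;
}

# Hypothetical helper: array reference of ngrams, most frequent first.
sub sketch_text_profile {
    my %args    = @_;
    my $text    = sketch_read_source(%args);
    my $max_len = $args{ngram_max_length} || 5;    # assumed fallback

    my %count;
    for my $len (1 .. $max_len) {
        for my $pos (0 .. length($text) - $len) {
            $count{ substr($text, $pos, $len) }++;
        }
    }
    return [ sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count ];
}

my $profile = sketch_text_profile(source_data => 'abab', ngram_max_length => 2);
print join(', ', @$profile), "\n";
```

Unlike the language profile, nothing is truncated here: every ngram found in the text appears in the returned list, and its position in the array acts as its rank.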

compare_profiles

 Usage : compare_profiles(comparison_profile
 => $compare_profile, language_profile => $lang_profile)

 Purpose : This function compares a language profile to a text profile
 and calculates a score determining if the text's ngram frequency
 matches well with the language's frequency.  This is called the
 "Out-of-Place" measure.


 Returns : An integer score measuring how much the ngrams in the text
 profile are out of place with the ngrams in the language profile.

 Argument  : 

 comparison_profile - the comparison profile created by create_text_profile()

 language_profile - the language profile created by create_language_profile()

 ngram_not_found_distance - the distance value used for ngrams not
 found in the language profile. By default this is 2 * the total
 number of ngrams in the language profile.


 Throws : uses Carp::confess for bad arguments.

 Comments : To determine which language the text is written in, the
 best guess of this algorithm is the language with the lowest score
 returned by this function.
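
The "Out-of-Place" measure described above can be sketched as follows. This is a hedged illustration rather than the module's implementation: the helper name sketch_out_of_place, the positional arguments, ranks starting at 1, and the choice to let every ngram of the text profile contribute to the score are assumptions. Only the default not-found distance (twice the number of ngrams in the language profile) comes from the documentation.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's code. $language_profile is a
# { ngram => rank } hash reference and $text_profile an array reference
# of ngrams ordered most frequent first, matching the documented return
# types of the two profile functions.
sub sketch_out_of_place {
    my ($language_profile, $text_profile, $not_found) = @_;

    # Documented default: twice the number of ngrams in the language profile.
    $not_found = 2 * scalar(keys %$language_profile)
        unless defined $not_found;

    my $score = 0;
    for my $i (0 .. $#$text_profile) {
        my $text_rank = $i + 1;    # assumption: ranks start at 1
        my $lang_rank = $language_profile->{ $text_profile->[$i] };

        # Distance between the two ranks, or the penalty when the
        # ngram is missing from the language profile.
        $score += defined $lang_rank
            ? abs($lang_rank - $text_rank)
            : $not_found;
    }
    return $score;
}

my %language = (e => 1, t => 2, th => 3, a => 4);
my @query    = ('t', 'e', 'th', 'z');

print sketch_out_of_place(\%language, \@query), "\n";
```

With these toy profiles, "t" and "e" are each one position out of place, "th" matches exactly, and "z" is missing and costs 2 * 4 = 8, for a total of 10. A larger not-found distance penalizes unfamiliar ngrams more heavily.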