NAME

Lingua::EN::Segment - split English-language domain names etc. into words

SYNOPSIS

 my $segmenter = Lingua::EN::Segment->new;
 for my $domain (<>) {
     chomp $domain;
     my @words = $segmenter->segment($domain);
     print "$domain: ", join(', ', @words), "\n";
 }

DESCRIPTION

Sometimes you have a string that, to a human eye, is clearly made up of several words glommed together without spaces or hyphens. This module uses some mild cunning and a large list of known words from Google to try to work out how such a string should be split into words.
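The "mild cunning" is, roughly, Norvig's word-segmentation approach: try every split of the string into a first word plus a remainder, score each candidate by the product of its words' probabilities, and memoize. A self-contained toy sketch, with invented probabilities in place of the module's Google-derived tables (segment_demo, %unigram and $floor are illustrative names, not part of the module's API):

```perl
use strict;
use warnings;

# Toy unigram table: word => probability. Values are invented for
# illustration; the real module loads Google-derived counts.
my %unigram = (
    the => 0.05, quick => 0.001, brown => 0.001, fox => 0.0005,
    hello => 0.002, world => 0.003,
);
my $floor = 1e-9;    # probability assigned to unknown words
my %memo;

# Return the most probable segmentation of $string as an arrayref,
# via naive recursion with memoization.
sub segment_demo {
    my ($string) = @_;
    return [] unless length $string;
    return $memo{$string} if $memo{$string};

    my ($best_prob, $best) = (0, [$string]);
    for my $i (1 .. length $string) {
        my $first = substr($string, 0, $i);
        my $rest  = segment_demo(substr($string, $i));
        my $prob  = ($unigram{$first} // $floor);
        $prob *= ($unigram{$_} // $floor) for @$rest;
        ($best_prob, $best) = ($prob, [$first, @$rest])
            if $prob > $best_prob;
    }
    return $memo{$string} = $best;
}

print join(' ', @{ segment_demo('helloworld') }), "\n";   # hello world
```

Because splitting an unknown run into extra pieces multiplies in another $floor, the sketch prefers to keep unrecognised substrings whole rather than shatter them into fragments.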

new

 Out: $segmenter

Returns a Lingua::EN::Segment object.

dist_dir

 Out: $dist_dir

Returns the name of the directory where distribution-specific files are installed.

segment

 In: $unsegmented_string
 Out: @words

Supplied with an unsegmented string - e.g. a domain name - returns the list of words that is statistically most likely to make up that string.

unigrams

 Out: \%unigrams

Returns a hashref mapping word => likelihood of appearing in Google's huge list of words harvested from the Internet. The higher the likelihood, the more likely it is that this is a genuine, regularly-used word rather than an obscure word or a typo.

bigrams

 Out: \%bigrams

As with unigrams, but returns a lookup table of "word1 word2" => likelihood for two-word combinations.
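The point of bigram data is that it captures how likely one word is to follow another, which unigram frequencies alone cannot. A minimal sketch of that idea with invented counts (cond_prob and the toy tables are illustrative, not part of the module's API): the conditional probability P(word2 | word1) is the bigram count divided by the first word's unigram count.

```perl
use strict;
use warnings;

# Toy counts, invented for illustration; the real tables come from
# Google's n-gram data.
my %unigram = ( 'new' => 1000, 'york' => 200, 'newt' => 5 );
my %bigram  = ( 'new york' => 150 );

# Conditional probability P(word2 | word1); returns nothing when
# the bigram was never seen.
sub cond_prob {
    my ($w1, $w2) = @_;
    my $pair = $bigram{"$w1 $w2"} or return;
    return $pair / $unigram{$w1};
}

printf "%.3f\n", cond_prob('new', 'york');   # 0.150
```

Here "york" is a fairly rare word on its own, but very likely immediately after "new", which is exactly the signal a segmenter needs when choosing between candidate splits.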

ACKNOWLEDGEMENTS

This code is based on chapter 14 of Peter Norvig's book Beautiful Data.