NAME
Lingua::EN::Segment - Split English-language domain names etc. into words
SYNOPSIS
    my $segmenter = Lingua::EN::Segment->new;
    for my $domain (<>) {
        chomp $domain;
        my @words = $segmenter->segment($domain);
        print "$domain: ", join(', ', @words), "\n";
    }
DESCRIPTION
Sometimes you have a string that to a human eye is clearly made up of many words glommed together without spaces or hyphens. This module uses some mild cunning and a large list of known words from Google to try and work out how the string should be split into words.
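The general technique (from the Norvig chapter mentioned under ACKNOWLEDGEMENTS) is roughly: try every possible first word, score each candidate split by the probability of its words, and recurse on the remainder with memoisation. The sketch below illustrates that idea only; it is not the module's actual implementation, and the %unigram_probability hash, the toy probabilities, and the best_split/score_of names are assumptions made for the example.

    use strict;
    use warnings;

    # Toy probabilities standing in for the module's real unigram data (assumed values).
    my %unigram_probability = (this => 0.01, is => 0.02, a => 0.03, test => 0.005);
    my %memo;

    # Score a candidate split as the product of its word probabilities,
    # with a tiny floor for unknown words.
    sub score_of {
        my ($words) = @_;
        my $p = 1;
        $p *= ($unigram_probability{$_} // 1e-10) for @$words;
        return $p;
    }

    # Try every possible first word, recurse on the remainder, and keep
    # the highest-scoring split; memoise so each substring is solved once.
    sub best_split {
        my ($string) = @_;
        return [] if $string eq '';
        return $memo{$string} if exists $memo{$string};
        my ($best_score, $best) = (0, [$string]);
        for my $i (1 .. length $string) {
            my $candidate = [substr($string, 0, $i), @{ best_split(substr($string, $i)) }];
            my $score = score_of($candidate);
            ($best_score, $best) = ($score, $candidate) if $score > $best_score;
        }
        return $memo{$string} = $best;
    }

    print join(' ', @{ best_split('thisisatest') }), "\n";   # this is a test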
new
Out: $segmenter
Returns a Lingua::EN::Segment object.
dist_dir
Out: $dist_dir
Returns the name of the directory where distribution-specific files are installed.
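For instance (the variable name is illustrative, and the exact location depends on how the distribution's shared files were installed):

    my $dist_dir = $segmenter->dist_dir;
    # e.g. a share directory installed alongside the distribution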
segment
In: $unsegmented_string
Out: @words
Supplied with an unsegmented string - e.g. a domain name - returns the list of words most statistically likely to make up that string.
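For example (the input string and the split shown are illustrative assumptions; the actual result depends on the bundled frequency data):

    my @words = $segmenter->segment('thisisatest');
    # Likely to return ('this', 'is', 'a', 'test'), although the exact
    # split depends on the word-frequency data.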
unigrams
Out: \%unigrams
Returns a hashref of word => likelihood of appearing in Google's huge list of words harvested from the Internet. The higher the likelihood, the more likely that this is a genuine, regularly-used word rather than an obscure word or a typo.
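A rough illustration of how you might poke at it (the words chosen and their relative values are assumptions; the actual numbers come from the bundled corpus data):

    my $unigrams = $segmenter->unigrams;
    for my $word (qw(the segment xyzzyish)) {
        printf "%-10s => %s\n", $word, $unigrams->{$word} // 'not found';
    }
    # Expect common words like 'the' to score much higher than rare
    # words or made-up strings.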
bigrams
Out: \%bigrams
As "unigrams", but returns a lookup table of "word1 word2" => likelihood for combinations of words.
ACKNOWLEDGEMENTS
This code is based on chapter 14 of the book Beautiful Data, written by Peter Norvig.