NAME
Lingua::EN::Segment - Split English-language domain names etc. into words
SYNOPSIS
    my $segmenter = Lingua::EN::Segment->new;
    for my $domain (<>) {
        chomp $domain;
        my @words = $segmenter->segment($domain);
        print "$domain: ", join(', ', @words), "\n";
    }
DESCRIPTION
Sometimes you have a string that to a human eye is clearly made up of many words glommed together without spaces or hyphens. This module uses some mild cunning and a large list of known words from Google to try and work out how the string should be split into words.
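The general technique (from the Norvig chapter mentioned under ACKNOWLEDGEMENTS) is roughly: try every possible first word, score each candidate split by the probability of its words, and recurse on the remainder with memoisation. The sketch below illustrates that idea only; it is not the module's actual implementation, and the %unigram_probability hash, the toy probabilities, and the best_split/score_of names are assumptions made for the example.

    use strict;
    use warnings;

    # Toy probabilities standing in for the module's real unigram data (assumed values).
    my %unigram_probability = (this => 0.01, is => 0.02, a => 0.03, test => 0.005);
    my %memo;

    # Score a candidate split as the product of its word probabilities,
    # with a tiny floor for unknown words.
    sub score_of {
        my ($words) = @_;
        my $p = 1;
        $p *= ($unigram_probability{$_} // 1e-10) for @$words;
        return $p;
    }

    # Try every possible first word, recurse on the remainder, and keep
    # the highest-scoring split; memoise so each substring is solved once.
    sub best_split {
        my ($string) = @_;
        return [] if $string eq '';
        return $memo{$string} if exists $memo{$string};
        my ($best_score, $best) = (0, [$string]);
        for my $i (1 .. length $string) {
            my $candidate = [substr($string, 0, $i), @{ best_split(substr($string, $i)) }];
            my $score = score_of($candidate);
            ($best_score, $best) = ($score, $candidate) if $score > $best_score;
        }
        return $memo{$string} = $best;
    }

    print join(' ', @{ best_split('thisisatest') }), "\n";   # this is a test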
new
Out: $segmenter
Returns a Lingua::EN::Segment object.
dist_dir
Out: $dist_dir
Returns the name of the directory where distribution-specific files are installed.
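For instance (the variable name is illustrative, and the exact location depends on how the distribution's shared files were installed):

    my $dist_dir = $segmenter->dist_dir;
    # e.g. a share directory installed alongside the distribution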
segment
In: $unsegmented_string
Out: @words
Supplied with an unsegmented string - e.g. a domain name - returns the list of words most statistically likely to make up that string.
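For example (the input string and the split shown are illustrative assumptions; the actual result depends on the bundled frequency data):

    my @words = $segmenter->segment('thisisatest');
    # Likely to return ('this', 'is', 'a', 'test'), although the exact
    # split depends on the word-frequency data.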
unigrams
Out: \%unigrams
Returns a hashref of word => likelihood of appearing in Google's huge list of words harvested from the Internet. The higher the likelihood, the more likely that this is a genuine, regularly-used word rather than an obscure word or a typo.
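A rough illustration of how you might poke at it (the words chosen and their relative values are assumptions; the actual numbers come from the bundled corpus data):

    my $unigrams = $segmenter->unigrams;
    for my $word (qw(the segment xyzzyish)) {
        printf "%-10s => %s\n", $word, $unigrams->{$word} // 'not found';
    }
    # Expect common words like 'the' to score much higher than rare
    # words or made-up strings.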
bigrams
Out: \%bigrams
As "unigrams", but returns a lookup table of "word1 word2" => likelihood for combinations of words.
ACKNOWLEDGEMENTS
This code is based on chapter 14 of the book Beautiful Data, written by Peter Norvig.