Lingua::EN::NameLookup - a simple dictionary search and manipulation class.

Synopsis

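        use Lingua::EN::NameLookup;
        my $dict = Lingua::EN::NameLookup->new;
        $dict->load("mydict.dat");          # read a previously dumped dictionary
        my $res = $dict->lookup("FOO");     # case sensitive search
        $res = $dict->ilookup("Foo");       # case insensitive search
        $dict->add("Bar");
        $dict->dump("mynewdict.dat");       # save for a later load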

Description

This class provides the ability to search and manipulate a simple dictionary. It was originally designed for checking surnames encountered during the preparation of census indices. It works best with small data sets and where the names in the data set generate many distinct soundex values. The dictionary is held entirely in memory, so memory usage grows with the number of names.

Technique

Here's how data is stored in the dictionary:

First, the soundex value of the name is calculated. If the hash has no key for that soundex, the name is stored as a one-element array. If the hash already has a key for that soundex, the name is added to the end of the existing array, and the array is then sorted and stored back in the hash. Hence, for a name such as BARLOW, we might have the following in the hash:

B640 => (BARIL, BARLEY, BARLOW, BERLE,...)
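The storage scheme described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the module's own code: soundex_code, add_name and %dict are hypothetical names, and soundex_code is a simplified pure-Perl soundex that treats H and W like vowels.

```perl
use strict;
use warnings;

# Simplified soundex (assumption: H and W are treated like vowels).
sub soundex_code {
    my ($name) = @_;
    my $s = uc $name;
    $s =~ tr/A-Z//cd;                      # keep letters only
    return '' unless length $s;
    my $first = substr($s, 0, 1);
    $s =~ tr/AEHIOUWYBFPVCGJKQSXZDTLMNR/00000000111122222222334556/;
    my $lead = substr($s, 0, 1);
    $s =~ s/^\Q$lead\E+//;                 # drop the first letter's own code
    $s =~ tr/0-6//s;                       # squeeze runs of the same code
    $s =~ tr/0//d;                         # discard vowel placeholders
    return substr($first . $s . '000', 0, 4);
}

my %dict;    # soundex code => reference to a sorted array of names

# Hypothetical helper: store a name under its soundex, keeping the array sorted.
sub add_name {
    my ($name) = @_;
    my $code = soundex_code($name);
    push @{ $dict{$code} }, $name;         # creates the array if the key is new
    @{ $dict{$code} } = sort @{ $dict{$code} };
    return;
}

add_name($_) for qw(BARLOW BERLE BARIL BARLEY SMITH);
print "$_ => (@{ $dict{$_} })\n" for sort keys %dict;
```

Re-sorting after every addition keeps the sketch short. Since the array is already sorted before each addition, inserting the name at its correct position would give the same result.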

Here's how we look up a name:

First, the soundex of the name is calculated. If there is no key in the hash with that soundex, the name is not in the dictionary. If there is such a key, the array is retrieved and searched for the name. Because the array is sorted, the search can terminate as soon as it reaches an element greater than the name being searched for, since the name cannot then appear later in the array. This speeds things up when the individual arrays are large.
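A minimal sketch of that lookup, assuming the same hash of sorted arrays. The names lookup_name, soundex_code and %dict are hypothetical, and the soundex here is a simplified version in which H and W are treated like vowels. It is not taken from the module's source.

```perl
use strict;
use warnings;

# Simplified soundex (assumption: H and W are treated like vowels).
sub soundex_code {
    my ($name) = @_;
    my $s = uc $name;
    $s =~ tr/A-Z//cd;
    return '' unless length $s;
    my $first = substr($s, 0, 1);
    $s =~ tr/AEHIOUWYBFPVCGJKQSXZDTLMNR/00000000111122222222334556/;
    my $lead = substr($s, 0, 1);
    $s =~ s/^\Q$lead\E+//;
    $s =~ tr/0-6//s;
    $s =~ tr/0//d;
    return substr($first . $s . '000', 0, 4);
}

# Example data: each array is already in sorted order.
my %dict = (
    B640 => [qw(BARIL BARLEY BARLOW BERLE)],
    S530 => [qw(SMITH)],
);

# Hypothetical helper: returns 1 if the name is present, 0 otherwise.
sub lookup_name {
    my ($name) = @_;
    my $bucket = $dict{ soundex_code($name) } or return 0;   # no such soundex
    for my $candidate (@$bucket) {
        return 1 if $candidate eq $name;
        return 0 if $candidate gt $name;   # sorted, so the name cannot follow
    }
    return 0;
}

for my $name (qw(BARLOW BARLEE JONES)) {
    print "$name: ", (lookup_name($name) ? "found" : "not found"), "\n";
}
```

Here BARLEE shares the soundex B640 but the scan stops at BARLEY, the first element that sorts after it.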

Methods

new

Creates a dictionary object and initialises it (to be empty). Options are passed as keyword value pairs. Recognised options are:

lookup($name)

Looks up the name in the dictionary. Returns true if it is found or false if it is not found.

ilookup($name)

Looks up the name in the dictionary using a case insensitive match. Returns true if it is found or false if it is not found. Not as efficient as lookup.
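One plausible reason for the lower efficiency, assumed here rather than taken from the module's source, is that the arrays are sorted case sensitively, so a case insensitive search cannot stop early and has to examine every element. The sketch below shows such a scan over a single array; bucket_has_ci is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical helper: case insensitive search of one soundex array.
sub bucket_has_ci {
    my ($bucket, $name) = @_;
    my $want = lc $name;
    for my $candidate (@$bucket) {
        # No early exit: the case sensitive sort order says nothing
        # about where a differently cased match might be.
        return 1 if lc($candidate) eq $want;
    }
    return 0;
}

# Default sort places upper case letters before lower case ones.
my @bucket = sort qw(barley BARLOW Berle BARIL);

for my $name (qw(Barley BERLE Barlee)) {
    print "$name: ", (bucket_has_ci(\@bucket, $name) ? "found" : "not found"), "\n";
}
```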

add($name)

Adds one name to the dictionary. Probably called after lookup q.v. has failed to find a name.

dump($file)

Dumps the dictionary to a file suitable for subsequent reading by load q.v. Each line of the file looks like:

soundex name1:name2:name3...

If the file cannot be opened for writing then this method will croak.
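As an illustration of the file format, a dump could be written along these lines. This is a sketch, not the module's implementation: dump_dict is a hypothetical helper, and a single space between the soundex and the names is assumed from the example line above.

```perl
use strict;
use warnings;
use Carp;
use File::Temp qw(tempfile);

# Hypothetical helper: writes one line per soundex key in the form
# "soundex name1:name2:name3", croaking if the file cannot be opened.
sub dump_dict {
    my ($dict, $file) = @_;
    open my $fh, '>', $file
        or croak "Cannot open $file for writing: $!";
    for my $code (sort keys %$dict) {
        print {$fh} $code, ' ', join(':', @{ $dict->{$code} }), "\n";
    }
    close $fh or croak "Cannot close $file: $!";
    return;
}

my %dict = (
    B640 => [qw(BARIL BARLEY BARLOW BERLE)],
    S530 => [qw(SMITH)],
);

# Write to a temporary file and read it back to show the result.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;
dump_dict(\%dict, $path);

open my $in, '<', $path or die "Cannot reopen $path: $!";
my @lines = <$in>;
close $in;
print @lines;
```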

load($file)

Loads the dictionary from a file produced by dump q.v. This is more efficient than using the init method because the soundex of each name does not have to be recalculated. Each line of the file looks like:

soundex name1:name2:name3...

If the file cannot be opened for reading then this method will croak.
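Reading such a file back needs only string splitting, which is why no soundex calculation is required. The sketch below parses the format from an in-memory filehandle; parse_dump is a hypothetical helper, and it assumes the names on each line were already sorted when the file was written.

```perl
use strict;
use warnings;

# Hypothetical helper: parses lines of the form "soundex name1:name2:name3"
# into a hash of soundex => array reference.
sub parse_dump {
    my ($fh) = @_;
    my %dict;
    while (my $line = <$fh>) {
        chomp $line;
        my ($code, $names) = split ' ', $line, 2;   # soundex, then the rest
        next unless defined $names;                  # skip blank or short lines
        $dict{$code} = [ split /:/, $names ];
    }
    return \%dict;
}

# An in-memory file stands in for a real dump file.
my $text = "B640 BARIL:BARLEY:BARLOW:BERLE\nS530 SMITH\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my $dict = parse_dump($fh);
close $fh;

print "$_ => (@{ $dict->{$_} })\n" for sort keys %$dict;
```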

init($file)

Initialise the dictionary from a file containing one name on each line.

If the file cannot be opened for reading then this method will croak.

print

Produce a human readable form of the dictionary on standard output. This method was originally designed for debugging but may have other uses.

report

Returns a list containing the number of keys in the hash, the number of names in the hash and the length of the longest hash entry. This method was originally designed for performance testing but may have other uses.
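The three figures could be computed roughly as shown below. This is a sketch only: report_stats is a hypothetical helper, and "length of the longest hash entry" is assumed to mean the number of names in the largest array.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (number of keys, number of names,
# size of the largest array) for a hash of soundex => array reference.
sub report_stats {
    my ($dict) = @_;
    my ($names, $longest) = (0, 0);
    for my $bucket (values %$dict) {
        my $count = @$bucket;              # names stored under this soundex
        $names += $count;
        $longest = $count if $count > $longest;
    }
    return (scalar(keys %$dict), $names, $longest);
}

my %dict = (
    B640 => [qw(BARIL BARLEY BARLOW BERLE)],
    S530 => [qw(SMITH)],
);

my ($keys, $names, $longest) = report_stats(\%dict);
print "$keys keys, $names names, longest entry $longest\n";
```

A large third figure relative to the second suggests that many names share a soundex, which is the case where lookups are slowest.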

Copyright

Copyright (c) 2002 Pete Barlow <pbarlow@cpan.org>. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
