The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Tree::Trie - An implementation of the Trie data structure in Perl

SYNOPSIS

 use Tree::Trie;
 use strict;

 my($trie) = new Tree::Trie;
 $trie->add(qw[aeode calliope clio erato euterpe melete melpomene mneme 
   polymnia terpsichore thalia urania]);
 my(@all) = $trie->lookup("");
 my(@ms)  = $trie->lookup("m");
 $" = "--";
 print "All muses: @all\nMuses beginning with 'm': @ms\n";
 my(@deleted) = $trie->remove(qw[calliope thalia doc]);
 print "Deleted muses: @deleted\n";
 

DESCRIPTION

This module implements a trie data structure. The term "trie" comes from the word retrieval, but is generally pronounced like "try". A trie is a tree structure (or directed acyclic graph), the nodes of which represent letters in a word. For example, the final lookup for the word 'bob' would look something like $ref->{'b'}{'o'}{'b'}{HASH(0x80c6bbc)} (the HASH being an end marker). Only nodes which would represent words in the trie exist, making the structure slightly smaller than a hash of the same data set.

The advantages of the trie over other data storage methods is that lookup times are O(1) WRT the size of the index. For sparse data sets, it is probably not as efficient as performing a binary search on a sorted list, and for small files, it has a lot of overhead. The main advantage (at least from my perspective) is that it provides a relatively cheap method for finding a list of words in a large, dense data set which begin with a certain string.

As of version 0.3 of this module, the term "word" in this documentation can refer to one of two things: either a refeence to an array of strings, or a scalar which is not an array ref. In the case of the former, each element of the array is treated as a "letter" of the "word". In the case of the latter, the scalar is evaluated in string context and it is split into its component letters. Return values of methods match the values of what is passed in -- that is, if you call lookup() with an array reference, the return value will be an array reference (if appropriate).

METHODS

new([\%options])

This is the constructor method for the class. You may optionally pass it a hash reference with a set of option => value pairs. Currently, the only option available is 'deepsearch' and its valid values are 'boolean', 'choose' or 'count'. The documentation on the 'lookup' method describes the effects of these different values.

add(word0 [, word1...wordN])

This method attempts to add words 0 through N to the trie. Returns, in list context, the words successfully added to the trie. In scalar context, returns the number of words successfully added. As of this release, the only reason a word would fail to be added is if it is already in the trie.

add_data(word0 => data0 [, word1 => data1...wordN => dataN])

This method works in basically the same way as add(), except instead of in addition to addings words to the trie, it also adds data associated with those words. Data values may be overwritten by adding data for words already in the trie. Its return value is the same and applies only to new words added to the trie, not data modified in existing words. See the note in the "known issues" section concerning the use of data associated with arrayref words.

remove(word0 [, word1...wordN])

This method attempts to remove words 0 through N from the trie. Returns, in list context, the words successfully removed from the trie. In scalar context, returns the number of words successfully removed. As of this release, the only reason a word would fail to be removed is if it is not already in the trie.

delete_data(word0 [, word1...wordN])

This method simply deletes data associated with words in the trie. It is the equivalent to perl's delete builtin operating on a hash. It returns the number of data items deleted in scalar context, or a list of words for which data has been removed, in list context. See the note in the "known issues" section concerning the use of data associated with arrayref words.

lookup(word0 [, suffix_length])

This method performs lookups on the trie. In list context, it returns a complete list of words in the trie which begin with word0. In scalar context, the value returned depends on the setting of the 'deepsearch' option. You can set this option while creating your Trie object, or by using the deepsearch method. Valid deepsearch values are:

boolean: Will return a true value if any word in the trie begins with word0. This setting is the fastest.

choose: Will return one word in the trie that begins with word0, or undef if nothing is found. If word0 exists in the trie exactly, it will be returned.

count: Will return a count of the words in the trie that begin with word0. This operation requires walking the entire tree, so can possibly be significantly slower than other options.

prefix: Will return the longest entry in the trie that is a prefix of word0. For example, if you had a list of file system mount points in your trie, you could use this option, pass in the full path of a file, and would be returned the name of the mount point on which the file could be found.

For reasons of backwards compatibilty, 'choose' is the default value of this option.

To get a list of all words in the trie, use lookup("") in list context.

If the "suffix_length" option is provided, the behavior is a little bit different: Instead of returning words from the trie, it will instead return suffixes that follow word0, and those suffixes will be no longer than the numerical value of the option. If the option's value is negative, suffixes of all lengths will be returned. This option only has effect if the call to lookup() is in list context, or if the 'deepsearch' parameter is set to either 'count' or 'choose'. It has no meaning for the other scalar deepsearch settings, and will be ignored in those cases.

For example, assume your trie contains 'foo', 'food' and 'fish'. lookup('f', 1) would return 'o' and 'i'. lookup('f', 3) would return 'oo', 'ood' and 'ish'. lookup('fo', -1) would return 'o' and 'od'. In scalar context, these calls would return what you'd expect, based on the value of deepsearch, with the 'count' and 'choose' options operating only over the set of suffixes. That is, The first call would return 2 with 'count', and either 'o' or 'i' with 'choose'.

Note that lookup("", -1) is the same as lookup("").

lookup_data(word0)

This method operates in essentially the same way as lookup(), with the exception that in list context it returns a hash of word => data value mappings and in scalar context, where lookup() would return a word, lookup_data() returns the data value associated with that word. In cases where the deepsearch setting is such that lookup() would return a number, lookup_data() will return the same number. See the note in the "known issues" section concerning the use of data associated with arrayref words.

deepsearch([option])

If option os specified, sets the deepsearch parameter. Option may be one of: 'boolean', 'choose', 'count', 'prefix'. Please see the documentation for the lookup method for the details of what these options mean. Returns the current (new) value of the deepsearch parameter.

Future Work

  • There are a few methods of compression that allow you same some amount of space in the trie. I have to figure out which ones are worth implemeting. I may end up making the different compression methods configurable.

    I have now made one of them the default. It's the least effective one, of course.

  • The ability to have Tree::Trie be backed by a "live" file instead of keeping data in memory. This is, unfortunately, more complicated than simply using TIE, so this will take some amount of work.

Known Issues

  • When working with data associated with keys in a trie, and your words are arrayrefs of tokens instead of simple strings, there can be problems. The data association API and internals all rely on using "words" as hash keys, and perl doesn't allow the (useful) use of references as hash keys. As a workaround, arrayrefs are stringified by joining their elements with the "JOINER" data member of the Trie object. By default, it is set to \0, but if your data contains \0 chracters, this may cause problems, and should probably be changed. I'm not making this a "real" settable parameter, like deepsearch, because I hope to get rid of it in favor of some better solution in the future. Or, I may simply give up and integrate it in later. We'll see.

AUTHOR

Copyright 2003 Avi Finkel <avi@finkel.org>

This package is free software and is provided "as is" without express or implied warranty. It may be used, redistributed and/or modified under the terms of the Perl Artistic License (see http://www.perl.com/perl/misc/Artistic.html)