The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Text::DAWG - directed acyclic word graphs

SYNOPSIS

 use Text::DAWG;

 my $dawg=Text::DAWG::->new([qw(one two three)]);

 print "one\n"  if $dawg->match("one");   # prints something

 print "four\n" if $dawg->match("four");  # prints nothing

DESCRIPTION

Text::DAWG implements implements string set recognition by way of directed acyclic word graphs.

CONSTRUCTORS

my $dawg=Text::DAWG::->new(\@words);

Creates a new DAWG matching the strings in an array.

my $dawg=Text::DAWG::->load(\*FILEHANDLE);

Creates a new DAWG from a compact representation stored in a file, or dies if anything goes wrong. The filehandle must be opened for reading and binmoded before the call.

METHODS

$dawg->match($string);

Returns a true value if the DAWG contains the string.

$dawg->store(\*FILEHANDLE);

Stores a compact representation of the DAWG in a file. The filehandle must be opened for writing and binmoded before the call.

PEDAGOGIC METHODS

$dawg->write_dot(\*FILEHANDLE);
$dawg->write_dot(\*FILEHANDLE,\%options);

Outputs a dot language representation of the DAWG (see http://www.graphviz.org/). The filehandle must be opened for writing before the call. If the DAWG contains any non-ASCII characters, you must set an appropriate encoding as well.

You can pass a reference to a hash of options for tweaking the output. The following keys are recognised:

"" (the empty string)

The value must be a hash reference specifying global attributes for the generated digraph.

"graph"

The value must be a hash reference specifying default attributes for subgraphs.

"edge"

The value must be a hash reference specifying default attributes for edges.

"node"

The value must be a hash reference specifying default attributes for nodes. Defaults to { shape => 'circle' }.

"start"

The value must be a hash reference specifying attributes for the start node.

"match"

The value must be a hash reference specifying attributes for a matching node. Defaults to { shape => 'doublecircle' }.

"startmatch"

The value must be a hash reference specifying attributes for a matching start node. Defaults to the combination of the start and match options, with match given priority.

"chars"

The value must be a hash reference with single characters for keys and hash references for values. It specifies attributes for edges representing the given characters. The default has an entry for the space character containng { label => 'SP' }, since an edge label consisting of a single space is hard to notice.

"id"

An id for the digraph itself.

"readable"

If true, certain optimisations that reduce both the size and the readability of the output are not performed.

Node ids are positive integers, with the start node always 1.

Edges have a default label equal to the character it represents. You can override this with the chars option.

my $dawg=Text::DAWG::->new(\@words,\*FILEHANDLE);
my $dawg=Text::DAWG::->new(\@words,\*FILEHANDLE,\%options);

You can pass extra arguments to the constructor to output a dot language representation of the trie that is the un-optimised version of the DAWG. Groups of trie nodes that correspond to the same DAWG node will be clustered.

TIME AND SPACE

A Text::DAWG is always slower than a built-in Perl hash.

A Text::DAWG containing a set of strings with many common prefixes and suffixes (e.g. a dictionary of English words) may use less memory than a built-in Perl hash. However, the unoptimised trie and the optimisation process itself uses many times as much memory as the final result. Loading a stored DAWG from a file uses very little extra memory.

AUTHOR

Bo Lindbergh <blgl@stacken.kth.se>

COPYRIGHT AND LICENSE

Copyright 2011, Bo Lindbergh

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.9 or, at your option, any later version of Perl 5 you may have available.