Laurent Dami

NAME

Search::Tokenizer - Decompose a string into tokens (words)

SYNOPSIS

  # generic usage
  use Search::Tokenizer;
  my $tokenizer = Search::Tokenizer->new(
     regex     => qr/.../,
     filter    => sub { ... },
     stopwords => {word1 => 1, word2 => 1, ... },
     lower     => 1,
   );
  my $iterator = $tokenizer->($string);
  while (my ($term, $len, $start, $end, $index) = $iterator->()) {
    ...
  }

  # usage for DBD::SQLite (with builtin tokenizers: word, word_locale,
  #   word_unicode, unaccent)
  use Search::Tokenizer;
  $dbh->do("CREATE VIRTUAL TABLE t "
          ."  USING fts3(tokenize=perl 'Search::Tokenizer::unaccent')");

DESCRIPTION

This module builds an iterator function that will progressively extract terms from a given input string. Terms are defined by a regular expression (for example \w+). Term matching relies on the builtin "global match" operator of Perl (the 'g' flag), and therefore is quite efficient.
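The mechanism can be illustrated in plain Perl. The following is only a sketch of how a "global match" walks through a string, not the module's actual source; the sample string and the @found array are chosen for illustration.

```perl
use strict;
use warnings;

my $string = "Once upon a time";
my @found;

# In scalar context, /g resumes where the previous match ended,
# so each pass through the loop picks up the next term.
while ($string =~ /\w+/g) {
    # $-[0] and $+[0] hold the start and end offsets of the last match
    push @found, [ substr($string, $-[0], $+[0] - $-[0]), $-[0], $+[0] ];
}

# one line per term: text, start offset, end offset
printf "%-5s %2d %2d\n", @$_ for @found;
```

Perl keeps track of the current match position inside the string itself, so no explicit bookkeeping is needed between matches.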

Before being returned to the caller, terms may be filtered by an auxiliary function, for performing tasks such as stemming or stopword elimination.

A tokenizer returned from the new method is a code reference, not a regular Perl object. To use the tokenizer, call it with the string to parse: this returns another code reference, which works as an iterator. Each call to the iterator returns the next term from the string, until the string is exhausted.
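The two levels of code references can be sketched with closures. The helper make_tokenizer below is hypothetical and much simpler than the real module (no filtering, no offsets); it only shows the calling convention.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only
sub make_tokenizer {
    my ($regex) = @_;
    return sub {                          # the tokenizer
        my ($string) = @_;
        return sub {                      # the iterator
            # /c keeps the match position after a failed match,
            # so an exhausted iterator stays exhausted
            return unless $string =~ /$regex/gc;
            return substr($string, $-[0], $+[0] - $-[0]);
        };
    };
}

my $tokenizer = make_tokenizer(qr/\w+/);
my $iterator  = $tokenizer->("Once upon a time");

my @terms;
while (defined(my $term = $iterator->())) {
    push @terms, $term;
}
print join("|", @terms), "\n";    # prints "Once|upon|a|time"
```

Each call to the tokenizer gets its own copy of the string, so several iterators can be active at the same time without interfering.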

This API was explicitly designed for integrating Perl with the FTS3 fulltext search engine in DBD::SQLite; however, the API is general enough to be useful for other purposes, which is why it is published in its own, separate distribution.

METHODS

Creating a tokenizer

  my $tokenizer = Search::Tokenizer->new($regex);
  my $tokenizer = Search::Tokenizer->new(%options);

Builds a new tokenizer, returned as a code reference. The first syntax, with a single Regexp argument, is a shorthand for ->new(regex => $regex). The second syntax, with named arguments, accepts the following options:

regex => $regex

$regex is a compiled regular expression that specifies how to match a term; that regular expression should not match the empty string (otherwise the tokenizer would enter an infinite loop). The default is qr/\w+/. Here are some examples of more advanced regexes:

  # take 'locale' into account
  $regex = do {use locale; qr/\w+/}; 

  # rely on Unicode's definition of "word characters"
  $regex = qr/\p{Word}+/;

  # words like "don't", "it's" are treated as a single term
  $regex = qr/\w+(?:'\w+)?/;

  # same thing but also with internal hyphens like "fox-trot"
  $regex = qr/\w+(?:[-']\w+)?/;
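To see how these choices affect the result, the regexes can be compared on one sample string. This is a sketch in plain Perl that does not go through the tokenizer; the sample text and the names in %regex_for are chosen for illustration.

```perl
use strict;
use warnings;

my $text = "don't dance the fox-trot";

my %regex_for = (
    plain      => qr/\w+/,
    apostrophe => qr/\w+(?:'\w+)?/,
    hyphen     => qr/\w+(?:[-']\w+)?/,
);

my %terms_for;
for my $name (sort keys %regex_for) {
    # in list context, /g returns all matches at once
    $terms_for{$name} = [ $text =~ /$regex_for{$name}/g ];
    print "${name}: @{ $terms_for{$name} }\n";
}
```

The plain regex splits "don't" into "don" and "t", the second one keeps "don't" whole but still splits "fox-trot", and the third one keeps both whole.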

lower => $bool

If true, the term returned by the $regex is converted to lowercase (or more precisely: it is "case-folded" through the fc function of Unicode::CaseFold). This option is activated by default.
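The difference between lowercasing and case folding shows up with characters such as the German "ß". The sketch below uses the core fc function (Perl 5.16 and later) on the assumption that it folds this example the same way as the fc of Unicode::CaseFold.

```perl
use strict;
use warnings;
use utf8;             # this source file contains UTF-8 literals
use feature 'fc';     # core fc(), assumed here as a stand-in for Unicode::CaseFold

# lc keeps "ß" as it is, so the two spellings still differ
my $same_lc = lc("Straße") eq lc("STRASSE") ? 1 : 0;

# fc folds "ß" to "ss", so the two spellings compare equal
my $same_fc = fc("Straße") eq fc("STRASSE") ? 1 : 0;

print "lc: $same_lc, fc: $same_fc\n";    # prints "lc: 0, fc: 1"
```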

filter => $filter

$filter is a reference to a function that may modify or cancel a term before it is returned to the caller. The filter takes one single argument (the term) and returns a scalar (the modified term). If the value returned from the filter is empty, then this term is canceled.
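A filter could look like the following sketch. The rules it applies (dropping very short terms, stripping a final "s") are hypothetical and only stand in for real stemming.

```perl
use strict;
use warnings;

# Hypothetical filter: cancel terms shorter than 3 characters and
# strip a trailing "s" from the others.
my $filter = sub {
    my ($term) = @_;
    return '' if length($term) < 3;    # empty value: the term is canceled
    $term =~ s/s\z//;
    return $term;
};

# try the filter on a few terms, outside of any tokenizer
my @results = map { $filter->($_) } qw(cats of dog);
print join("|", @results), "\n";       # prints "cat||dog"
```

Such a function would then be passed to new() as filter => $filter.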

filter_in_place => $filter

Like filter, except that the filtering function directly modifies the term in its $_[0] argument instead of returning a new term. This is useful for example when building a filter from Lingua::Stem::Snowball or from Text::Transliterator::Unaccent.
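The in-place convention relies on the fact that @_ aliases the caller's variables. The sketch below uses a small hand-written tr/// table as a hypothetical substitute for Text::Transliterator::Unaccent.

```perl
use strict;
use warnings;
use utf8;    # this source file contains UTF-8 literals

# Hypothetical in-place filter: changing $_[0] changes the caller's
# variable, so there is nothing to return.
my $unaccent_in_place = sub {
    $_[0] =~ tr/àâäéèêëîïôöùûüç/aaaeeeeiioouuuc/;
    return;
};

my $term = "crème";
$unaccent_in_place->($term);
print "$term\n";    # prints "creme"
```

Such a function would then be passed to new() as filter_in_place => $unaccent_in_place.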

stopwords => $hashref

The keys in $hashref are terms to cancel (usually common terms for which indexing would consume lots of resources with little added value). Values in the hash should evaluate to true. Lists of stopwords for various languages may be found in the Lingua::StopWords module. Stopwords filtering is applied after the filter or filter_in_place function (if any).
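When no ready-made list is at hand, the hash can be built from a plain list of words. The word list below is hypothetical, and the keys are lowercased on the assumption that terms have already been case-folded (lower => 1) by the time they are compared to the stopwords.

```perl
use strict;
use warnings;

# Hypothetical word list, for illustration only
my @words = qw(The a an of);

# each key maps to a true value
my %stopwords = map { lc($_) => 1 } @words;

print join(",", sort keys %stopwords), "\n";    # prints "a,an,of,the"
```

The hash would then be passed to new() as stopwords => \%stopwords.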

Whenever a term is canceled through the filter or stopwords options, the tokenizer does not return that term to the client, but nevertheless remembers the canceled position. For example, when tokenizing "Once upon a time" with

 $tokenizer = Search::Tokenizer->new(
    stopwords => Lingua::StopWords::getStopWords('en')
 );

we get the term sequence

  ("upon", 4,  5,  9, 1)
  ("time", 4, 12, 16, 3)

where terms "once" and "a" in positions 0 and 2 have been canceled.
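Because positions are remembered, a client can tell whether two returned terms were adjacent in the source text. The sketch below works directly on the two tuples listed above; the variable names are chosen for illustration.

```perl
use strict;
use warnings;

# Tuples as listed above: ($term, $len, $start, $end, $index)
my @tuples = (
    [ "upon", 4,  5,  9, 1 ],
    [ "time", 4, 12, 16, 3 ],
);

# Two terms are adjacent in the source text only if their indexes differ
# by 1; a larger difference means canceled terms sit between them.
my $gap = $tuples[1][4] - $tuples[0][4] - 1;
print "terms canceled between 'upon' and 'time': $gap\n";    # prints 1
```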

Creating an iterator

  my $iterator = $tokenizer->($text);

  # loop over terms ..
  # (defined() prevents a false-looking term from ending the loop early)
  while (defined(my $term = $iterator->())) {
    work_with_term($term);
  }

  # .. or loop over terms with detailed information
  while (my @term_details = $iterator->()) { 
    work_with_details(@term_details); # ($term, $len, $start, $end, $index) 
  }

The tokenizer takes one string argument and returns an iterator. The iterator takes no argument; each call returns the next term from the string, until the string is exhausted, at which point the iterator returns an empty result.

If called in scalar context, the iterator returns just the term as a string; if called in list context, it returns a list composed of:

$term

the term (after filtering)

$len

the term length

$start

the starting offset in the string where this term was found

$end

the end offset (where the search for the next term will start)

$index

the index of this term within the string, starting at 0

Length and start/end offsets are computed in characters, not in bytes. Note for SQLite users: the C layer in SQLite needs byte values, but the conversion is taken care of automatically by the C implementation in DBD::SQLite.
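The two kinds of offsets differ as soon as the text contains non-ASCII characters. The sketch below computes a byte offset by hand with the core Encode module; this is only an illustration of the difference, since DBD::SQLite does the conversion by itself. The sample text is chosen for illustration.

```perl
use strict;
use warnings;
use utf8;                 # this source file contains UTF-8 literals
use Encode qw(encode);

my $text  = "café crème";
my $start = 5;            # character offset of "crème"

# encode the prefix that precedes the term, then count its bytes
my $byte_start = length(encode('UTF-8', substr($text, 0, $start)));

print "char offset $start, byte offset $byte_start\n";    # byte offset is 6
```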

Beware that ($end - $start) is the length of the original term extracted by the regex, while $len is the length of the final $term, after filtering; both may differ, especially if stemming is being applied.
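A typical use of the offsets is to mark the matched words in the original text. In the sketch below the tuples are hypothetical: they show what a stemming filter might return for "Running shoes", where $len (3 and 4) no longer equals $end - $start (7 and 5).

```perl
use strict;
use warnings;

my $text = "Running shoes";

# Hypothetical tuples: ($term, $len, $start, $end, $index)
my @tuples = (
    [ "run",  3, 0,  7, 0 ],
    [ "shoe", 4, 8, 13, 1 ],
);

# Work from the last term to the first, so that earlier offsets stay
# valid while the string grows.
my $highlighted = $text;
for my $tuple (reverse @tuples) {
    my ($term, $len, $start, $end) = @$tuple;
    my $original = substr($text, $start, $end - $start);
    substr($highlighted, $start, $end - $start) = "[$original]";
}

print "$highlighted\n";    # prints "[Running] [shoes]"
```

The original words are recovered through $start and $end, not through $term or $len.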

BUILTIN TOKENIZERS

For convenience, the following tokenizers are builtin:

Search::Tokenizer::word

Terms are "words" according to Perl's notion of \w+.

Search::Tokenizer::word_locale

Terms are "words" according to Perl's notion of \w+ under use locale.

Search::Tokenizer::word_unicode

Terms are "words" according to Unicode's notion of \p{Word}+.

Search::Tokenizer::unaccent

Like Search::Tokenizer::word_unicode, but filtered through Text::Transliterator::Unaccent to replace all accented characters by their base character.

These builtin tokenizers accept the same arguments as new(). For example:

  use Search::Tokenizer;
  my $tokenizer = Search::Tokenizer::unaccent(lower => 0, stopwords => ...);

SEE ALSO

AUTHOR

Laurent Dami, <lau.....da..@justice.ge.ch>

BUGS

Please report any bugs or feature requests to bug-search-tokenizer at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Search-Tokenizer. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Search::Tokenizer


LICENSE AND COPYRIGHT

Copyright 2010 Laurent Dami.

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.



