NAME

Lingua::Stem::Snowball - Perl interface to Snowball stemmers.

SYNOPSIS

  use Lingua::Stem::Snowball;

  my @lang = stemmers();

OO interface:

  my $lang = 'en';
  my $dict = Lingua::Stem::Snowball->new(lang => $lang);
  # Test if $lang is correct
  die $@ if ($@);
  my $locale = 'C'; 

  my $dict = Lingua::Stem::Snowball->new(lang => $lang, locale => $locale);
  my $lemm = $dict->stem($word);
  my $lemm = $dict->stem($word, \$is_stemmed);

  my $dict = Lingua::Stem::Snowball->new();
  $dict->lang($lang);
  $dict->locale($locale);
  my $lemm = $dict->stem($word);
  my @lemm = $dict->stem(\@words);

Plain interface:

  my $lemm = stem($lang, $word);
  my $lemm = stem($lang, $word, $locale);
  my $lemm = stem($lang, $word, $locale, \$is_stemmed);

DESCRIPTION

This module provides a unified Perl interface to the Snowball stemmers (http://snowball.tartarus.org) and supports the languages listed below. It is written in C for performance and offers both an OO and a plain (functional) interface.

This module was developed to give the OpenFTS project, a full-text search engine (http://openfts.sourceforge.net), generic access to stemming algorithms.

The module is very similar to Lingua::Stem, but Lingua::Stem is written in pure Perl, whereas Lingua::Stem::Snowball is an XS binding to the Snowball stemmers.

The following stemmers are available (as of Lingua::Stem 0.70). In the tables, L:S stands for Lingua::Stem and L:S:S for Lingua::Stem::Snowball:

  |------------------------------|
  | Language     | L:S   | L:S:S | 
  |------------------------------|
  | English      | y     | y     | 
  | French       | y     | y     | 
  | Spanish      | n     | y     | 
  | Portuguese   | y     | y     | 
  | Italian      | y     | y     | 
  | German       | y     | y     | 
  | Dutch        | n     | y     | 
  | Swedish      | y     | y     | 
  | Norwegian    | y     | y     | 
  | Danish       | y     | y     | 
  | Russian      | n     | y     | 
  | Finnish      | n     | y     | 
  | Galician     | y     | n     | 
  |------------------------------|

Here is a small benchmark using the example files from the Snowball distribution (with no cache):

  |---------------------------------------------------|
  | Language | Unique |          Time (s)             |
  |          | words  | L:S:S | L:S:S | L:S   | L:S   |
  |          |        | @     | $     | @     | $     |
  |---------------------------------------------------|
  | DA       | 23829  | 0.5   | 1.1   | 7.3   | 14.2  | 
  | DE       | 35033  | 0.9   | 1.9   | 64.3  | 73.5  | 
  | EN       | 30428  | 0.7   | 1.5   | 2.5   | 8.8   | 
  | FR       | 20403  | 0.6   | 1.1   | 182.7 | 188.0 | 
  | IT       | 35494  | 1.0   | 2.0   | 345.6 | 350.2 | 
  | NO       | 20628  | 0.4   | 1.0   | 14.3  | 20.6  | 
  | PT       | 32016  | 0.8   | 1.7   | 405.6 | 414.8 | 
  | SV       | 30623  | 0.0   | 0.5   | 15.9  | 25.6  | 
  |---------------------------------------------------|

Here is the same benchmark with all unique words found in the Bible:

  |---------------------------------------------------|
  | EN       | 12718  | 0.3   | 0.7   | 1.0   | 3.6   | 
  |---------------------------------------------------|

METHODS

$dict = Lingua::Stem::Snowball->new

Creates a new instance of the stemmer.

The constructor takes hash-style parameters. The following parameters are recognized:

lang: language (ISO code).

locale: locale.

my $stemmed = $dict->stem($word)

Returns the stemmed word for $word.

my @stemmed = $dict->stem(\@words)

Returns an array of the stemmed words contained in @words.
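
As a minimal sketch of the list form, the following tokenizes a sentence and stems all tokens in a single call. The sample text and the tokenizing regex are assumptions made for illustration; only new() and stem() come from this module.

```perl
use strict;
use warnings;
use Lingua::Stem::Snowball;

# Sample input (an assumption); any text would do.
my $text = 'Stemming reduces related words to a common stem';

# Naive tokenizer (an assumption): lowercase runs of letters and apostrophes.
my @words = map { lc } $text =~ /([A-Za-z']+)/g;

my $dict = Lingua::Stem::Snowball->new(lang => 'en');

# One call stems the whole list; each stem lines up with its input word.
my @stems = $dict->stem(\@words);

printf "%-10s -> %s\n", $words[$_], $stems[$_] for 0 .. $#words;
```

Passing the whole list at once avoids one method call per word, which matches the faster @ columns in the benchmark above.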

$dict->lang([$lang])

Accessor for the lang parameter. If there is no stemmer for $lang, the language is not changed.

$dict->locale([$locale])

Accessor for the locale parameter.

stemmers()

Returns a list of all languages for which a stemmer is available.
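
One possible use of stemmers() is to check a language code before constructing a stemmer for it. The sketch below assumes that the codes are returned in lowercase, and it calls the function by its fully qualified name so that it does not depend on what the module exports. The helper has_stemmer is hypothetical and not part of the module.

```perl
use strict;
use warnings;
use Lingua::Stem::Snowball;

# Hypothetical helper: true if a stemmer exists for the given language code.
# Assumes stemmers() returns lowercase codes such as 'en'.
sub has_stemmer {
    my ($lang) = @_;
    my %available = map { $_ => 1 } Lingua::Stem::Snowball::stemmers();
    return exists $available{ lc $lang };
}

for my $lang (qw(en fr xx)) {
    if ( has_stemmer($lang) ) {
        my $dict = Lingua::Stem::Snowball->new(lang => $lang);
        print "$lang: stemmer ready\n";
    }
    else {
        print "$lang: no stemmer available\n";
    }
}
```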

$dict->strip_apostrophes([1|0])

By default, the stemmer will not strip apostrophes for you. So, if you make the following call:

  my @words = ('The', 'Ranger\'s', 'Digest');
  my @stemmed = $dict->stem(\@words);

The result might not be what you expect, for example when the words come from calling split(' ') on a user's search query.

Stripping 's in Perl can be a little expensive, so you can let the stemmer do it in C:

  my @words = ('The', 'Ranger\'s', 'Digest');
  $dict->strip_apostrophes(1);
  my @stemmed = $dict->stem(\@words);

This method strips 's (English) and l', d', ... (French).
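
For comparison, a rough pure-Perl approximation of this stripping could look like the sketch below. It is an illustration only: the helper name strip_apostrophes_pp and the exact set of French elided prefixes are assumptions, and the module's C code may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical pure-Perl approximation of the stripping step.
sub strip_apostrophes_pp {
    my ($word) = @_;
    $word =~ s/'s\z//i;             # English possessive: Ranger's -> Ranger
    $word =~ s/\A[ldjmnstc]'//i;    # French elision: l'arbre -> arbre (assumed prefix set)
    return $word;
}

my @words    = ( 'The', "Ranger's", 'Digest', "l'arbre", "d'accord" );
my @stripped = map { strip_apostrophes_pp($_) } @words;

print join( ' ', @stripped ), "\n";    # prints: The Ranger Digest arbre accord
```

Running two regex substitutions per word in Perl is the per-word cost that strip_apostrophes(1) moves into the C layer.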

REQUESTS & BUGS

Please report any requests, suggestions or bugs via the RT bug-tracking system at http://rt.cpan.org/ or by email to bug-Lingua-Stem-Snowball@rt.cpan.org.

http://rt.cpan.org/NoAuth/Bugs.html?Dist=Lingua-Stem-Snowball is the RT queue for Lingua::Stem::Snowball. Please check to see if your bug has already been reported.

COPYRIGHT

Copyright 2004-2005

Currently maintained by Fabien Potencier, fabpot@cpan.org.

Original authors: Oleg Bartunov, oleg@sai.msu.su, and Teodor Sigaev, teodor@stack.net.

This software may be freely copied and distributed under the same terms and conditions as Perl.

Snowball files and stemmers are covered by the BSD license.

SEE ALSO

http://snowball.tartarus.org, Lingua::Stem